Bulldozer x264 optimisations


Vapor
10-04-2011, 03:49 PM
Ladies and Gentlemen, I give you, Pure Awesome
http://www.xtremesystems.org/forums/showthread.php?265710-AMD-Zambezi-news-info-fans-!&p=4964335&viewfull=1#post4964335

wuttz
10-04-2011, 04:02 PM
it's from dresdenboy.

2011-10-04 04:46:38 < Dark_Shikari> C, with mode analysis shortcuts: 253 cycles
2011-10-04 04:46:45 < Dark_Shikari> My crappy, badly optimized XOP asm: 93 cycles
2011-10-04 04:46:56 < Dark_Shikari> This is kinda awesome
2011-10-04 04:49:35 < Dark_Shikari> Oh, and old without shortcuts: 379 cycles
2011-10-04 04:49:45 < Dark_Shikari> My asm is 4 times faster than the existing... wait where have we seen this before? XD
2011-10-04 04:49:57 < Dark_Shikari> It's just like SAD_4x4_x9 all over again!
2011-10-04 04:50:10 < JEEB> :D
2011-10-04 04:50:18 < JEEB> that sounds pretty awesome
2011-10-04 04:50:21 < Dark_Shikari> Except this time I'm still wondering how best to do it without vpperm
2011-10-04 04:50:33 < Dark_Shikari> Thanks AMD, for bringing back the best instruction ever after 15+ years of hiatus.

saw it earlier here (http://www.amdzone.com/phpbb3/viewtopic.php?f=532&t=138556&p=211821#p211816).

The "crappy, badly optimized" XOP variant is already about 4 times faster than the existing code ("asm is 4 times faster than the existing"): 93 cycles versus 379. Nice!
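For anyone who wants to check the ratios, here is a quick sketch in Python. It is only arithmetic on the three cycle counts quoted in the IRC log above; the dictionary keys and the function name are mine, not anything from x264.

```python
# Cycle counts quoted in the IRC log above.
cycles = {
    "C, no shortcuts": 379,
    "C, with shortcuts": 253,
    "XOP asm": 93,
}


def speedup(baseline, optimized):
    """How many times faster `optimized` is than `baseline`."""
    return cycles[baseline] / cycles[optimized]


for name in ("C, no shortcuts", "C, with shortcuts"):
    ratio = speedup(name, "XOP asm")
    print(f"{name} -> XOP asm: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```

So "4 times faster" is measured against the old C code without shortcuts; against the C code with mode analysis shortcuts the XOP asm is a bit under 3 times faster.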

Poisoner
10-04-2011, 04:11 PM
Man I just got the weirdest boner ever.

wuttz
10-04-2011, 04:12 PM
XOP wipes the floor with IPC.

Vapor
10-04-2011, 04:17 PM
Going by this TomsHardware Article: http://www.tomshardware.com/reviews/video-transcoding-amd-app-nvidia-cuda-intel-quicksync,2839-7.html

(Software encode vs Quicksync numbers)

This optimisation has a serious chance of pushing Bulldozer's x264 encode speeds past Quicksync. In other words, the best encoder in the business becomes faster than the fastest fixed-function encoder in the business. That's ridiculous!!

wuttz
10-04-2011, 04:21 PM
wait, pardon my last comment.
it should read;

XOP wipes the floor with QuickSync.

Vapor
10-04-2011, 04:31 PM
Err, my bad, my last post was completely off the wall (must be a bit too excited!!). I haven't found a proper example, because different files/bitrates/formats are used between the Quicksync and x264 benches, but at least the following are normalised by frames per second.

http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/9
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/16

Also please keep in mind, x264 is just encoding as opposed to transcoding.

Vapor
10-04-2011, 05:05 PM
and for added context
http://www.behardware.com/art/imprimer/828/

hyc
10-04-2011, 06:42 PM
Err, my bad, my last post was completely off the wall (must be a bit too excited!!). I haven't found a proper example, because different files/bitrates/formats are used between the Quicksync and x264 benches, but at least the following are normalised by frames per second.

http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/9
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested/16

Also please keep in mind, x264 is just encoding as opposed to transcoding.

x264 does decoding too. For transcoding use ffmpeg (which uses libx264). Dark Shikari is also on the ffmpeg development team. (So am I.)

Copper
10-04-2011, 08:15 PM
x264 does decoding too. For transcoding use ffmpeg (which uses libx264). Dark Shikari is also on the ffmpeg development team. (So am I.)

I'm sorry to do this in public ***hug***

Thank you. ffmpeg I loves.

wuttz
10-04-2011, 09:09 PM
(So am I.)

youtube sends you thanks (http://multimedia.cx/eggs/googles-youtube-uses-ffmpeg/). for all the money they leech off the opensource community, they should buy you;

1. a new tesla roadster;
2. a decent 5-bed/4-bath in Malibu, CA;
3. and free AT&T DSL internet access for as long as you live. :D:p:p

BioSehnsucht
10-04-2011, 09:21 PM
I think you meant free Verizon FiOS. AT&T DSL ? I wouldn't wish that upon my worst enemies...

DarthShader
10-04-2011, 10:14 PM
Ladies and Gentlemen, I give you, Pure Awesome
http://www.xtremesystems.org/forums/showthread.php?265710-AMD-Zambezi-news-info-fans-!&p=4964335&viewfull=1#post4964335
waitwaitwaitwaitwaitwait....

BD has move elimination???????? :eek: :eek: :eek: :eek: :eek:

(first quote from that post)

Vapor
10-04-2011, 10:24 PM
x264 does decoding too. For transcoding use ffmpeg (which uses libx264). Dark Shikari is also on the ffmpeg development team. (So am I.)

Man, can't believe I missed that. Granted, it's no surprise, since I just close my eyes and throw VLC on any machine I'm running. I'm not going to get into ripping and encoding until I put a serious rig together, but when I do, x264 is the only thing I have in mind after seeing the quality.

Sooooo.... how much sway do you have over everyone at ffmpeg & Doom9? :D Do you think Shikari's "results" are enough to bring everyone around to the XOP bandwagon? I'd be interested to see what Madshi can come up with, but nowhere near as interested as I am in the prospect of a full ffmpeg lib stack optimisation for Bulldozer. I'm hoping the leap in performance is enough to keep all you guys (as coders) seriously interested.

BioSehnsucht
10-05-2011, 12:44 AM
waitwaitwaitwaitwaitwait....

BD has move elimination???????? :eek: :eek: :eek: :eek: :eek:

(first quote from that post)

That's... awesome.

Seriously, excessive register moves are one of the two major problems with x86 architecture (along with not having enough registers, which is one of the main reasons you have so many moves to begin with ... ).

DarthShader
10-05-2011, 09:22 AM
And move elimination comes to Ivy Bridge too: http://www.anandtech.com/show/4830/intels-ivy-bridge-architecture-exposed/2

But BD would be first there? :eek: Having only a two-wide pipeline + AGLUs would make more sense now!

CRoland
10-05-2011, 09:33 AM
I kind of had the idea that VPPERM would be massively good news for encoders. Keep in mind that these 400% improvements and the like are for some low-level primitives. I wouldn't expect quite that kind of improvement for the overall process.
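To put rough numbers on that, here is a small sketch using Amdahl's law. The 4x figure comes from the IRC log earlier in the thread; the runtime fractions are made up purely for illustration and are not measurements of x264.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only `fraction` of the
    runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical shares of total encode time spent in the sped-up primitive.
for fraction in (0.05, 0.20, 0.50):
    print(f"{fraction:.0%} of runtime sped up 4x -> "
          f"{overall_speedup(fraction, 4.0):.2f}x overall")
```

Even if the primitive accounted for half of the encode time, a 4x gain there would only give about 1.6x overall, so the end-to-end benefit depends almost entirely on how much time the encoder actually spends in that routine.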

redpriest
10-05-2011, 03:58 PM
I thought that was public? ^_^

CRoland
10-05-2011, 05:09 PM
I thought that was public? ^_^

Move elimination? Yeah. (http://forums.anandtech.com/showthread.php?p=31371640&highlight=#post31371640)

hyc
10-05-2011, 05:40 PM
Sooooo.... how much sway do you have over everyone at ffmpeg & Doom9? :D Do you think Shikari's "results" are enough to bring everyone around to the XOP bandwagon? I'd be interested to see what Madshi can come up with, but nowhere near as interested as I am in the prospect of a full ffmpeg lib stack optimisation for Bulldozer. I'm hoping the leap in performance is enough to keep all you guys (as coders) seriously interested.

I'm just one among many. I seem to be pretty effective at annoying folks from time to time. ;)

ffmpeg devs are pretty agnostic - anything that improves performance is always good.

DarthShader
10-05-2011, 08:05 PM
Move elimination? Yeah. (http://forums.anandtech.com/showthread.php?p=31371640&highlight=#post31371640)
So it's only for the FPU?

Vapor
10-05-2011, 08:17 PM
I'm just one among many. I seem to be pretty effective at annoying folks from time to time. ;)

ffmpeg devs are pretty agnostic - anything that improves performance is always good.

I expected as much. Tis the way of the open source after all. :) I've sort of expected you guys to be slow on the uptake, it seems like everyone is running a core2duo and a 9600gt or some such. DarkShikari got his system shipped to him, so it's sort of a weird waiting game for developers to upgrade their systems. Since you hippies don't exactly have bank, I feel like it's going to be a while :D

As for annoying people, PM me if you want a masterclass, you'll get a certificate and everything. Ask Copper, she'll vouch for the quality of my teachings :p

Melkhior
10-06-2011, 10:13 AM
I kind of had the idea that VPPERM would be massively good news for encoders.

Which is why AltiVec has had vperm since day one (1999 IIRC).

It only took AMD 12 years to catch up. I wonder how long it will take Intel (and no, SSSE3's PSHUFB doesn't count, it's missing one operand).

In fact, VPPERM has a nice bonus compared to vperm: the use of the extra 3 bits for some extra transformations. But then, I just wonder how they will extend that to the 256-bit AVX registers, now that all 8 bits in each byte are already significant...
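To make the "missing one operand" point concrete, here is a scalar Python model of the two kinds of byte permute. It is a sketch, not real intrinsics: the function names are hypothetical, the list of transformations selected by the top 3 bits is written from memory of AMD's XOP documentation and should be treated as an assumption, and which hardware operand ends up in the low or high half of the table is not modelled.

```python
def pshufb(src, sel):
    """Model of SSSE3 PSHUFB: a single 16-byte source.
    Selector bit 7 zeroes the output byte; the low 4 bits index into src."""
    assert len(src) == 16 and len(sel) == 16
    return bytes(0 if s & 0x80 else src[s & 0x0F] for s in sel)


def _bitrev(b):
    """Reverse the bit order of one byte."""
    return int(f"{b:08b}"[::-1], 2)


# Transformation picked by selector bits 7:5 (assumed, see the note above).
_OPS = [
    lambda b: b,                           # 0: byte unchanged
    lambda b: b ^ 0xFF,                    # 1: inverted
    lambda b: _bitrev(b),                  # 2: bit-reversed
    lambda b: _bitrev(b) ^ 0xFF,           # 3: bit-reversed and inverted
    lambda b: 0x00,                        # 4: all zeros
    lambda b: 0xFF,                        # 5: all ones
    lambda b: 0xFF if b & 0x80 else 0x00,  # 6: top bit replicated
    lambda b: 0x00 if b & 0x80 else 0xFF,  # 7: inverted top bit replicated
]


def two_source_permute(low, high, sel):
    """Model of a vperm/VPPERM-style permute: the low 5 selector bits index
    into the 32 bytes of low + high, the top 3 bits pick a transformation."""
    assert len(low) == len(high) == len(sel) == 16
    table = low + high
    return bytes(_OPS[s >> 5](table[s & 0x1F]) for s in sel)


a = bytes(range(0x00, 0x10))
b = bytes(range(0x10, 0x20))
# Selector that takes a[0], b[0], a[1], b[1], ... in a single permute.
interleave = bytes(i // 2 + 16 * (i % 2) for i in range(16))
print(two_source_permute(a, b, interleave).hex())
```

With only one source operand, the same interleave needs more than one shuffle plus a blend or OR to combine the halves, which is the gap being pointed at here.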

Poisoner
10-06-2011, 02:17 PM
Vapor, no way Copper is a chick. That would change everything.

Vapor
10-07-2011, 01:24 AM
As much as each and every bone in my body is aching to ask for a point-by-point breakdown of what exactly it would change, I think it's best we set a good example to other forum posters by keeping on topic. Besides, all the forum mods are chicks....ahem, ladies :p . There's Grandma with a Glock, an Axe, a Golock; I believe it's Semiaccurate's signature, nuanced, Freudian approach to forum moderation that works on many different mental levels.

Back on topic now.

I just realised that Handbrake is pretty much updated annually. This puts a pretty big damper on the proceedings, since it's probably the most popular open source transcoding benchmark for the x264 codec. We're probably not going to see a favorable Handbrake test till Piledriver at best.

CRoland
10-07-2011, 02:11 AM
I just realised that Handbrake is pretty much updated annually. This puts a pretty big damper on the proceedings, since it's probably the most popular open source transcoding benchmark for the x264 codec. We're probably not going to see a favorable Handbrake test till Piledriver at best.

It probably bundles its own version of x264? This is precisely the problem that shared libraries are supposed to solve.

kac
10-07-2011, 03:13 AM
It probably bundles its own version of x264? This is precisely the problem that shared libraries are supposed to solve.

I thought that on Linux Handbrake relies on gstreamer for its H264 libraries. Handbrake itself is only updated about once a month, but the gstreamer libraries are updated far more often than that.

If you compiled your own version I don't see why you wouldn't see a benefit.

CRoland
10-07-2011, 05:12 AM
I thought that on Linux Handbrake relies on gstreamer for its H264 libraries.

Doesn't seem like it:

$ dpkg --status handbrake-cli | grep Depends
Depends: libbz2-1.0, libc6 (>= 2.7), libgcc1 (>= 1:4.1.1), libstdc++6 (>= 4.5), zlib1g (>= 1:1.2.3.3.dfsg)
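The same check can be scripted. The sketch below just parses the Depends line pasted above and looks for anything that sounds like x264 or gstreamer; the function name and the substring test are my own choices for illustration.

```python
import re

# The Depends line from the dpkg output above.
depends_line = ("Depends: libbz2-1.0, libc6 (>= 2.7), libgcc1 (>= 1:4.1.1), "
                "libstdc++6 (>= 4.5), zlib1g (>= 1:1.2.3.3.dfsg)")


def package_names(line):
    """Extract bare package names from a dpkg 'Depends:' line."""
    _, _, deps = line.partition(":")
    names = []
    for alternatives in deps.split(","):
        for dep in alternatives.split("|"):
            # Drop the optional version constraint in parentheses.
            name = re.sub(r"\(.*?\)", "", dep).strip()
            if name:
                names.append(name)
    return names


names = package_names(depends_line)
codec_libs = [n for n in names if "x264" in n or "gstreamer" in n]
print(names)
print("codec libraries in Depends:", codec_libs or "none")
```

An empty result is consistent with x264 being compiled into the binary rather than pulled in as a shared library, although a Depends line on its own does not prove that.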

kac
10-07-2011, 06:29 AM
Doesn't seem like it:

$ dpkg --status handbrake-cli | grep Depends
Depends: libbz2-1.0, libc6 (>= 2.7), libgcc1 (>= 1:4.1.1), libstdc++6 (>= 4.5), zlib1g (>= 1:1.2.3.3.dfsg)

Ah figured out it's being used for the preview window within the gui.

Kedas
10-09-2011, 07:54 AM
To keep the thread up to date XOP is out.

2011-10-05 01:56:49 < pengvado> but avx is still faster than xop overall
-snip-
2011-10-05 02:02:41 < Dark_Shikari> Okay, so I'll dump the xop parts, I think

CRoland
10-09-2011, 10:00 AM
To keep the thread up to date XOP is out.

Is the context available somewhere? Is that about a specific routine or the whole thing?

muziqaz
10-09-2011, 10:00 AM
But FMA4 is still in?

By the way, x264 has its own benchmark app which is used from time to time by some websites ;)

Exophase
10-09-2011, 11:10 AM
So it's only for the FPU?

Yes. You can see it in the optimization guide: movapd/movaps have zero latency while movq et al. have the usual 2-cycle latency for integer operations. And of course the integer mov reg, reg has a single-cycle latency, not zero, so the integer execution layout was not done with move elimination in mind.

Move elimination is a bigger deal for integer because AVX can already be used to perform three-operand arithmetic with implicit moves.

On the topic of the 400% vpperm speedup, I see it as a major red herring and something people may be getting too excited over. Of course when you compare against a C version of something that's very unlikely to be vectorized, like permutes, you'll get a huge speedup. The problem is that the rest of the critical parts of the codebase aren't C, and the relevant comparison is going to be against aggressively optimized ASM with the next best SIMD ISA, not C. I don't know the context of the code, but a lot of what vpperm does can be emulated in far fewer SSE4.x instructions than scalar code.

DarthShader
10-09-2011, 12:44 PM
Yes. You can see it in the optimization guide: movapd/movaps have zero latency while movq et al. have the usual 2-cycle latency for integer operations. And of course the integer mov reg, reg has a single-cycle latency, not zero, so the integer execution layout was not done with move elimination in mind.

Move elimination is a bigger deal for integer because AVX can already be used to perform three-operand arithmetic with implicit moves.
Thanks for the information. :) I wonder if and how fast it could be implemented in future versions of BD, since I think it would particularly benefit the architecture. IB is already getting it next year....

Exophase
10-09-2011, 12:56 PM
Thanks for the information. :) I wonder if and how fast it could be implemented in future versions of BD, since I think it would particularly benefit the architecture. IB is already getting it next year....

Could be in Piledriver for all we know! I wonder when we'll start getting some info on what improvements it'll have.. Maybe not too long after Zambezi is out..

CRoland
10-12-2011, 08:57 AM
BD optimized x264 results. (http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-5.html)

XOP and AVX versions have almost the same speed. I wonder about the specifics of those two versions and if some combination of them would do better. Anyone have the source?

Edit: BTW, total FPS after both passes:

BD AVX: 29.3
BD XOP: 29.2
SB AVX: 25.0
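For a quick sense of scale, here are the relative differences worked out in Python. The three FPS figures are the ones listed above; everything else is plain arithmetic with names of my own choosing.

```python
# Total FPS after both passes, as listed above.
fps = {"BD AVX": 29.3, "BD XOP": 29.2, "SB AVX": 25.0}

baseline = fps["SB AVX"]
relative = {name: value / baseline for name, value in fps.items()}

for name, ratio in relative.items():
    print(f"{name}: {fps[name]:.1f} fps, {(ratio - 1) * 100:+.1f}% vs SB AVX")
```

Both BD builds come out roughly 17% ahead of the SB AVX result, and the gap between the XOP and AVX builds is about 0.1 fps, which may well be within run-to-run noise.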