AMD’s Polaris is finally here and SemiAccurate brings you the details of this upgrade. On the surface it looks fairly similar to its predecessors but there are a lot of important changes under the hood.
On a macro level, Polaris 10 aka Radeon RX 480 looks much the same as past GCN generations but the bump from 3rd to 4th generation involves substantial changes. None of these are game changing headline features, pun intended, but they all add up to a substantially better part than anything that came before it. Let’s start with the obligatory block diagram.
Only a few minor changes at this level
Most of this looks like what you have seen before with two main exceptions, the L2 cache and the blocks labeled HWS. L2 has been bumped up to 2MB which adds a lot to overall efficiency and to memory bandwidth efficiency in particular, two themes you will see more of later on. It also brings a lot to the table for GPU compute and for smoothing out frame rates, which is useful in VR settings.
Speaking of smoother VR and compute, the other main addition is the pair of HWS or Hardware Scheduler blocks. These are very important bits for Asynchronous Compute and for keeping a lid on things in general. If you have been following the evolution of AMD’s HSA, you know queues and task parsing are a main focus point of the technology. HWS adds a dedicated hardware unit for those functions.
As the name suggests, HWS schedules tasks, a non-trivial job in a device with 36 CUs, each with 64 shaders, for 2304 in total. That is a lot of mouths to feed, so hardware is almost a necessity rather than a helpful addition. These units can essentially slice up a GPU’s units either spatially or temporally and assign tasks to them. Programming of the HWS units is done in microcode so don’t expect direct access for your first indie title; it is the domain of driver writers, embedded folks, and low-level VR developers.
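To make the numbers and the two kinds of slicing concrete, here is a minimal Python sketch. The CU and shader counts come from the text above; the functions `spatial_split` and `temporal_split` are hypothetical illustrations of the idea, not anything AMD exposes, since the real HWS is programmed in microcode.

```python
# Figures from the article: 36 CUs, 64 shaders per CU.
CU_COUNT = 36
SHADERS_PER_CU = 64


def total_shaders(cus=CU_COUNT, per_cu=SHADERS_PER_CU):
    """Total shader count for the chip."""
    return cus * per_cu


def spatial_split(reserved_cus, cus=CU_COUNT):
    """Hypothetical spatial partition: set aside some CUs for one task.

    Returns (reserved CU indices, remaining CU indices).
    """
    if not 0 <= reserved_cus <= cus:
        raise ValueError("reserved_cus must be between 0 and the CU count")
    return list(range(reserved_cus)), list(range(reserved_cus, cus))


def temporal_split(frame_ms, reserved_fraction):
    """Hypothetical temporal partition: set aside a share of each frame's time.

    Returns (milliseconds reserved, milliseconds left for everything else).
    """
    if not 0.0 <= reserved_fraction <= 1.0:
        raise ValueError("reserved_fraction must be between 0 and 1")
    reserved_ms = frame_ms * reserved_fraction
    return reserved_ms, frame_ms - reserved_ms
```

Calling `total_shaders()` gives 2304, matching the figure above. Reserving 4 CUs with `spatial_split(4)` leaves 32 for everything else.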
For VR you may have heard the terms Asynchronous Compute, AMD’s Liquid VR, preëmption, and Quick Response Queue (QRQ). All of these things need the ability for one task, usually a compute based one, to interrupt a graphics job. Graphics jobs tend to be long running, since millions of pixels per frame mean a lot of calculations, but they are relatively latency insensitive as long as they get done before the frame is needed by the monitor. Compute tasks, especially sound and VR/positioning, are significantly more latency sensitive.
How asynchronous concepts work
Without HWS and guaranteed time or resource availability, things get ugly for async tasks. With HWS you can guarantee that parts of a GPU will always be available for a task, that a task will always have priority access to a subset of the shaders, or a mixture of spatial and temporal quality of service options. This is deadly important for VR and most modern graphics APIs like Vulkan and DX12, plus it is the functionality needed by Oculus in their Time Warp feature. There are two HWS blocks just to spread the work out; each one can access all of the GPU but can only pull requests from so many queues at once.
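Why preëmption matters for latency-sensitive work is easy to show with a toy simulation. The Python sketch below models a single execution resource and two queues. The `Task` class, the `simulate` function, the time-slice model, and the task sizes are all assumptions made up for illustration; they say nothing about how the HWS microcode actually works.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                          # lower number = more urgent
    arrival: int                           # time slice when the task is submitted
    name: str = field(compare=False)
    remaining: int = field(compare=False)  # time slices of work left


def simulate(tasks, preempt=True):
    """Run tasks on one resource and return {name: finish time in slices}.

    With preempt=True the scheduler re-evaluates priorities after every
    slice, so an urgent arrival can interrupt a long-running job. With
    preempt=False a task that has started runs to completion.
    """
    pending = sorted(tasks, key=lambda t: t.arrival)
    ready, finish = [], {}
    clock, i = 0, 0
    while i < len(pending) or ready:
        # Move everything that has arrived by now into the ready queue.
        while i < len(pending) and pending[i].arrival <= clock:
            heapq.heappush(ready, pending[i])
            i += 1
        if not ready:
            clock = pending[i].arrival     # idle until the next arrival
            continue
        task = heapq.heappop(ready)
        slices = 1 if preempt else task.remaining
        task.remaining -= slices
        clock += slices
        if task.remaining == 0:
            finish[task.name] = clock
        else:
            heapq.heappush(ready, task)
    return finish


def demo_tasks():
    """A long graphics job plus a short, urgent audio job arriving mid-frame."""
    return [Task(1, 0, "graphics", 10), Task(0, 3, "audio", 2)]
```

With `simulate(demo_tasks(), preempt=True)` the audio job finishes at slice 5 and graphics at slice 12. With `preempt=False` graphics finishes at slice 10 but audio has to wait until slice 12, which is the kind of delay that turns sound into a crackly mess.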
Other than VR, one place where this async functionality benefits the user is sound. You might remember AMD’s True Audio, basically a DSP block attached to the GPU with low latency access to the geometry data stored on it. What neither AMD nor Sony will tell you is that both True Audio and the PS4 sound system are the same thing. With HWS AMD implemented something called True Audio Next (TAN – something SemiAccurate will go out on a limb and call True Audio Neo for no particular reason).
TAN does not use the DSP block, in fact it is not there in the Polaris generation of GPUs. Without dedicated hardware, audio becomes a crackly mess because of timing and delays. Async makes it better, preëmption makes it workable if not good, and QRQ is needed for VR-useful 3D sound. This all works because if you know you are going to do 3D sound with reflections that are accurate to the geometry of the current 3D environment, you need a lot of compute in a tight time window. HWS can provide that by dedicating time or shader blocks to the task; without it TAN would be a mess if it worked at all. If you want to play with this feature, there should be libraries available at GPUOpen.com soon.
Related to this and using many of the underlying blocks is a feature called Variable Rate Shading. What this allows a programmer to do is control quality and rendering features based on zones of the screen. If the center of the screen is where the action takes place, it can get more detail and the corners less. If you add eye tracking to the picture it becomes foveated rendering, aka higher resolution where you are looking. HWS can simplify this work quite a bit.
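The zone idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name `shading_rate`, the three coarseness levels, and the two distance thresholds are invented, and real implementations pick their own zones and rates.

```python
import math


def shading_rate(px, py, focus, width, height, inner=0.15, outer=0.35):
    """Pick a shading coarseness for the pixel at (px, py).

    Returns 1 for full detail, 2 for one shaded sample per 2x2 block,
    and 4 for one per 4x4 block. `focus` is the point that should get
    the most detail: the screen center for fixed zones, or the gaze
    point if eye tracking is available (foveated rendering).
    """
    fx, fy = focus
    # Normalize by the screen diagonal so thresholds are resolution independent.
    dist = math.hypot(px - fx, py - fy) / math.hypot(width, height)
    if dist <= inner:
        return 1
    if dist <= outer:
        return 2
    return 4
```

On a 1920x1080 screen with the focus at the center, the center pixel gets rate 1 and the corners get rate 4. Feeding in a gaze position instead of the center turns the same function into a crude foveated renderer.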
Shaders have more caches
Those are the two main bullet point additions to the hardware on a macro level; the rest are more detailed changes, especially in the shaders. Those SIMD units are the same x4 units as the last several generations of AMD GPUs but their efficiency has been increased quite a bit by prefetch and caches. Prefetch is better, which means fewer stalls and better cache residency, the same as on CPUs, both aided by the aforementioned larger L2. This L2 is also more efficient to access and requests to it can now be grouped. The instruction buffer for waves is now deeper too, so there are fewer stalls waiting for work. About the only really new feature is support for 16b Int and FP operations. This all adds up to a claimed 15% performance increase on the shader block level vs Hawaii.
Sticking with the efficiency theme is an addition to the geometry engine called Primitive Discard Accelerator (PDA). What this does is pretty simple. A lot of geometry is never seen because it is behind something else, is on the back side of a polygon, or is simply too small to be displayed, i.e. sub-pixel. Normally these primitives are thrown out in the rendering process but there is a lot of compute used on them before they get tossed.
PDA pulls the discard process earlier into the pipeline, potentially allowing all those wasted cycles to be used on something beneficial to the user. More importantly this can now happen before tessellation, saving a multiple of those wasted cycles. If done properly this can result in huge power savings or much higher geometry throughput.
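For the curious, two of the discard tests described above are easy to sketch in software. The Python below is a generic illustration of back-face and sub-pixel culling on screen-space triangles, not AMD’s hardware logic. It assumes counter-clockwise front faces in a y-up coordinate system and pixel centers at half-integer coordinates, and it leaves out the occlusion case because that needs depth information.

```python
import math


def signed_area(tri):
    """Signed area of a 2D triangle; positive when the vertices run
    counter-clockwise in a y-up coordinate system (an assumption here)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def misses_all_pixel_centers(tri):
    """True if the triangle's bounding box contains no pixel center.

    Pixel centers are assumed to sit at k + 0.5. The range [lo, hi]
    holds a center only if some integer k satisfies lo <= k + 0.5 <= hi.
    """
    def empty(lo, hi):
        return math.ceil(lo - 0.5) > math.floor(hi - 0.5)

    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return empty(min(xs), max(xs)) or empty(min(ys), max(ys))


def should_discard(tri):
    """Decide whether a triangle can be thrown out before any shading."""
    if signed_area(tri) <= 0:
        return True   # back-facing or zero-area (degenerate)
    return misses_all_pixel_centers(tri)
```

The bounding box test is conservative: it never discards a triangle that could touch a pixel center, but it may keep a few thin ones that turn out to cover nothing.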
Another addition to the geometry engine is called the Index Cache. This is a small cache for instanced geometry, basically things that are replicated on the screen a lot. Normally the GPU would have to go to memory or L2 cache to pull in the mesh for that geometry every time it is needed by the pipeline. With the Index Cache it is stored locally, again saving power and memory bandwidth, again with the efficiency.
In case you are getting bored hearing about glorious detail changes that enhance efficiency, we bring you word of Polaris’ memory controller. It is chock full of glorious detail changes that enhance efficiency, starting with Delta Color Compression (DCC). This was first seen in the Fury X GPUs and has been enhanced in Polaris.
DCC for more memory bandwidth
Although the width is only 256b vs 384b on the 290/380 cards, AMD claims that Polaris is actually more efficient and has higher effective memory bandwidth. Part of this is the larger L2, part of this is due to the new 14nm process, but most of this is due to DCC which now supports 2/4/8:1 compression. Every time you can compress data you send less to memory, store less in cache, and save power. Better yet that bandwidth can be used for more important things. Coupled with 8Gbps GDDR5, Polaris has a lot better usable memory bandwidth than the numbers suggest.
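The raw number is simple arithmetic, and a rough model shows why compression makes it look better than it is. In the Python sketch below the bus width and the 8Gbps pin speed come from the text above. The `effective_bandwidth_gbs` model and the fraction of compressible traffic are assumptions for illustration only, since AMD does not publish how much traffic actually compresses.

```python
def raw_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: one data pin per bus bit, 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


def effective_bandwidth_gbs(raw_gbs, compressible_fraction, ratio):
    """Rough effective bandwidth when part of the traffic is compressed.

    Assumes a share `compressible_fraction` of all traffic shrinks by
    `ratio` (2, 4, or 8 for DCC) and the rest moves uncompressed.
    """
    bytes_moved = (1 - compressible_fraction) + compressible_fraction / ratio
    return raw_gbs / bytes_moved
```

`raw_bandwidth_gbs(256, 8)` works out to 256 GB/s. If, hypothetically, half of the traffic compressed at 2:1, the same bus would behave like one delivering roughly 341 GB/s.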
Moving on to video, there is a new encode/decode block which at a high level adds HEVC 4K60 encode support. This may sound anticlimactic given the state of modern cell phones but AMD adds one very useful feature to the mix, 2-pass encoding. By using a downscaled pre-encode pass for rate control guidance along with a few other tricks, the second pass can be much more effective. AMD uses this for both H.264 and HEVC with some visually impressive results. In short you can stream 4K60 video with much better quality and lower bandwidth on an RX 480.
Decode is not forgotten either with three new codecs supported in hardware. HEVC Main-10 at 4K60 is the headliner with 4K VP9 and 4K30 MJPEG added on too. These are useful and the ability to encode a stream and watch something else in a window is now pretty mandatory. All of these things plus all the old encode/decode formats are, wait for it, more efficient too. One other important thing to note is that the video block supports 10b color so HDR will be usable in the Polaris generation.
Moving further down the bitstream we come to the all new display controller; it is probably more efficient too. The changes to this block are the addition of DP 1.3HBR and DP 1.4-HDR support, or at least Polaris will support the latter when it comes out. 1.3HBR allows 4K120 video in 4:4:4 or 5K60 over a single cable. Know any company about to introduce a 5K monitor that may not have announced a GPU supplier yet? DP 1.4-HDR brings 4K96 HDR/10b to the table along with monitor metadata transport (CTA-861.3). 861.3 is probably going to change the display, gaming, and rendering world more than nearly anyone understands but it will take a few years to filter down to the entire pipeline. More when it becomes visible to users.
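A quick back-of-the-envelope check shows where those resolution and refresh combinations come from. The link figures in this Python sketch are not from the article. They assume the DisplayPort HBR3 rate of 8.1 Gbps per lane over four lanes with 8b/10b encoding. The pixel math ignores blanking intervals, so real requirements run somewhat higher.

```python
# Assumed link parameters (DisplayPort HBR3), not taken from the article.
HBR3_LANE_GBPS = 8.1
LANES = 4
ENCODING_EFFICIENCY = 0.8   # 8b/10b: 8 payload bits per 10 bits on the wire


def link_payload_gbps():
    """Usable link bandwidth in Gbps after encoding overhead."""
    return HBR3_LANE_GBPS * LANES * ENCODING_EFFICIENCY


def pixel_rate_gbps(width, height, refresh_hz, bits_per_pixel):
    """Uncompressed active-pixel data rate in Gbps, blanking ignored."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def fits(width, height, refresh_hz, bits_per_pixel):
    """True if the mode's active-pixel rate fits in the link payload."""
    return pixel_rate_gbps(width, height, refresh_hz, bits_per_pixel) <= link_payload_gbps()
```

Under these assumptions the link carries about 25.92 Gbps. 4K120 at 24 bits per pixel needs about 23.89 Gbps and 5K60 about 21.23 Gbps, so both fit. 4K120 at 30 bits per pixel would need about 29.86 Gbps and does not, while 4K96 at 30 bits lands back at 23.89 Gbps, which lines up with the 4K96 HDR/10b figure above.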
Last up we come to process and related tech, starting with the new Globalfoundries 14nm FinFET process. This makes for smaller and more efficient chips but that is only the beginning. The first power lowering technique is called Adaptive Clocking, which senses a power supply droop and downclocks the chip to avoid crashing. This is exactly the same as it was on Carrizo, and the effects, higher bins because of lower guardbands, are the same too. AVFS (Adaptive Voltage and Frequency Scaling) has been updated too so the chip can be run much closer to peak frequencies. Think of this as a more granular and effective PowerTune.
Couple that with power measurement, Boot Time power supply Calibration (BTC) and Adaptive Aging Compensation (AAC) in AMD nomenclature, and you have a more efficient part. BTC effectively takes the manufacturing voltage data from testing and binning, then compensates for that every time the GPU boots. This drops the necessary overcompensation by tailoring voltage at boot for each device, not each bin. AAC does the same but compensates for the added voltage needed as the part ages, a normal occurrence with modern semiconductors.
More specs for Polaris
Add all this up and you get the above, Polaris 10 also known as the Radeon RX 480 GPU. It will be priced at $199 for the 4GB model, $239 for the 8GB version, and is available now. In quantity.
It is a mid-range card aimed at the meat of the market, not the usual high-end halo first launch. AMD wants to democratize VR with Polaris by bringing it to the mass-market price points and the RX 480 has all the pieces to do just that. Chain two together in Crossfire and you have a very solid performing solution for far less than the competition. It should do quite well in the market. S|A