What do you get when you multiply current ARM GPUs by 1.5? You get Mimir, also known as Mali-G71. Yes, the new generation of Mali graphics is here and it has a lot of changes from the last ones SemiAccurate told you about.
So Mimir is finally here and it is the first of the new generation of Bifrost architecture parts which replace the current Midgard GPUs. Those with a long memory will recall that Midgard replaced Utgard, the marketing names go Mali-XXX, Mali-TXXX, and now Mali-GXX. Just in case you are wondering, T stands for Tri-Pipe, G stands for either graphics or Gary, the guy who sweeps the floors at ARM, and XXX stands for…. numbers.
The ARM GPU progression
So what does this new family bring to the table? Full coherency, 32 shader cores, lower latency, and a lot higher performance, all with far greater efficiency. This may sound simple enough but to get there ARM made significant changes to both the basic shader architecture and how it is fed. Let’s take a look at all of this in more detail.
Starting out with the main bullet point, coherency, you of course need a fully coherent interconnect. This is the CCI-550 interconnect we told you about above which supports, wait for it, full GPU coherency. While you could probably put a Mimir cluster on an older interconnect, that would be rather pointless because you are throwing out its main performance benefit, fine-grained shared virtual memory.
Only 90% better?
If you haven’t caught on to the trend by now, let us point out that this needs full coherency. How important is this feature? ARM claims an 81% reduction in memory overhead with coarse-grained shared virtual memory, and yes this requires full coherency too, and more with fine-grained. Just in case you totally missed the slide above, fine-grained will drop it further to 90% less than the non-coherent version. Yes, fine-grained also requires coherency, but aren’t those minor savings, and the work they spare you, worth implementing the correct bus for? We think so too.
One of these nodes is not a power of 2
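For a sense of scale, here is the arithmetic behind those two percentages as a small sketch. It assumes both figures are measured against the same non-coherent baseline, which is how we read the slide; the variable names are ours, not ARM's.

```python
# Both percentages are ARM's claims relative to a non-coherent baseline
# (assumption: the same baseline is used for both figures).
COARSE_REDUCTION_PCT = 81
FINE_REDUCTION_PCT = 90

coarse_left = 100 - COARSE_REDUCTION_PCT  # overhead remaining with coarse-grained SVM
fine_left = 100 - FINE_REDUCTION_PCT      # overhead remaining with fine-grained SVM

# How much of the coarse-grained remainder does fine-grained remove on top?
extra_saving_pct = 100 * (coarse_left - fine_left) / coarse_left

print(f"coarse-grained leaves {coarse_left}% of the overhead")
print(f"fine-grained leaves {fine_left}% of the overhead")
print(f"fine-grained trims the remainder by a further {extra_saving_pct:.1f}%")
```

Put another way, if the claims hold, going from 81% to 90% removes nearly half of what coarse-grained left behind.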
Added scalability comes with the new architecture which can now range from MP4-MP32 or 4-32 shader core configurations. The granularity of these jumps is of course power of two except for the MP6 model but polite dinner conversations don’t include talk of the odd children so we won’t go there. As you can see ARM expects phones to be in the 6-8 core range for mainstream, double that for premium, and four times the count for those not keen on long battery life. As an interesting aside ARM had a slide that showed a Mali-G71 being competitive with a “2015 Discrete Laptop GPU”. Mobile graphics being only two years behind on Manhattan 3.1 @1080p is impressive.
The lower latency claim is not a direct technical change, just a result of a lot of little changes that all add up to better net numbers. Because of VR and the headaches that induces, both for developers and users, latency is seen as key. ARM is claiming their goal is a 4ms latency from sensor to screen in VR environments but as usual your mileage will vary. Since the new G71 supports 4K screens at 120Hz refresh with 4x MSAA, getting this down to 4ms with all the features turned on looks like a bit of a stretch at the moment. Still, with more reasonable settings that number should be quite achievable barring ham-handed clods of VR programmers.
That brings us to energy efficiency and the official claims: 1.5x the performance, 20% higher energy efficiency, 40% better performance density, and 20% better bandwidth utilization. It doesn’t take a genius to realize that doing more work with fewer transistors and fewer I/Os will use less juice, but how? That is a long story which starts out at the base execution units, the shader cores.
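To see why 4ms with everything turned on looks like a stretch, a quick back-of-the-envelope sketch helps. It assumes "4K" means 3840x2160 and counts only one MSAA sample per pixel per refresh with no overdraw; the function names are ours.

```python
def frame_time_ms(refresh_hz: float) -> float:
    """Time budget for one refresh interval, in milliseconds."""
    return 1000.0 / refresh_hz


def msaa_samples_per_second(width: int, height: int, refresh_hz: int, msaa: int) -> int:
    """Raw sample count per second, ignoring overdraw and blending."""
    return width * height * refresh_hz * msaa


WIDTH, HEIGHT = 3840, 2160  # assumption: "4K" here means UHD
TARGET_LATENCY_MS = 4.0     # ARM's stated sensor-to-screen goal

budget = frame_time_ms(120)
samples = msaa_samples_per_second(WIDTH, HEIGHT, 120, 4)

print(f"one 120Hz refresh interval: {budget:.2f} ms")
print(f"latency target as a share of that interval: {TARGET_LATENCY_MS / budget:.0%}")
print(f"samples per second at 4x MSAA: {samples:,}")
```

One refresh interval at 120Hz is already about 8.3ms, so the 4ms goal is less than half a frame before any real work is counted.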
The old Midgard way is 3/4 good
You probably don’t remember but the Midgard architecture you know and love is a four wide architecture four stages deep. Each cycle one thread, aka a triangle or quad, is issued to the execution units. Since they are four wide they can take a full quad a cycle which is a really good thing. Unfortunately most game developers seem stuck on triangles which tend to use only three of the SIMD vector lanes. This is bad but modern power gating means it won’t consume hideous amounts of power, it just doesn’t utilize the hardware to its maximum potential often. The technical term for this is inefficiency.
The first big change in Bifrost is what looks like a rotation of the execution units by 90 degrees also known as a scalar architecture. Four threads are executed in parallel and instead of one issued per clock as in Midgard, you have four issued every four clocks. ARM calls this bundle of four threads a quad which should not be confused with a four point polygon which is also called a quad. For the sake of clarity we will mark the polygon quads with HTML6.2 red tags, the instruction bundle quads with blue tags so if you don’t see those differentiators, make sure your browser is set correctly. (Note: This is a joke, you can stop digging through your menus now, really.)
Bifrost shader cores are 4/4 good
This seemingly minor change means that if your app issues tris instead of quads (pretend that last word is red, bold, blinking, and other things that may be introduced if there is ever an HTML6.2), the control logic of Mimir has to issue a quad (blue and blinky etc) every three cycles. The side effect of this change is of course full utilization of the shader cores and a potentially massive gain in efficiency. Even if the power gating is 100% effective on the Midgard architecture, Bifrost will have 33% less uncore overhead to do the same work. You can take this as added efficiency, added performance, or both.
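The utilization argument can be sketched with a toy model. This is our simplification, not ARM's scheduler: each thread needs a fixed number of vector components per instruction (3 for the tri case, 4 for a full quad), both designs are four lanes wide, and the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

LANES = 4  # both designs are four lanes wide in this toy model


def vector_issue(threads: int, components: int):
    """Midgard-style: one thread per cycle, one component per lane."""
    cycles = threads
    used_lane_cycles = threads * components
    return cycles, Fraction(used_lane_cycles, cycles * LANES)


def scalar_quad_issue(threads: int, components: int):
    """Bifrost-style: four threads side by side, one component per cycle each."""
    quads = ceil(threads / LANES)
    cycles = quads * components
    used_lane_cycles = threads * components
    return cycles, Fraction(used_lane_cycles, cycles * LANES)


for components in (3, 4):
    v_cycles, v_util = vector_issue(8, components)
    s_cycles, s_util = scalar_quad_issue(8, components)
    print(f"{components} components: vector {v_cycles} cycles at {v_util}, "
          f"scalar quads {s_cycles} cycles at {s_util}")
```

In this model eight three-component threads take eight cycles at 3/4 utilization on the vector layout and six cycles at full utilization on the scalar one. Note that the scalar layout only stays full if threads arrive in multiples of four; a partly filled bundle wastes lanes too.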
This isn’t to say there are no problems with a scalar architecture and quad bundles (blue blink blink, oh you get the idea, we will stop with the asides on this topic for now), there are. One big one is called divergence which for example could be two threads with an or statement followed by two more threads. This would destroy efficiency but luckily it is a well-known problem which both developers and tools are well aware of. The take home message is that quads and scalar architectures are vastly more efficient but can have pitfalls too if you aren’t careful. This caveat can be equally applied to almost any architecture come to think of it, but we are only talking about Mimir here.
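A rough sketch of why divergence hurts, using the generic lockstep model where a bundle that disagrees on a branch runs both sides with the inactive lanes masked off. The article does not say exactly how Bifrost handles this, so treat the model, the cycle costs, and the function names as assumptions.

```python
def quad_branch_cycles(conditions, then_cost: int, else_cost: int) -> int:
    """Cycles a lockstep bundle spends on an if/else.

    Each side of the branch is executed if at least one thread wants it.
    """
    cycles = 0
    if any(conditions):
        cycles += then_cost
    if not all(conditions):
        cycles += else_cost
    return cycles


def quad_branch_utilization(conditions, then_cost: int, else_cost: int) -> float:
    """Share of lane-cycles that do useful work across the branch."""
    useful = sum(then_cost if taken else else_cost for taken in conditions)
    total = quad_branch_cycles(conditions, then_cost, else_cost) * len(conditions)
    return useful / total


uniform = [True, True, True, True]
split = [True, True, False, False]

print(quad_branch_cycles(uniform, 10, 10), quad_branch_utilization(uniform, 10, 10))
print(quad_branch_cycles(split, 10, 10), quad_branch_utilization(split, 10, 10))
```

With equal-cost sides, a two-and-two split doubles the cycles and halves the utilization under this model, which is why keeping the threads of a bundle on the same path matters.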
Bifrost shader block diagram
This execution model allows you to get an understanding of the layout of the Bifrost core design, specifically the Quad (see no red/blue blinky this time) Manager block that replaces Midgard’s Thread Manager. The other major change is the control fabric which connects nearly everything; Midgard had a much simpler Control Bus that barely connected anything.
Do take note of the compute and fragment front ends, they each make quads in their respective quad creator blocks. Pixel related tasks go to the fragment front end, compute tasks to the compute front end, and quads come out of the creator blocks. These are then fed into the quad manager which does roughly what its name implies and feeds the execution engines with pre-bundled quads. This means there is only one fetch per quad pipe saving time, energy, and transfer bandwidth.
Shader execution path
These execution units are on par with most modern GPU math units with 4x32b multiplier/FMA units and 4x32b adders. Special functions are included in the adders and the rest is control overhead. For those into minutiae, a 32b float can execute 2x16b float operations and integer ops can be 32b, 2x16b, or 4x8b. These math pipes are smaller and more efficient than those in Midgard and have higher utilization. One internal change is that an instruction word now contains two instructions. Caches/storage at this level is roughly doubled and everything is more efficient on a microarchitecture level.
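The packed integer modes are easy to picture with a little bit twiddling. The sketch below treats a 32-bit word as one, two, or four independent sub-lanes and adds them element by element. Wraparound on overflow is our assumption (the article does not say whether the hardware wraps or saturates), and the helper names are hypothetical.

```python
WORD_BITS = 32
LANES = 4  # 32-bit lanes per execution engine, per the article


def packed_add(a: int, b: int, width: int) -> int:
    """Add two 32-bit words as independent sub-lanes of `width` bits.

    Assumption: each sub-lane wraps around on overflow.
    """
    mask = (1 << width) - 1
    out = 0
    for shift in range(0, WORD_BITS, width):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        out |= (lane_sum & mask) << shift
    return out


def sub_ops_per_cycle(width: int) -> int:
    """Element operations per cycle if every lane runs one packed op."""
    return LANES * (WORD_BITS // width)


print(hex(packed_add(0x01FF0203, 0x01010101, 8)))   # four 8-bit adds
print(hex(packed_add(0x0001FFFF, 0x00010001, 16)))  # two 16-bit adds
for width in (32, 16, 8):
    print(width, sub_ops_per_cycle(width))
```

The carry out of one sub-lane never leaks into its neighbor, which is the whole point of a packed mode: same wires, more elements per trip.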
Clause scheduling in action
Stepping out another level, a lot of this efficiency is enabled by the use of clause scheduling. What this does is break threads up into blocks that are bounded by instructions that need scheduling decisions, think branches and the like. This allows code to be chunked into blocks that have an architecturally visible state guaranteed after each clause is executed, vastly simplifying scheduling. This is the long way of saying that execution is guaranteed for the length of each clause. Overhead is also reduced from every instruction to every block, again saving power.
This allows for some really aggressive optimizations in scheduling because you are guaranteed to not have pitfalls inside of the clause itself. Variable latency ops are only allowed at clause boundaries for example so there is no guesswork there. Why? Because GPUs are massively threaded processors for massively threaded workloads, if you interleave threads on a clause level, a long latency or stalled op will finish before the next clause in the thread is executed. Meanwhile a lot of useful work is done on other threads saving a lot of energy and complexity in upstream units.
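A minimal sketch of the idea, assuming made-up op names and a plain round-robin policy. The real Bifrost ISA and scheduler are not described here, so `BOUNDARY_OPS`, `split_into_clauses`, and `interleave` are purely illustrative.

```python
# Hypothetical op names; ops in this set need a scheduling decision or have
# variable latency, so in this sketch they may only end a clause.
BOUNDARY_OPS = {"branch", "load", "texture"}


def split_into_clauses(instructions):
    """Cut an instruction list so each clause ends at a boundary op."""
    clauses, current = [], []
    for op in instructions:
        current.append(op)
        if op in BOUNDARY_OPS:
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses


def interleave(programs):
    """Round-robin one clause per thread; returns (thread_id, clause) pairs."""
    pending = {}
    for tid, program in enumerate(programs):
        clauses = split_into_clauses(program)
        if clauses:
            pending[tid] = clauses
    schedule = []
    while pending:
        for tid in list(pending):
            schedule.append((tid, pending[tid].pop(0)))
            if not pending[tid]:
                del pending[tid]
    return schedule


thread0 = ["fma", "add", "load", "fma", "branch", "add"]
thread1 = ["add", "load", "mul"]
for tid, clause in interleave([thread0, thread1]):
    print(tid, clause)
```

The scheduler here makes five decisions for nine instructions, and every `load` is followed by a clause from the other thread before its own thread runs again, which is where the latency hiding comes from.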
The main Bifrost architecture
Moving out from there we have the main Bifrost architecture. At this level it looks fairly generic but as we have just covered the shader cores are radically different, especially at the blinking color level. Seriously though, the blocks listed here and their interconnects do not represent a radical change in form, it is how they operate where the magic lurks.
Take for example the tiler unit. Midgard had one too and Bifrost uses the same hierarchical design but with redesigned memory structures. This eliminates minimum buffer allocations and lets the system allocate buffers in a much more granular manner. Micro-triangle elimination reduces needless primitives and saves memory too, and everything together reduces the memory footprint by a claimed 95%. This again adds efficiency but also works just like the old one.
Index driven position shading
A similar theme can be found in the use of index-driven position shading which starts with a split vertex shader. This allows the tiler to cull geometry and not write it back, saving lots of bandwidth. Some steps are the same but others have massive drops in bandwidth, ARM claims about 40% less is used. This results in more speed and less energy used, ARM claims ~20-30MB saved per frame on T-Rex.
With that we come to the memory subsystem itself. It doesn’t have any big bangs other than the headline advance for the product, CPU/GPU coherency. Since most coherency on a core level is done through the L2 cache, Mimir/Bifrost has a single logical, did we mention coherent, L2 cache. This is also directly connected to the ACE memory bus and then the CCI-550 interconnect. No green blink joke needed, it should be obvious here, right?
As you would expect the memory system and the entire GPU is TrustZone aware and works with modern A-series cores, display processors, and video processors. This isn’t really new but the content a G71 GPU touches can be ‘secure’ as well as coherent. One thing to add which sort of ties into this memory and security theme is virtualization. As is the norm with most SoCs, this is the point where we say that the device in question is fully hardware virtualizable. Unfortunately in this case the G71/Mimir/Bifrost is not hardware virtualizable but we expect this to happen in the next version or two.
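The bandwidth argument can be sketched with a toy write-back model. Everything numeric here is made up for illustration (vertex counts, bytes per attribute, the share of vertices that survive culling), and the model assumes culled vertices write nothing back at all; none of it comes from ARM's slides.

```python
def classic_writeback_bytes(vertices: int, pos_bytes: int, varying_bytes: int) -> int:
    """Shade everything, then write position and varyings for every vertex."""
    return vertices * (pos_bytes + varying_bytes)


def split_writeback_bytes(visible: int, pos_bytes: int, varying_bytes: int) -> int:
    """Shade positions first, cull, then write back only the survivors.

    Assumption: culled vertices cost no write-back bandwidth in this model.
    """
    return visible * (pos_bytes + varying_bytes)


# Hypothetical frame: 100,000 vertices, half of them culled by the tiler.
VERTICES, VISIBLE = 100_000, 50_000
POS_BYTES, VARYING_BYTES = 16, 32

classic = classic_writeback_bytes(VERTICES, POS_BYTES, VARYING_BYTES)
split = split_writeback_bytes(VISIBLE, POS_BYTES, VARYING_BYTES)

print(f"classic: {classic:,} bytes")
print(f"split:   {split:,} bytes")
print(f"saved:   {classic - split:,} bytes ({(classic - split) / classic:.0%})")
```

The saving in this model is simply the culled fraction, so how close a real workload gets to ARM's figures depends on how much of its geometry ends up off-screen or back-facing.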
And that is the brief overview of Mimir/Bifrost/Mali-G71. On a macro level it brings coherency and fine-grained virtual memory to the mobile GPU party. As you move down to the blocks and then the details of the sub-units, more and more is changed. At the number twiddling level the basic architecture changes from vector to scalar, a 90 degree turn which ends up with massive efficiency gains. Those gains percolate back up to the point where ARM is claiming a 1.5x improvement in performance for Bifrost over Midgard. S|A