It looks like Imagination is getting ready to introduce their new 7-Series GPUs in both 7XT and 7XE forms. If you liked the 6-Series ‘Rogue’ GPUs, this new iteration will be both familiar to you and better for you.
Like the 6-Series, the 7-Series comes in high-end 7XT form for performance-minded applications and 7XE form for area- and power-restricted devices. It ranges from a high of 256 ALU pipelines, 16 clusters of 16 pipes on the XT, down to fractions of one cluster in the XE. Since each pipe officially has two ALUs that puts the ‘core’ count at 512, quite impressive for a mobile GPU. Let’s take a look at the details.
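For those who want to check the ‘core’ math, here is a minimal Python sketch. The function name and its parameters are ours, made up for illustration; only the 16 pipes per cluster and the two officially counted ALUs per pipe come from the numbers above.

```python
def core_count(clusters, pipes_per_cluster=16, alus_per_pipe=2):
    """Marketing 'core' count: clusters x pipes per cluster x officially counted ALUs per pipe."""
    return clusters * pipes_per_cluster * alus_per_pipe


# Top 7XT configuration: 16 clusters of 16 pipes, two counted ALUs each.
print(core_count(16))  # 16 * 16 * 2 = 512

# A single full cluster, the largest 7XE configuration.
print(core_count(1))   # 1 * 16 * 2 = 32
```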
The core of a 7XE GPU, the USC
The lowest end parts are the 7XE line, which has one Unified Shading Cluster (USC) array of 16 pipelines as a maximum. If that is too much you can halve or quarter it to a mere 4 pipes, but each pipe is more capable than you might think. As you can see above there are seven ALUs in each one: 2x 32b, 4x 16b, and one special function unit. This is counted as two ALUs for technical reasons but we applaud Imagination’s honesty here, if only everyone would take the high road in naming….
Back to the tech, you might recall that the 6-Series GPUs had roughly the same layout as the 7-Series. The 7-Series is said to be much faster than the last generation and one of the main advances is in the special function unit. In the 7XE and XT that special function unit can co-issue with the ALUs; in the 6-Series it could not. This removes a major throughput bottleneck because special function calls no longer block the ALUs like they used to.
Another interesting point is that the 16b and 32b units are physically separate hardware; Imagination does not chain 2x 16b units to get a 32b ALU. This is done for power savings and if the numbers in the benchmarks hold up in customer devices, it will be a worthwhile tradeoff. Unfortunately you cannot co-issue 16b and 32b commands. Pick one or the other, plus a special function unit command.
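The issue rules described above can be written down as a toy check. This is only a sketch of the rules as stated in this article, not Imagination's scheduler; the function name, the `series` parameter, and the op labels `fp16`, `fp32`, and `sfu` are hypothetical names chosen for illustration.

```python
ALLOWED_KINDS = {"fp16", "fp32", "sfu"}  # hypothetical labels for the three unit types


def can_co_issue(kinds, series=7):
    """Return True if this mix of op kinds could share one issue slot under the article's rules."""
    kinds = set(kinds)
    if not kinds or not kinds <= ALLOWED_KINDS:
        return False
    if {"fp16", "fp32"} <= kinds:
        return False  # 16b and 32b ALUs are never issued together
    if series == 6 and "sfu" in kinds and len(kinds) > 1:
        return False  # 6-Series: the special function unit could not co-issue with the ALUs
    return True


print(can_co_issue({"fp32", "sfu"}))            # 7-Series: ALU op plus special function op
print(can_co_issue({"fp16", "fp32"}))           # mixing 16b and 32b is not allowed
print(can_co_issue({"fp32", "sfu"}, series=6))  # the 6-Series bottleneck
```

The last call is the case the 7-Series fixes: the same pairing that is rejected for `series=6` is accepted for the default `series=7`.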
The bits that wrap around a USC aka a 7XE GPU
The 7XE line has a core feature set of OpenGL ES 3.1, OpenCL 1.2, and some feature advances over the 6-Series line. One of the most important advances was to put more instances on the Data Master that feeds the ALUs. The 6-Series had a single stream, the 7-Series can have many, which allows it to hide latency across multiple streams. There are other advances that come into play more in the 7XT but suffice it to say everything is a bit better on the 7XE than on the 6XE, even the ISA was cleaned up and streamlined.
Since it is focused on power and area efficiency, things not needed for the scaling found in the 7XT were left out. Other than that, the 7XE and 7XT are the same device if they are optioned to the same feature level. There are four of those options on the 7XE line: a Compression Pack for high bandwidth targets like 4K TVs, an HEVC pack with 10b and YUV support for video devices, a Security and Virtualization pack for the content MAFIAA and automotive users, and an Android Extension Pack for phones and tablets. Together the core 7XE features and the four packs make the base feature set for the 7XT.
Three of these options on the 7XE are pretty straightforward, the Security and Virtualization pack is not. In the typical ARM based devices where you find Imagination GPUs there is a way to have a secure area alongside a non-secured area. This basically forms what ARM calls TrustZone and it does what it says, it keeps the hot side hot and the cold side cold or something like that. Unfortunately if you have two apps running in a secure area there is nothing to keep them from doing bad things to each other.
That is where Imagination is upping the game: the 7-Series GPUs support multiple secure areas alongside non-secure areas. Each secure zone is independent from the others so a malicious app that manages to run as secure is fenced off to only its own memory and access location. The GPU also has a microcontroller that can cryptographically attest to its own security, basically secure boot on the GPU side. Technically speaking instead of the system MMU having a 0/1 security state, the 7XE/XT can utilize a few more bits for the security state, at least three and possibly quite a few more.
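To see why more than one security bit matters, here is a deliberately simplified Python model. It is a sketch under our own assumptions, not how the hardware or TrustZone actually checks accesses: every memory page carries a zone ID, a requester may only touch pages tagged with its own ID, and the names `zones_available` and `may_access` are invented for the example.

```python
def zones_available(state_bits):
    """Number of distinct security states that fit in the given number of bits."""
    return 2 ** state_bits


def may_access(requester_zone, page_zone):
    """Toy rule (our assumption): a requester can only touch pages tagged with its own zone."""
    return requester_zone == page_zone


# One bit: every secure app shares state 1, so nothing separates app A from app B.
print(zones_available(1))   # 2 states: non-secure and secure
print(may_access(1, 1))     # app A reaching app B's secure page is allowed

# More bits: app A lives in zone 1 and app B in zone 2, so they are fenced off.
print(zones_available(3))   # 8 states
print(may_access(1, 2))     # app A reaching app B's page is refused
```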
The main question this brings up is that if the GPU can see and work with multiple separate secure areas, so what? You might recall that while Imagination makes MIPS cores that are perfectly capable of running the devices that these GPUs will be put in, the overwhelming majority of devices have ARM cores. If ARM doesn’t support this multi-bit memory/security mapping, what good does it do? As it turns out, quite a bit because this type of mapping is not done at the CPU core level, it is an SoC fabric based feature. If a device maker implements this multi-bit mapping or licenses a fabric that does, that is all you need to see bits 2-whatever. If they base their system on a MIPS core it is that much easier to implement but regardless, it shouldn’t be much of a trick to use this with ARM cores. If we had to bet we would put money on ARM implementing similar or compatible functions in their next core architecture revision.
That brings us to the big brother of the 7XE, the 7XT. Like we mentioned earlier a 7XE with all four feature packs is the base functionality of the 7XT family. From there you can add up to two feature packs should you feel the need. These are an HPC Feature pack for compute usages and a DX11 pack for Windows phones and tablets should sales not be high on your priority list. It is ironic that Imagination lists only the HPC pack as “niche usage cases only”, the market has spoken otherwise.
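The relationship between the two families and their packs can be laid out as sets. This is only a bookkeeping sketch in Python; the identifiers are our own shorthand for the packs named in this article, and `feature_set` is a hypothetical helper, not anything Imagination ships.

```python
CORE = frozenset({"opengl_es_3_1", "opencl_1_2"})
XE_PACKS = frozenset({"compression", "hevc", "security_virtualization", "android_extension"})
XT_PACKS = frozenset({"hpc", "dx11"})


def feature_set(family, packs=frozenset()):
    """Return the full feature set for a family plus its chosen optional packs."""
    packs = frozenset(packs)
    if family == "7XE":
        if not packs <= XE_PACKS:
            raise ValueError("unknown 7XE pack")
        return CORE | packs
    if family == "7XT":
        if not packs <= XT_PACKS:
            raise ValueError("unknown 7XT pack")
        # The four 7XE packs are part of the 7XT base, only HPC and DX11 are optional.
        return CORE | XE_PACKS | packs
    raise ValueError("unknown family")


# A fully optioned 7XE has the same features as a bare 7XT.
print(feature_set("7XE", XE_PACKS) == feature_set("7XT"))
```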
A 7XT USC is more than a 7XE USC
The main feature of the 7XT over the 7XE is that from the base unit of 16 shader pipelines of seven ALUs, it scales up where the XE only scales down. The 7XT can go up to 16 USCs for a total of 512 ‘cores’. If you look at the diagram above it looks a lot like its little brother with two major changes. Other than the added USCs, you might have noticed the optional FP64 ALU on the bottom. This is where the HPC pack comes into play, it adds a single 64b unit that you can issue a command to instead of the 2x 32b or 4x 16b ALUs. While we didn’t get into the specifics of the implementation, it is separate hardware from the 16b and 32b units, essentially a third option for the issue hardware.
If you recall we said earlier that the 7XE and 7XT were basically the same cores but things were left out. Much of that falls into two categories, scaling and feature sets. If you don’t need to scale beyond one USC, you can leave out a lot of interconnects, coherency functionality, and memory overhead. This saves area, power, and hair on the heads of validation engineers.
Similarly if you don’t need to optionally support a third ALU type and/or DX11 functions, you can leave out the functionality that supports those things too. On the other hand if you don’t need to scale to 1/2 or 1/4 USCs, you can optimize around that possibility too but this is a minor win in comparison. In short the two are very close to the same basic device, a 7XE with all four packs is essentially a one USC 7XT. That said the 7XT line starts at 2x USCs so that extra logic is actually always used.
The 7XT GPU starts where the 7XE GPU finishes
Like the 7XE the 7XT has the same multi-stream data master for added compute throughput and lowered latency. Unlike the 7XE the 7XT based devices are going to put this added capability to good use. Better yet there are additional optimizations to the geometry and memory accesses that while present on the XE don’t do nearly as much. Why? Because they are used to intelligently place data over large numbers of USCs, something that the XE doesn’t support.
The basis for the added performance and power efficiency is hierarchical scheduling and the clustering of tasks on USCs. This functionality was present in the 6-Series GPUs as well but it is vastly improved here, nothing speeds up the learning curve like seeing what the generation before does in the real world. This isn’t to say that the 6-Series was bad, it was probably the best in class for its time, the 7-Series just takes things to a new level.
If you are parceling out tasks to multiple 7XT USCs, the Vertex Data Master will cluster related geometry data onto the same USC. This not only keeps more data local and available to the unit doing the work, it also groups memory requests. If you don’t understand the value of having lower numbers of larger related memory accesses vs many more small requests, often for the same pages, let’s just say it is a massive performance gain. More important than that, especially on the mobile side, is that memory requests are hugely energy intensive. Lowering their count and grouping them sanely not only takes a big chunk out of latency, it also saves tons of power. This is where a bunch of the non-ALU core gains come from in the 7XT.
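The value of grouping is easy to show with a toy example. The Python below is a sketch of the general idea only, not the Vertex Data Master's actual algorithm; the 4KB page size, the addresses, and the `group_by_page` helper are all assumptions made up for illustration.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def group_by_page(addresses):
    """Bucket individual memory requests by the page they fall in."""
    pages = defaultdict(list)
    for addr in addresses:
        pages[addr // PAGE_SIZE].append(addr)
    return dict(pages)


# Six small requests that bounce between two pages.
scattered = [0x1000, 0x9000, 0x1040, 0x9080, 0x1080, 0x9040]
grouped = group_by_page(scattered)

print(len(scattered), "small requests ->", len(grouped), "grouped requests")
```

Issued in the original order, every request lands on a different page than the one before it. Grouped, the same data comes back in one larger access per page.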
In the end the new 7-Series GPUs from Imagination look a lot like the 6-Series when viewed from afar. Once you dig into the details there are enough advances lurking under the hood to bring a claimed 60% performance increase on the 7XT and a 108% improvement on the 7XE. Once again Imagination kept the cluster count and clocks constant for each so the improvements look pretty real. Until actual customer silicon reaches the market with 7-Series GPUs under the lid, it will only be theoretical performance gains though. Luckily it won’t be too long before we see actual results and I have a sneaking suspicion that their claims will hold up quite well. S|A