Today ARM is announcing a new interconnect called CCI-550, a new GPU called Mimir, and a new memory controller called DMC-500. If that isn’t enough, well that will have to do because those three are all SemiAccurate was told about this time around.
If you are not familiar with the ARM interconnect families there are three, the low-end CoreLink NIC, the mainstream CCI- family, and the server oriented CCN- lineup. The main differences are that CCI- is a coherent crossbar architecture aimed at the high-end consumer space, phones, tablets, and other things with single digit core counts. NIC is not coherent and is more embedded and device oriented, think one core or maybe two. CCN- is the highest end, ring based, coherent architecture that scales to tens of cores. We talked about the CCN- family here and here.
Getting back to today’s announcement, it may sound quite a bit like the last one, the CCI-500 line. SemiAccurate went into detail about that here and if you read it you might notice ARM also announced a new GPU and new cores at the same time. This time we get a memory controller instead of a core but we still get a new GPU code name to set the forums atwitter. And yes, I am as surprised as you are that atwitter was in the LibreOffice spell checker, who knew?
So let’s dig into the details. For those of you familiar with the modern ARM interconnect families, the CCI-550 will look a lot like the CCI-500 it replaces. As is the norm with this type of announcement, the -550 scales higher, has more features, and does it all with less power on the same process. That said, from the diagram view it really looks quite similar to the older version, see?
The block diagram for the new CCI-550 interconnect
The first big change is the ACE interface count, it goes from four to six fully coherent ports. Since each ACE can support a cluster of up to four ARM cores, this means we can see up to 24 core cell phone chips. For a market that struggles to use more than two cores effectively but sells 10 core SoCs, this is overkill. Luckily the core count wars are rapidly dying out and better yet there is actually a good reason other than more cores to up this interface count.
That reason is to have fully coherent GPUs, and technically other devices, but GPUs are the big one. Things like HSA and other forms of GPU or asynchronous compute are made much easier by coherence so now that tech is trickling down to the cell phone world too. What started in bleeding edge multi-socket servers is now the stuff of cell phones in just a few years’ time.
It all sounds good until you realize that ARM does not actually have a coherent GPU, the Mali-T8xxx line is not coherent. This would be a bit of a problem if ARM wasn’t also announcing their new GPU family called Mimir. This new GPU, presumably to be called the Mali-T9xxx line, is, can you guess? Yes, Mimir is coherent. One would presume that the crop of GPUs from Qualcomm and Imagination that, like Mimir, will hit silicon late next year will be coherent too.
Getting back to the CCI-550 family there are some other macro changes too. First you can have up to seven slave ports now and up to six memory ports. Conveniently there is also a new memory controller called the DMC-500 to place on said six memory ports. DMC runs… nope, not going to go there, start again. DMC-500’s main new feature is LPDDR4 support with speeds up to 4267MHz which likely means support for LPDDR4x.
ARM is claiming a 27% increase in bandwidth utilization for the new controller, a nice trick if you don’t blow out the power budget doing so. Better yet the DMC-500 supports end-to-end QoS when on the CCI-550 for a claimed 25% reduction in latency for the requesting CPU. In the end ARM says that the new DMC-500 has a 60% bandwidth increase over the older model, but that is comparing four to six channels. On a four to four channel comparison, that means a <10% efficiency gain, roughly 7% if bandwidth scales linearly with channel count, which is still very good considering how hard it is to get anything more from a relatively mature controller design.
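The four to four channel math is easy to check. Below is a minimal sketch of that arithmetic, assuming bandwidth scales linearly with channel count, which is our simplification and not something ARM stated. The function name is made up for illustration.

```python
def per_channel_gain(total_gain: float, old_channels: int, new_channels: int) -> float:
    """Efficiency gain left over once extra channels are factored out.

    Assumes bandwidth scales linearly with channel count (an assumption,
    not an ARM figure).
    """
    channel_scaling = new_channels / old_channels       # 6 / 4 = 1.5x from channels alone
    return (1.0 + total_gain) / channel_scaling - 1.0   # what remains is per-channel gain


if __name__ == "__main__":
    # The claimed 60% total gain, going from four channels to six.
    gain = per_channel_gain(0.60, old_channels=4, new_channels=6)
    print(f"per-channel gain: {gain:.1%}")
```

With the claimed 60% and a 1.5x channel increase the remainder is 1.6 / 1.5, or a bit under 7% per channel.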
Once again going back to the CCI-550 interconnect, there is one more really big item to talk about, a snoop filter. See the comment above about multi-socket servers and move it forward a few years, that too is now in cell phones. Snoop filters are probably familiar to most SemiAccurate readers but if not, a snoop filter effectively filters out snoop traffic so requests are only made to places where the data could be.
This prevents broadcast traffic that can overwhelm busses in very short order because the traffic grows quadratically with target count. Since CCI-550 raises the potential core count from 16 to 24, things get ugly fast. Of likely more importance to ARM and its customers is that each of these snoops takes energy to do. Fewer snoops means less power pulled from the battery, a cooler chip, and all the other good things you get by using less wattage. In short it is more complex but saves massive bandwidth and energy.
A snoop filter has upsides and downsides, and one of the downsides is that cores that are snoopable have to be awake to service the snoops. This is a bit at odds with the prevailing idea behind most modern mobile CPUs which is to turn everything off as soon as possible to save energy. Off is unsnoopable which kind of breaks the whole coherency paradigm. You could keep the caches and some logic awake while sleeping the rest of the core like some higher end CPUs do, but that also has tradeoffs that are less amenable to the power and usage levels ARM cores play in so that isn’t done on CCI-550.
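To put rough numbers on the quadratic growth, here is a toy model, not a description of how the CCI-550 actually does it. It assumes every agent misses once and, without a filter, snoops every other agent. The SnoopFilter class is a hypothetical, idealized directory with unlimited capacity; a real filter is finite and may track sharers less precisely.

```python
from collections import defaultdict


def broadcast_snoops(agents: int) -> int:
    """Snoops sent if each agent misses once and snoops all the others."""
    return agents * (agents - 1)


class SnoopFilter:
    """Idealized directory: remembers which agents may hold each cache line."""

    def __init__(self) -> None:
        self._sharers = defaultdict(set)  # line address -> set of agent ids

    def record_fill(self, agent: int, line: int) -> None:
        self._sharers[line].add(agent)

    def record_evict(self, agent: int, line: int) -> None:
        self._sharers[line].discard(agent)
        if not self._sharers[line]:
            del self._sharers[line]  # nobody holds it, stop tracking

    def snoop_targets(self, requester: int, line: int) -> set:
        """Only agents that could hold the line get snooped."""
        return self._sharers.get(line, set()) - {requester}


if __name__ == "__main__":
    print(broadcast_snoops(16), broadcast_snoops(24))
    sf = SnoopFilter()
    sf.record_fill(agent=3, line=0x1000)
    print(sf.snoop_targets(requester=0, line=0x1000))  # one target instead of 23
    print(sf.snoop_targets(requester=0, line=0x2000))  # nobody to bother
```

In this model going from 16 to 24 agents takes the broadcast case from 240 to 552 snoops, while the filtered case only ever touches agents that might have the line.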
Instead there are a few clever tricks that ARM implemented starting with the ability for blocks to enter and exit coherency domains on the fly. This is going to be used on GPUs more than CPUs because if the screen is off, the GPU is probably not needed. ARM implemented a signaling mechanism so that individual blocks can safely exit a coherency domain, presumably after flushing applicable cache lines, and power down without taking the entire coherency domain with it. Unintentionally that is.
There are also handshakes but for the most part, the heavy lifting is done by the in/out signaling. Since ARM elected not to decouple caches from the rest of the core, this also means that cores have a minimum frequency floor they can run at. Below this speed a core can’t service a snoop request in the necessary time, so that caps the lowest end before it has to leave the domain. This is not the ideal scenario but it seems the tradeoff was better than decoupled caches for whatever reasons ARM evaluated.
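The enter/exit flow can be pictured as a small state machine. What follows is only a sketch under our own assumptions: the state names, the CoherentBlock class, and the flush-then-acknowledge ordering are all hypothetical, and ARM did not detail the actual signals.

```python
from enum import Enum, auto


class CoherencyState(Enum):
    IN_DOMAIN = auto()       # snoopable, taking part in coherency
    EXIT_REQUESTED = auto()  # flushed, waiting on the interconnect
    OUT_OF_DOMAIN = auto()   # no longer snooped, safe to power down
    POWERED_DOWN = auto()


class CoherentBlock:
    """Hypothetical model of a CPU cluster or GPU on a coherent port."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = CoherencyState.IN_DOMAIN
        self.dirty_lines = set()

    def _expect(self, state: CoherencyState) -> None:
        if self.state is not state:
            raise RuntimeError(f"{self.name}: {self.state.name}, expected {state.name}")

    def request_exit(self) -> int:
        """Write back dirty lines, then signal the exit. Returns lines flushed."""
        self._expect(CoherencyState.IN_DOMAIN)
        flushed = len(self.dirty_lines)
        self.dirty_lines.clear()
        self.state = CoherencyState.EXIT_REQUESTED
        return flushed

    def acknowledge_exit(self) -> None:
        """Interconnect side of the handshake: stop sending snoops here."""
        self._expect(CoherencyState.EXIT_REQUESTED)
        self.state = CoherencyState.OUT_OF_DOMAIN

    def power_down(self) -> None:
        self._expect(CoherencyState.OUT_OF_DOMAIN)  # never power off while snoopable
        self.state = CoherencyState.POWERED_DOWN

    def power_up_and_enter(self) -> None:
        self._expect(CoherencyState.POWERED_DOWN)
        self.state = CoherencyState.IN_DOMAIN

    def is_snoopable(self) -> bool:
        return self.state in (CoherencyState.IN_DOMAIN, CoherencyState.EXIT_REQUESTED)
```

The point of the ordering is that a block can never be powered off while the interconnect still believes it can answer snoops, which is the unintentional domain takedown the signaling is there to prevent.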
Snoop filters take up die area and power themselves, both of which designers try to minimize. To aid in optimizing that, the new filters in the CCI-550 are configurable by the cache sizes covered and the memory spaces supported by the SoC. You can make it as big or small as you want but performance is obviously directly affected by these choices.
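To see why both knobs matter for area, here is a rough sizing sketch. It assumes a generic set-associative tag directory with one presence bit per coherent port and a valid bit per entry. The 64-byte line, the 8 ways, and the function name are illustrative guesses and not CCI-550 parameters, as ARM did not describe the filter’s internal organization.

```python
def snoop_filter_bits(cache_bytes_covered: int, phys_addr_bits: int, ports: int,
                      line_bytes: int = 64, ways: int = 8) -> int:
    """Storage estimate, in bits, for a hypothetical tag-directory snoop filter."""
    entries = cache_bytes_covered // line_bytes   # one entry per cache line covered
    sets = max(1, entries // ways)
    offset_bits = (line_bytes - 1).bit_length()   # ceil(log2(line size))
    index_bits = (sets - 1).bit_length()          # ceil(log2(number of sets))
    tag_bits = phys_addr_bits - offset_bits - index_bits
    bits_per_entry = tag_bits + ports + 1         # tag + presence vector + valid bit
    return entries * bits_per_entry


if __name__ == "__main__":
    MiB = 1024 * 1024
    for cache, pa in [(4 * MiB, 40), (4 * MiB, 32), (2 * MiB, 40)]:
        kib = snoop_filter_bits(cache, pa, ports=6) / 8 / 1024
        print(f"{cache // MiB} MiB covered, {pa}-bit addresses: {kib:.0f} KiB")
```

Covering less cache or supporting a smaller address space both shrink the estimate, which is the tradeoff the configurability exposes: less area and power against more lines that fall outside the filter.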
Other technical tidbits are that the interconnect has a full QoS setup as referenced earlier, and it goes beyond the DMC-500. This increases bus bandwidth by a claimed 60% over the CCI-500 and drops latency by 20% too. Add in all the other efficiency gains and you save a lot of power for a faster bus, what’s not to like? If you said bus frequency scaling, it can do that too.
So in the end the CCI-550 replaces the older CCI-500. It scales higher, uses less power, has lower latency, and supports more coherent ports. With the ability to add coherent GPUs, ARM introduced their new Mimir coherent GPU, what timing eh? On top of that there is the new DMC-500 memory controller that can speak the newly introduced CCI-550 signaling language, plus you can now add six of them. When the next generation of ARM SoCs is introduced next year, you can probably guess what they will look like. S|A