LSI gave a talk on Next Generation MultiCore ARM Architectures in networking devices, and it laid out some very interesting reasoning behind the company’s architecture. While most of that reasoning is probably familiar to those in the embedded space, it offers some interesting perspectives to those of us who are not.
The talk is obviously centered around LSI’s Axxia line of chips/SoC/whatevertheyarenow, a line of multicore products that can be mixed and matched. You can read a bit about the architecture in this article, but things get weird from there. No specific part causes this; it is more the pieces and the mix and match nature of the line. LSI being open to 3rd party requests, not to mention allowing bespoke 3rd party IP in Axxia, doesn’t add simplicity to the mix either. That said, in the end, it is a really interesting family of, well, things.
A bit more detail on the spine of Axxia
If you look at the article linked above, the first thing you notice is that the previous picture had CPU cores while this one has ARM cores. That is because you can specify a PowerPC version or an ARM version in addition to the rest of the accelerators, caches, and anything else you want to put in. There are also two separate busses, though that isn’t really pictured here, and everything can be hard partitioned to whatever level of granularity you choose. Axxia is not a chip; it is a very, very diverse family.
Architectures that are unusual by PC standards are normally designed that way for a purpose, and LSI’s Axxia is no exception. It is designed for some diverse networking workloads, from high throughput to low latency, with lots of packet twiddling or not much at all. The company identifies three different categories of tasks that an Axxia chip can be used for, most likely all simultaneously, but not necessarily so. They are general purpose CPU tasks, I/O, and Acceleration and Dataplane work, all of which have radically different QoS needs for the tasks at hand.
General CPU work needs low memory latency because performance is cache miss dominated, just like on any mainstream CPU. I/O is more tolerant of memory latency, but has some hard realtime requirements that general purpose CPUs usually don’t. The dataplane side is much more latency tolerant but has bandwidth needs that dwarf the other two tasks. In some cases, you need to guarantee bandwidth as well.
The end result is pretty simple: the CPU needs highest priority memory and bus access, followed by I/O, and then dataplane jobs. This sounds simple, but if bandwidth drops too low, dataplane tasks are not happy. Bandwidth needs to be monitored closely to prevent problems, and priorities changed accordingly. I/O is similar; it can be a lower priority unless one of the hard deadlines is in danger, and then, well, it needs priority too. The simple rankings based on latency suddenly become a complex monitoring and priority juggling problem that has realtime consequences. This is all quite necessary due to the workloads involved, since carriers are not happy with dropped calls because of hardware bugs.
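The priority juggling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LSI’s arbiter: the Requester class, the field names, the 5 microsecond slack threshold, and the rule that a boosted I/O request beats a boosted dataplane request are all assumptions made for the example.

```python
from dataclasses import dataclass

# Static ranking from the talk: CPU first, then I/O, then dataplane.
# Lower number means higher priority.
BASE_PRIORITY = {"cpu": 0, "io": 1, "dataplane": 2}

# Hypothetical threshold: boost I/O when less than this much time
# remains before a hard deadline (microseconds).
SLACK_THRESHOLD_US = 5.0


@dataclass
class Requester:
    name: str
    kind: str                                # "cpu", "io", or "dataplane"
    min_bandwidth: float = 0.0               # assumed bandwidth guarantee
    observed_bandwidth: float = 0.0          # what the monitor measured
    deadline_slack_us: float = float("inf")  # time left before a deadline


def effective_priority(r: Requester) -> tuple:
    """Return a sort key; requesters in danger jump ahead of the CPU."""
    base = BASE_PRIORITY[r.kind]
    io_at_risk = r.kind == "io" and r.deadline_slack_us < SLACK_THRESHOLD_US
    dp_starved = r.kind == "dataplane" and r.observed_bandwidth < r.min_bandwidth
    boosted = -1 if (io_at_risk or dp_starved) else base
    # Ties between boosted requesters fall back to the static ranking.
    return (boosted, base)


def arbitrate(requesters: list) -> list:
    """Order requesters for memory and bus access, highest priority first."""
    return sorted(requesters, key=effective_priority)


if __name__ == "__main__":
    cpu = Requester("core0", "cpu")
    nic = Requester("nic", "io", deadline_slack_us=100.0)
    dp = Requester("pkt", "dataplane", min_bandwidth=10.0, observed_bandwidth=4.0)
    print([r.name for r in arbitrate([cpu, nic, dp])])
```

In the usage at the bottom, the dataplane requester is below its assumed bandwidth floor, so it is served ahead of the CPU until the monitor sees it recover. Real hardware would make this decision continuously and per transaction, which is where the realtime consequences come from.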
Typical server workloads vs networking chip workloads
Those workloads are quite different from what most people think of as a server. In the diagram above, on the left you have a traditional server with the task spectrum broken down into compute and other, with other being acceleration. On the right are the typical workloads facing an Axxia chip. As you can see, there are a lot more specific tasks, each with some very different priorities. Just to make the puzzle more complex, some of these tasks are much better suited to specialized hardware like DSPs, so some of those are added into the mix as well. And they have to be prioritized correctly too; that isn’t just a problem for the CPU cores.
Given that the Axxia line can have 32 ARM A15 cores, it might seem like the accelerators are redundant. If a general purpose CPU can do the job, custom hardware should be able to do the same thing faster with less power. The problem with using custom hardware is that it works by trading general purpose functionality for a very specific subset of tasks, and it isn’t very flexible.
For most of the jobs that Axxia is aimed at, the workloads change fairly rapidly, and the code that runs on them is changing almost as fast. What was being run when the device was deployed may not really resemble the running code a year later. Fixed function hardware for this type of workload is a nice idea, but in practice it doesn’t work out. Even specialized instructions don’t pay off; the code for even relatively simple things like MAC Scheduling is thousands of lines long, and it gets worse from there. Because of this, brute force tends to work out best for a lot of the jobs at hand, at least until the phone market becomes standardized. And the workloads don’t change. And… well, our advice is don’t hold your breath while waiting.
In the end, LSI is catering to a very diverse market called networking with Axxia, but not with a single next generation ARM device. Instead it is using a paradigm backed by a catalog of pieces to assemble before throwing them at a customer. LSI has a bag full of silicon Legos, plus a service to make your own specialized pieces, and it puts it all together. You don’t buy a speed grade, you buy a castle, a skyscraper, or spaceship, but not a small moon, all pre-assembled for your convenience. And that is how LSI sees the next generation of ARM devices, all connected with not just one but two busses for you to slice into partitions before serving customers. S|A