Those of you following ARM servers will know that the Cavium Thunder X2 is about as related to the Thunder X(1) as, well, there is really no relation. This one is an ex-Broadcom chip code named Vulcan, and it lives up to the name by breathing fire. More on that later. The original 48-core Thunder X was more or less the uncore of Cavium’s traditional MIPS-based offerings wrapped around custom ARM cores. There was a successor part also called Thunder X2, a 54-core CPU that was a next-gen Thunder X.
Then, as SemiAccurate exclusively told you, Broadcom unceremoniously dumped Vulcan for reasons only beancounters understand. A few months later, as we again exclusively told you, Cavium bought the Broadcom Vulcan project. This was a really good thing for Cavium because it meant they had a nearly finished, high end, fire-breathing ARM server CPU.
It also gave them a problem in that they had two distinct and utterly different ARM server architectures. Thunder X (TX) had lower performance cores coupled to an infrastructure meant for comms and high throughput workloads. Thunder X2 (TX2) was a high performance core with far fewer accelerators and other things, mainly aimed at HPC and compute heavy workloads. The non-Vulcan based pre-acquisition TX2 was more like TX.
But this launch was all about the Vulcan-based Thunder X2, and it is a really high performance part. How high? Cavium compared their parts to both Intel’s and AMD’s latest and greatest server CPUs and came out on top. This of course comes with the caveat of being vendor numbers at a product launch, but this time they are backed up fairly well by independent testing. Depending on which numbers you look at, whose compilers you use, and a host of other factors, the TX2 runs with the big boys on their own workloads.
So what is Cavium launching? 40+ SKUs of the Thunder X2, available now for real, that range from a ‘mere’ 16 cores to 32 cores, all dual socket capable. Specific pricing and TDPs weren’t listed for all 40 SKUs but the top end CN9980-2500LG4077-Y21-G lists for $1795 and has a 180W TDP while the smallest CN9960-1600LG4077-Y21-G can be yours for $800 and pulls 75W. Enjoy the eye chart below, especially the listed competition.
Just a few SKUs, 40 to be exact
We put the SKUs before the details for a reason this time, and that reason is that Cavium is confident enough to compare the Thunder X2 against the current Xeons, SKU for SKU, top to bottom. As the linked numbers show, these comparisons aren’t entirely out of line, especially considering that the Intel parts cost a large multiple of what the TX2s run. But raw numerical performance matters little if the rest of the chip is lacking, so how does Cavium stack up on features?
Pretty well actually. The cores support four threads each, leading to 128T/socket or 256T/system. Microsoft had a demo with Task Manager running on a 256T machine and it was a bit cluttered. That said, it worked, so if you have a highly threaded workload, Cavium is an interesting choice. Couple this with eight channels of DDR4-2667 with up to two DIMMs per channel, and you have more memory bandwidth than Intel’s best and a tie with AMD’s Epyc offerings. Either way things line up well on the memory front.
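For those who want to check the math, the thread counts and the memory comparison fall out of simple multiplication. The sketch below is ours, not Cavium’s: the function names are made up for illustration, the 8-byte channel width is the standard DDR4 data bus (ECC bits excluded), and the six channel and eight channel figures for Xeon and Epyc come from public specifications rather than from the launch material. All three are assumed to run at the same DDR4-2667 speed, and the results are theoretical peaks, not measured bandwidth.

```python
# Back-of-the-envelope check of thread counts and peak memory bandwidth.
# Core, thread, and TX2 channel counts come from the article. The Xeon (6)
# and Epyc (8) channel counts are assumptions taken from public specs.

def threads(cores, threads_per_core, sockets=1):
    """Total hardware threads for a given core count and socket count."""
    return cores * threads_per_core * sockets


def peak_mem_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak bandwidth per socket in GB/s (decimal units)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    print(threads(32, 4))             # 128 threads per socket
    print(threads(32, 4, sockets=2))  # 256 threads per system

    # Same DIMM speed assumed everywhere, so only channel count differs.
    for name, channels in (("Thunder X2", 8), ("Epyc", 8), ("Xeon", 6)):
        print(name, round(peak_mem_bandwidth_gbs(channels, 2667), 1), "GB/s")
```

With equal DIMM speeds the comparison reduces to channel count, which is why eight channels beat six and tie eight.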
Block diagram of Cavium’s Thunder X2
On the inter-socket link side, Cavium has a claimed 600Gbps link between the two sockets. If you look at the chart above, it shows 24x 28Gbps SERDES for 672Gbps of bandwidth before overhead. That is more than enough, somewhere between very competitive and overkill depending on whether that number is for uni-directional or bi-directional bandwidth. And yes, that link is coherent.
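The gap between the raw SERDES figure and the claimed figure is easy to quantify. This is a minimal sketch with hypothetical function names; reading the difference as encoding and protocol overhead is our assumption, since Cavium did not break the number down at launch.

```python
# Raw SERDES bandwidth versus the claimed inter-socket figure.
# Lane count, lane speed, and the 600Gbps claim come from the article.
# Treating the difference as overhead is an assumption on our part.

def raw_link_gbps(lanes, gbps_per_lane):
    """Aggregate raw bandwidth of all SERDES lanes, before any overhead."""
    return lanes * gbps_per_lane


def implied_efficiency(claimed_gbps, raw_gbps):
    """Fraction of raw bandwidth that the claimed figure represents."""
    return claimed_gbps / raw_gbps


if __name__ == "__main__":
    raw = raw_link_gbps(24, 28)
    print(raw)                                         # 672
    print(round(implied_efficiency(600, raw) * 100, 1), "% of raw")
```

In other words the claimed number sits at roughly nine tenths of the raw lane bandwidth.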
Sticking with the I/O, TX2 is a real SoC like AMD’s Epyc but unlike Intel’s Xeon. This matters for board complexity, layout, and of course how much the vendor charges for the chipset(s). Cavium includes 2x SATA-6 and 2x USB3 on die along with low speed I/Os, no external silicon necessary.
More interesting are the 56 lanes of PCIe3, less than AMD but more than Intel, and that is per socket. One very smart thing Cavium did is put in 14 PCIe controllers for a granularity of four lanes per controller if needed. You can do three x16 links plus one x8 link or split it up into 14 x4 links. Why is this important? Storage, think NVMe drives aplenty without having to rely on external hardware. Throw in the odd 25/40Gbps NIC and you have a nice storage server at a fraction of the cost of a dual Xeon rig. TX2 also supports SR-IOV for virtualization of peripherals.
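To see why the four lane granularity matters, here is a small sketch that checks whether a wish list of PCIe links fits into 14 controllers of four lanes each. It is a simplified model under our own assumptions: the names are hypothetical, and it assumes any adjacent controllers can be ganged into a wider link, which real silicon may well restrict.

```python
# Simplified PCIe bifurcation check: 14 controllers, 4 lanes each (56 lanes).
# Assumption: any controllers can be combined into x8 or x16 links. Real
# hardware probably limits which controllers can be ganged together.

LANES_PER_CONTROLLER = 4
CONTROLLERS = 14


def controllers_needed(link_widths):
    """Number of x4 controllers consumed by a list of link widths."""
    total = 0
    for width in link_widths:
        if width not in (4, 8, 16):
            raise ValueError(f"unsupported link width in this model: x{width}")
        total += width // LANES_PER_CONTROLLER
    return total


def fits(link_widths):
    """True if the requested links fit within the available controllers."""
    return controllers_needed(link_widths) <= CONTROLLERS


if __name__ == "__main__":
    print(fits([16, 16, 16, 8]))   # three x16 plus one x8 uses all 14
    print(fits([4] * 14))          # fourteen x4 links, e.g. NVMe drives
    print(fits([16, 16, 16, 16]))  # four x16 would need 16 controllers
```

A mixed layout such as one x16 NIC slot plus ten x4 NVMe drives also fits in this model, which is the storage server case described above.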
Getting back to the processor, you can see there are 32 four-thread ARMv8.1-A cores on a ring bus. Eight core stops mean four cores per stop, and there are four more stops for the two memory controllers, I/O, and inter-socket links. The ring is bi-directional but nothing more was given about it at launch.
On the cache side we have fairly standard 32KB L1 I/D caches backed by a 256KB L2 per core. The L3 is cut into two 2MB slices per stop for a total of 32MB or 1MB per core. Luckily those caches can be enabled even if the cores are disabled so those needing large caches but fewer cores can be serviced. This may sound like a niche but there are many workloads that need this kind of setup, and those tend to be fairly lucrative markets.
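The L3 numbers above, and the appeal of keeping the cache while disabling cores, can be worked through in a few lines. This is a sketch with made-up function names; the 16-core configuration with the full L3 enabled is a hypothetical example for illustration, not a specific SKU we can confirm.

```python
# L3 capacity from the ring layout: 8 core stops, two 2MB slices per stop.
# The 16-core, full-cache case below is a hypothetical configuration.

def l3_total_mb(core_stops=8, slices_per_stop=2, slice_mb=2):
    """Total L3 capacity in MB across all core stops."""
    return core_stops * slices_per_stop * slice_mb


def l3_per_active_core_mb(active_cores, total_mb):
    """L3 available per active core when the whole cache stays enabled."""
    return total_mb / active_cores


if __name__ == "__main__":
    total = l3_total_mb()
    print(total)                             # 32
    print(l3_per_active_core_mb(32, total))  # 1.0 MB per core, all cores on
    print(l3_per_active_core_mb(16, total))  # 2.0 MB per core, half the cores
```

Halving the active cores while keeping every slice lit doubles the cache per core, which is the setup those cache-hungry workloads are after.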
Since the TX2 is a modern server SoC, we need to take a look at one of the most underrated bits of server technology, RAS. Cavium claims “Server class RAS & Virtualization” along with modern power management features. SECDED ECC, SDDC, data poisoning, address parity protection, demand and patrol scrubbing, and failed DIMM identification are all part of the feature set. This is a fairly solid offering and probably won’t cause any buyers to turn away, but it does lack a little on the very high end of the RAS game versus Intel. Let’s just say it lacks nothing for mainstream buyers and is quite impressive for a first effort.
Without specifics it is hard to know if anything in the RAS area is a standout for good or bad, and the same holds true for power management. Cavium claims extensive power management with a separate system power controller, full DVFS, fine grained active power management, and power gating. On top of this the CPU is built on a TSMC 16nm FinFET process which makes playing in the same ballpark as Intel’s 14nm offerings all the more impressive. If Cavium can make a Thunder X3 on 10nm or even 7nm before Intel shrinks their Xeons, ARM servers may step ahead of Intel in raw performance.
In the end what do we have with the Thunder X2? We have an ARM server that can run with the big boys, Intel and AMD, on a socket level if not on a single thread level, and lacks little if anything on features. This isn’t to say single threaded performance is weak, it isn’t, but it is the single area that Intel can still shine at. For now. Cavium’s Thunder X2 is more than good enough for the overwhelming majority of the market.
Then there are the partners, and there are quite a few: Gigabyte, Cray, Inventec, Foxconn, HP, and ATOS among others. Throw in all the major Linux distributions, Red Hat, Ubuntu, and SUSE, plus FreeBSD and Microsoft, and you have more than enough selection to fill your needs. From reference designs and OCP compliant chassis to middleware and applications, Thunder X2 has a lot of very real support. This is why we started out by saying the launch last week was the first real server, but this time we won’t put the word real in single quotes. You probably understand why by now. S|A