TILERA MAKES CPUS that have lots of cores and a fairly unique multi-mesh interconnect fabric, and they live happily in embedded devices from AV equipment to Deep Packet Inspection (boo hiss) devices. Today, the company is entering a market that I never expected to see them in: cloud compute servers.
If you are not familiar with the Tilera architecture, read this piece; it is very unusual, to say the least. The idea behind the Tilera chips is that you chunk up workloads into pieces, be it chunks of data or code, and spread the work across multiple tiles. You can also section off the chip into sub-sections, each one being cache coherent with only the tiles working on the same task.
Meshes and cores abound
You could, for example, do a database sort by having one subset of cores send records to separate cores based on the first letter of a field. From there, the tiles each sort the rest of the data themselves, parceling the workload out across the chip. The other way of doing it is to split the workload into sub-tasks, and have each tile do a bit, then pass the data to the next tile which does the next bit. This allows you to keep the needed code always cache resident, avoiding memory delays. TCP/IP and video decode are two workloads that map to this paradigm well.
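To make the two schemes a little more concrete, here is a minimal Python sketch. It is not Tilera code and uses none of Tilera's tools: the function names `partitioned_sort` and `pipeline` are made up for illustration, and ordinary threads stand in for tiles. The first function routes records by first letter and sorts each bucket on its own; the second pushes every item through a chain of small stages.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def _sort_bucket(bucket):
    # Each "tile" sorts only the records it was handed.
    return sorted(bucket, key=str.lower)


def partitioned_sort(records, workers=4):
    """Data-parallel scheme: route by first letter, then sort buckets independently."""
    buckets = defaultdict(list)
    for rec in records:                       # the dispatching cores
        buckets[rec[:1].lower()].append(rec)
    keys = sorted(buckets)                    # buckets are already in letter order
    with ThreadPoolExecutor(max_workers=workers) as pool:  # threads stand in for tiles
        sorted_buckets = list(pool.map(_sort_bucket, (buckets[k] for k in keys)))
    return [rec for bucket in sorted_buckets for rec in bucket]


def pipeline(items, stages):
    """Pipeline scheme: every stage is a small fixed piece of code, data moves between them."""
    stream = iter(items)
    for stage in stages:
        stream = map(stage, stream)           # lazily chain stage after stage
    return list(stream)


if __name__ == "__main__":
    print(partitioned_sort(["pear", "apple", "Avocado", "banana"]))
    print(pipeline(["  Alice ", "BOB"], [str.strip, str.lower, len]))
```

On real hardware the point of the second scheme is that each stage's code stays resident in one tile's cache; the sketch only shows the shape of the data flow, not that benefit.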
That brings us back to today’s announcement, the Tilera S2Q server made by the ODM Quanta. The box is interesting, the tools are interesting, it is just that the intended workload has me scratching my head. Then again, this is ‘The Cloud’, a realm where nothing has rules, servers are bought by the tens of thousands, and performance per watt on a customer specific workload dictates almost everything.
To complicate matters, peak performance doesn’t matter to the majority of customers, and fastest is not always best either. Any customer buying 1000+ servers a month is going to have workloads that thread well, otherwise they would be hoarding P4 chips and liquid helium. Because of that, if a chip does a given workload ‘fast enough’, and that is a rather case specific and nebulous term, then it is good enough to pass the initial screening.
What matters to cloud compute hardware customers? TCO or Total Cost of Ownership. If the hardware runs a workload ‘fast enough’, then it is down to the cost of running that workload X times, plus the cost of power, plus the cost of data center space for it, plus the maintenance costs, plus whatever else you need to make the boxes run for the X years you have until they are depreciated. Peak performance, SPEC scores, and just about anything else are not really relevant “in the cloud”.
Once a customer has calculated the cost to run their entire workload on various platforms, that decides the architecture they will buy. If that is an IBM Power 7, Xeon, Opteron, Arm, Atom, 6502, FPGA, or even a Tilera chip, then that is what is purchased. Cloud purchase requirements are 1) TCO 2) TCO 3) TCO 4) The company your cousin started that bought you a really nice car and gave you options under the table 5) TCO.
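As a rough illustration of that sum, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the function names, the parameter names, and the idea that space and maintenance are billed per server per year. No real prices are implied.

```python
import math


def servers_needed(required_throughput, throughput_per_server):
    """How many 'fast enough' boxes it takes to cover the whole workload."""
    return math.ceil(required_throughput / throughput_per_server)


def tco(servers, unit_price, watts_per_server, years,
        price_per_kwh, space_per_server_year, maintenance_per_server_year):
    """Hardware + power + space + maintenance over the depreciation period."""
    hours = years * 365 * 24
    energy_kwh = servers * watts_per_server / 1000 * hours
    hardware = servers * unit_price
    power = energy_kwh * price_per_kwh
    space = servers * space_per_server_year * years
    maintenance = servers * maintenance_per_server_year * years
    return hardware + power + space + maintenance


if __name__ == "__main__":
    # Hypothetical numbers, purely to show the calculation.
    n = servers_needed(required_throughput=10_000, throughput_per_server=300)
    print(n, tco(n, unit_price=5000, watts_per_server=250, years=3,
                 price_per_kwh=0.10, space_per_server_year=200,
                 maintenance_per_server_year=100))
```

The comparison a buyer cares about is this number computed once per candidate platform, with each platform's own server count plugged in.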
That is where Tilera starts to get interesting. Tilera has lots of cores, lots of smaller cores that don’t do single threaded tasks with nearly the aplomb that a Xeon does, but can do a lot more of them at once. They are closer to a GPU than a CPU in many ways, but have a lot more programmability. You also can shuttle data and code between tiles with far less trouble than you can on an x86 part. It exemplifies the 1024 chickens model of computing.
So why would any sane person pick something like a Tilera CPU over a Xeon or Opteron core that does well on both single threaded and high throughput loads? Easy: power, density, and ‘good enough’. The Quanta S2Q is a 2U box with four ‘modules’, each of which has two 64-core TILEPro CPUs and 16 DIMMs on a hot plug card. The front of the box has 24 2.5″ drives, six per node.
One node, four fit in a 2U server
If you recall, Tilera CPUs are fairly close to an SoC that has a lot of bandwidth on and off chip, so the S2Q board is basically a big breakout box for all of that I/O. They have 4 GigE ports, 3 10 GigE ports, 2 management ports, and 2 serial management ports. Sadly, there is no parallel port, so you may have to throw out that Laserjet 4. Yes, we know it still works fine.
Where things get interesting is that each board consumes a claimed 50W max, so a server loaded with drives would be in the 200-300W range. Compared to the benchmark Xeon machine, that should be a lot less power per 2U server.
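The density arithmetic behind those figures is easy to sketch in Python. The constants come straight from the S2Q description above (four nodes, two 64-core TILEPro CPUs per node, a claimed 50W per board, 200-300W per loaded server); the names are made up for this example, and cores per watt is only a crude yardstick, not a performance claim.

```python
NODES_PER_CHASSIS = 4      # four modules per 2U box
CPUS_PER_NODE = 2          # two TILEPro CPUs per module
CORES_PER_CPU = 64         # 64 cores per TILEPro
NODE_MAX_WATTS = 50        # claimed maximum per board


def chassis_cores():
    """Total cores in one 2U S2Q."""
    return NODES_PER_CHASSIS * CPUS_PER_NODE * CORES_PER_CPU


def boards_watts():
    """Claimed maximum for the four boards alone, before drives and overhead."""
    return NODES_PER_CHASSIS * NODE_MAX_WATTS


def cores_per_watt(total_watts):
    """Cores per watt for a given whole-server power draw."""
    return chassis_cores() / total_watts


if __name__ == "__main__":
    print(chassis_cores(), "cores,", boards_watts(), "W for the boards")
    for watts in (200, 300):   # the estimated range for a loaded server
        print(watts, "W:", round(cores_per_watt(watts), 2), "cores per watt")
```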
Much like GPGPU, if your workload fits the Tilera architecture, it should absolutely fly on the S2Q. If your app doesn’t fit the Tilera architecture, even the upcoming 100 core Tile-Gx chips won’t make a bit of difference; it won’t run fast. Like SeaMicro, GPGPUs, and a host of others, the Tilera Quanta S2Q is a fairly niche specific machine, but all those niches are not the same.
The next problem is a bit easier to deal with, software. The Achilles heel of most ‘non-standard’ architectures is software and programming. Using GPUs for compute was a fairly painful and non-portable coding adventure until the recent advent of OpenCL, and that stymied growth. Tilera saw this problem long before they had silicon, and haven’t stopped working on software.
For now, you can get a full LAMP stack running the Linux 2.6.26 kernel, Java support, ANSI C/C++ from both the GNU tool chain and Eclipse, and the range of standard tools that go with both of those. They also provide multicore aware profilers and debuggers.
In addition to the LAMP stack, you can use MySQL, Hadoop, Gallery2, SugarCRM and more out of the box, or roll your own. There is a full “commercial Linux distribution” coming soon, but Tilera would not say which one. On top of that, the S2Q also supports virtualization with KVM. 100 cores can probably handle quite a few discrete VMs if memory use isn’t all too onerous, and I/O bandwidth is unlikely to be a bottleneck.
One thing Tilera says they spent a lot of time optimizing is the set of low level libraries and tools that make Linux work. Recompiling the kernel and basic tool set won’t get you far on a Tilera chip; the differences between it and a monolithic core architecture are too substantial. With some minor TLC, you can take advantage of a lot of the Tilera features, and that is where the company seems to have put a lot of its time and energy.
The one benchmark that Tilera puts out there is CoreMark. On that, an 866MHz TILEPro core scores about 2100, a 1.25GHz Tile-Gx core hits about 3200, a 1GHz ARM Cortex-A9 hits about 2900, and an Atom N270 at 1.6GHz scores 3000. All tests were compiled with GCC. Basically, each TILEPro core is in the range of an Atom or A9, not a Xeon.
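Since those four parts run at very different clocks, one quick way to compare them is to normalize to CoreMark per MHz. The short Python sketch below does only that, using the approximate scores and clocks quoted above; the dictionary and function names are invented for the example, and the results inherit all the fuzziness of "about".

```python
# (clock in MHz, approximate CoreMark score), as quoted in the text
RESULTS = {
    "TILEPro": (866, 2100),
    "Tile-Gx": (1250, 3200),
    "ARM Cortex-A9": (1000, 2900),
    "Atom N270": (1600, 3000),
}


def coremark_per_mhz(results):
    """Normalize each score by its clock so the parts can be compared directly."""
    return {name: score / mhz for name, (mhz, score) in results.items()}


if __name__ == "__main__":
    per_mhz = coremark_per_mhz(RESULTS)
    for name, value in sorted(per_mhz.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {value:.2f} CoreMark/MHz")
```

Per clock, the Tilera cores land between the Atom and the A9, which fits the "in the range of an Atom or A9" reading.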
Next year, the new Tile-Gx cores should be out, and they will up the core count from 64 to 100 per socket, 800 per S2Q. The shrink from 90nm to 40nm should also reduce power consumption by quite a bit, mitigating any additions from higher core counts. In 2013, Tilera has a new generation code named Stratton coming, upping core counts from 100 to 200+. If the Tilera architecture works for you, that should be a pretty compelling upgrade path.
In the end, the MHz, core counts, and benchmarks don’t mean much to potential customers who buy servers by the hundreds; only two things do. First, does the application they are running fit the Tilera paradigm, or the other way around? Second, TCO. Third, TCO. Fourth, TCO. You get the idea. It will be interesting to see who the customers are for the Tilera S2Q, if they are ever allowed to mention it in public. S|A