AMD IS FINALLY starting to talk about its Fusion CPUs, specifically the first one called Llano. The bad news is that it is not saying very much, but there are some interesting bits that leaked out at ISSCC 2010 in San Francisco.
All the questions that people wanted answered (how the GPU is integrated, how many shaders it has, performance estimates, and similar metrics) sadly were not disclosed. ISSCC is about the circuits themselves, how they are made, and why the techniques are important.
At a talk entitled “An x86-64 Core Implemented in 32nm SOI CMOS”, AMD’s R Jotwani answered a lot of questions that started with how, but very few that began with what. Luckily, those are some of the more important bits that rarely get covered in a mainstream CPU launch.
Some of the ‘what’ questions were answered, like the fact that Llano uses a mildly tweaked version of the current K10h core found in AMD’s Shanghai and Istanbul CPUs. The initial variant has four cores and adds a GPU to the mix. The GPU is based on the current ‘Evergreen’ DX11 cores, but how many shaders there are is an open question right now.
One question we asked AMD representatives at ISSCC was about voltage domains. The GPU on the die has its own voltage and clock domains so it is very unlikely to be running at the full 3+GHz clocks of the CPU core. Since the shaders are the same as are found in the current market leading GPUs but are built on a brand new process, and are integrated into the core, the current 800-900MHz range for ATI GPUs does not mean very much.
That brings up another question, the socket. Looking at the slides leaked by AMD, the chipsets for the higher end Zambezi CPU carry over through 2011. Llano on the other hand goes from an ATI RS880 northbridge and a SB8xx southbridge to just a “Hudson-D” series southbridge. Note the lack of any northbridge. This means that Zambezi likely uses the rumored AM3R2 socket, but Llano will almost assuredly use a new socket.
The core itself is changed a bit, but if you are familiar with the current 45nm K10h parts, you will feel right at home. AMD upped the L2 cache to 1MB per core, up from the current 512K, but it maintains the current 16-way associativity. The instruction window is enlarged to 84 entries so things should be a bit more efficient, and the instruction scheduler is now 30 entries for Integer, 36 for FP.
Hardware integer divide is said to be improved and latency for FP instructions has been reduced as well. To fill these windows, there is a better prefetcher, cache lines transition between states faster, and memory fill speed is increased. The TLB is also improved for better residency. Although these little details may not seem like all that much, a percent or three here and there adds up to a quite noticeable improvement overall.
A Llano core (Picture courtesy of the ISSCC)
The pictures released by AMD show only a single Llano core, not the entire chip, nor do they have any of the uncore, unless you count the L2 as uncore. It is almost as if it doesn’t want the interesting parts out yet. That said, each core is more than 35 million transistors and occupies 9.69mm^2, and 110 million transistors and 17.7mm^2 if you count the L2 and power gating ring.
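As a rough sanity check on those figures, the density of the bare core can be compared with the density of everything around it. This is a back-of-the-envelope sketch only: it treats "more than 35 million" as exactly 35 million, and the variable names are ours, not AMD's.

```python
# Figures quoted in the article for a single Llano core.
CORE_TRANSISTORS = 35e6        # "more than 35 million", so treat as a lower bound
CORE_AREA_MM2 = 9.69
TOTAL_TRANSISTORS = 110e6      # core + L2 + power gating ring
TOTAL_AREA_MM2 = 17.7


def density_per_mm2(transistors: float, area_mm2: float) -> float:
    """Return transistors per square millimetre."""
    return transistors / area_mm2


core_density = density_per_mm2(CORE_TRANSISTORS, CORE_AREA_MM2)
total_density = density_per_mm2(TOTAL_TRANSISTORS, TOTAL_AREA_MM2)
# Everything outside the bare core: the L2 and the power gating ring.
rest_density = density_per_mm2(
    TOTAL_TRANSISTORS - CORE_TRANSISTORS,
    TOTAL_AREA_MM2 - CORE_AREA_MM2,
)

print(f"core only:    {core_density / 1e6:.2f} M transistors/mm^2")
print(f"core + L2:    {total_density / 1e6:.2f} M transistors/mm^2")
print(f"L2 + PG ring: {rest_density / 1e6:.2f} M transistors/mm^2")
```

The L2 region comes out well over twice as dense as the logic, which is what you would expect from regular SRAM arrays compared to hand-placed core logic.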
Llano is built on AMD’s new 32nm High-K Metal Gate (HKMG) Silicon On Insulator (SOI) process, and uses 11 metal layers, the same as Shanghai and Istanbul. The only change is that Metal 3 was reduced in pitch, and a lower K dielectric was used. On the silicon side, AMD is using dual strain liners, eSiGe, and some long-channel transistors to increase performance. The process also uses its second generation of immersion tools to draw the pretty lines on the wafers, think underwater basket weaving on a sub-micron scale with multi-million dollar tools.
Power use is probably the overriding factor in modern chips, and AMD made a lot of changes to Llano to reduce power draw. It officially cites three main architectural changes – core power gating, digital APM (power management), and a clock grid redesigned for reduced power use. On a more granular level, SOI brought some major changes to the circuits themselves.
One of the most changed circuits is something called a Delayed-onset Keeper. This circuit was necessitated by changes in the electrical characteristics of the 32nm HKMG SOI transistors themselves, since the ‘old way’ would not work all that well on the new process. The Delayed-onset Keeper improves slack and lessens leakage, but how it works is beyond the scope of this article.
Another big circuit change is in the L1 cache cell, which moves from a double-pumped 6T design to an 8T design. Mirroring some of the changes that Intel made from the 90nm to 65nm P4s, AMD is trading off a smaller and more complex design for a larger but much more robust one.
The current cache dates back to the K8, and is double pumped to allow two loads or stores per cycle. While this method works, it is complex and after six or seven years, has become a bit limiting. The solution trades complexity for 33 percent more transistors and the area they consume. Since the L1 is only 128K (64K Data, 64K Instruction), it is noticeable but hardly blows out the die budget. You can see the size on the die shot above.
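The 33 percent figure falls straight out of the cell change, and the absolute cost is easy to estimate. The sketch below is an approximation under stated assumptions: it counts only the storage bits of the 128K L1, ignores tags, parity, and peripheral circuits, and assumes both the data and instruction arrays move to 8T cells.

```python
L1_BYTES = 128 * 1024          # 64K data + 64K instruction, per the article
CORE_TRANSISTORS = 35e6        # quoted lower bound for the bare core


def array_transistors(size_bytes: int, transistors_per_cell: int) -> int:
    """Transistors in the storage cells alone (one cell per bit)."""
    return size_bytes * 8 * transistors_per_cell


six_t = array_transistors(L1_BYTES, 6)
eight_t = array_transistors(L1_BYTES, 8)
extra = eight_t - six_t

print(f"6T array: {six_t:,} transistors")
print(f"8T array: {eight_t:,} transistors")
print(f"increase: {extra / six_t:.1%}")
print(f"extra cells vs. 35M core: {extra / CORE_TRANSISTORS:.1%}")
```

Under these assumptions the move costs roughly two million extra transistors, a single-digit percentage of the core, which fits the "noticeable but hardly blows out the die budget" description.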
Latency does not change, it is still 3 cycles, but the changes should allow the chip to scale to much higher clocks. Since this L1 architecture will likely live on for years to come, the change is almost mandatory to avoid a nasty ceiling on future clock scaling. This was necessary for future cores, Bulldozer in particular.
On the architectural level, the biggest change is called Core Power Gating, something Intel introduced in Nehalem. The idea is simple: even when a transistor is off, as long as power can get to it, some of that power will get around the gate and be lost. This is conventionally known as leakage, and has become one of the most troubling problems in modern chip building.
As process geometry shrinks, the silicon gates get smaller, and more electrons get through them. High-K Metal Gates improve this, but don’t stop leakage entirely. Until you stop electricity from getting to the transistors, they will always leak a little. A few hundred million ‘littles’ add up to light bulb territory for lost power and heat generated, something first postulated by Fermi if we recall correctly.
Intel and AMD have come up with a solution. They put a ring of transistors around the core itself; it is the black border labeled PG ring in the picture above. When a CPU goes into the new C6 sleep state, all internal data is saved to off-core DRAM, and the core is powered down completely.
It does not run slowly, the power gates turn off power to it entirely, and then those 110 million transistors stop leaking. This can be a huge power savings, AMD claims a 10-fold reduction in core leakage.
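To get a feel for what a 10-fold cut is worth, here is a minimal sketch of average leakage as a function of how long a core sits gated. It assumes the 10-fold figure applies while the core is in C6, and the 5 W ungated leakage number is purely hypothetical; AMD gave no absolute figures.

```python
def average_leakage(active_leak_watts: float, gated_fraction: float,
                    reduction: float = 10.0) -> float:
    """Time-averaged leakage for a core that is power gated part of the time.

    gated_fraction is the share of time spent gated (0.0 to 1.0);
    reduction is the claimed factor by which gating cuts leakage.
    """
    if not 0.0 <= gated_fraction <= 1.0:
        raise ValueError("gated_fraction must be between 0 and 1")
    gated_leak = active_leak_watts / reduction
    return (active_leak_watts * (1.0 - gated_fraction)
            + gated_leak * gated_fraction)


HYPOTHETICAL_LEAK_WATTS = 5.0  # illustrative only, not an AMD number

for fraction in (0.0, 0.5, 0.9):
    watts = average_leakage(HYPOTHETICAL_LEAK_WATTS, fraction)
    print(f"gated {fraction:.0%} of the time: {watts:.2f} W average leakage")
```

The savings scale with idle time, which is why this matters most on mostly idle multi-core parts.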
How it was done is pretty interesting as well. SOI is known to be better at preventing some kinds of leakage, and in this case, it is a huge advantage. AMD can use NFETs for the ring instead of the larger PFETs. Since the ring is 1.38 Million transistors per core, smaller is a good thing.
Power Gate edge (Picture courtesy of the ISSCC)
One interesting bit is the ring is not evenly shaped. You would expect that the ring would be between the bumps supplying power to the core and uncore. Drawing a box around the core would not give it enough area for the power bumps needed to feed power to the core. AMD was essentially pad/bump limited if it didn’t want to overload the bumps.
To fix this, the ‘ring’ was changed from a rectangle to a rectangle with a sawtooth edge. That sawtooth allowed the designers to fit 50 percent more bumps under the ring, giving them the desired safety margin on the bumps. If the numbers were run correctly, AMD will never have an Nvidia-esque “end user usage pattern” moment.
The next big one is digital power management (APM), basically being able to read how much power the CPU is using on a time scale that is less than “thermal time frames”. That is the politically correct way of saying that it will catch power spikes before the magic smoke that makes circuits work is set free.
Digital APMs were chosen because digital monitoring can provide accuracy within 2% while sampling about 100 signals. The other methods, amperage and temperature monitoring, are far less precise. Llano digitally samples 95 separate signals and achieves better than 98% accuracy.
Temperature measurements are environmentally sensitive while not being repeatable and reliable. Ammeters are better but are still temperature sensitive, and vary on a part by part basis requiring individual calibration. Digital APMs were the only sane course and allow for ‘turbo’ functionality should AMD choose to implement that on a future core.
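AMD did not say how the 95 signals are turned into a power number. One plausible way to do it, sketched below strictly as an assumption, is a linear model: idle power plus a weighted sum of per-signal activity. The weights, the idle figure, and the function names are all hypothetical.

```python
import random

NUM_SIGNALS = 95  # number of sampled signals quoted in the article


def estimate_power(activity, weights, idle_watts):
    """Hypothetical linear model: idle power plus weighted signal activity."""
    if len(activity) != len(weights):
        raise ValueError("need exactly one weight per sampled signal")
    return idle_watts + sum(a * w for a, w in zip(activity, weights))


def within_tolerance(estimate, measured, tolerance=0.02):
    """True if the estimate is within the quoted 2% of a reference value."""
    return abs(estimate - measured) <= tolerance * measured


rng = random.Random(0)
# Made-up watts contributed by each signal at full activity.
weights = [rng.uniform(0.01, 0.2) for _ in range(NUM_SIGNALS)]
# Made-up fraction of cycles each signal was active in this sample window.
activity = [rng.random() for _ in range(NUM_SIGNALS)]

estimate = estimate_power(activity, weights, idle_watts=2.0)
print(f"estimated core power: {estimate:.2f} W")
```

The appeal of a scheme like this is that it is pure counting and arithmetic, so it is repeatable from part to part and fast enough to react well inside a thermal time frame.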
Finally, we have a clock distribution network that is designed to minimize power use. The clock grid is literally a grid of wires that bring the timing clock pulses to circuits that require them. There are thousands of these across a core, so the grid needs to touch almost all of the core.
Llano clock grid before and after (Picture courtesy of the ISSCC)
Driving this many high speed signals precisely burns a lot of power, so AMD rearranged the Llano core to cluster things that needed clock signals. The end result is a depopulated grid that barely resembles a grid. The transistors lost massively decrease the clock power used, with a claimed 84 percent drop in clock spine switching power, vastly fewer end clock buffers, and a total of 54 percent less power used for the clocks.
On top of the reduction in clock buffer count, AMD also gated them in a much finer way than ever before. At MaxPower, Llano has 32 percent of the clocks firing but only 12 percent when halted. This massively drops power when the CPU is sleeping. For the whole core, Llano only uses 68 percent of the static power and 84 percent of the dynamic power of its 45nm predecessor at a normalized clock.
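What those two percentages mean for total core power depends on how the 45nm core's power splits between static and dynamic, which AMD did not disclose. A minimal sketch, with the static share as an explicit assumption:

```python
STATIC_SCALE = 0.68   # Llano static power relative to the 45nm core
DYNAMIC_SCALE = 0.84  # Llano dynamic power relative to the 45nm core


def relative_core_power(static_share: float) -> float:
    """Llano core power as a fraction of its 45nm predecessor.

    static_share is the assumed fraction of the 45nm core's total power
    that was static (leakage); the rest is treated as dynamic.
    """
    if not 0.0 <= static_share <= 1.0:
        raise ValueError("static_share must be between 0 and 1")
    return static_share * STATIC_SCALE + (1.0 - static_share) * DYNAMIC_SCALE


for share in (0.2, 0.3, 0.4):
    print(f"static share {share:.0%}: "
          f"{relative_core_power(share):.1%} of 45nm core power")
```

Whatever the real split is, the result has to land between 68 and 84 percent of the old core's power at the same clock.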
Overall, it looks like Llano has brought AMD up to par with Intel’s Nehalem core for power management, but as you might know, the Westmere cores are out now. According to a talk from Intel shortly before AMD’s presentation, Westmere added uncore power gating to the mix, raising the bar a bit.
It doesn’t matter that much in the end, Llano is the last generation of AMD’s ‘stars’ cores, and it will be sent off with a bang. Power is dramatically dropped, clocks are raised, GPUs integrated, and the L1 cache has been updated for the first time since 2003.
What you see in Llano is not just an x86 with an integrated GPU, but the preview of what to expect in the upcoming Bobcat and Bulldozer CPUs. It is the last of the old line and the first of the new line at the same time, and almost every part of the chip has been updated in some way. I can’t wait for AMD to drop the curtain and tell us the rest of the secrets. S|A