Editor’s Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information. We are mainly reprinting some of the oft-referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.
This article has had some of the original links removed, and was published on Monday, September 1, 2008 at 8:00PM.
NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad and giving people reasons why the problem is contained. Unfortunately, Nvidia’s disingenuous half-truths don’t stand up to an analysis of why this problem is happening.
The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been titled, “More than you ever wanted to know about bumping, and then some: How not to do things”. We will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.
The defective parts appear to consist of the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply what the failure rate of each line is, with field reports on specific parts hitting up to 40% early life failures. This is obviously not acceptable.
The end result of the failures is that bumps crack between the bump and the substrate on a chip, not on the bump to die side. When this happens to a signal bump, game over for the GPU or MCP. What is a bump, die and substrate? Why is it happening? That is a long and technical story.
A VIA/Centaur CN/Nano chip. Note the silicon die in the center.
First, let’s start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter, but also much thinner. The most important part is the black square at the center, that is the die, or the silicon chip itself. The green fiberglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on the top to the pins on the bottom, and serves as an attachment point for the die and various passive components. The passive components are the little silver things around the edges.
The die itself looks a little smooth and rounded at the edges, but in reality it is very very angular. It has four 90 degree corners. This one is almost square, but some, the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill, which looks like glue seeping from the edges, and serves as mechanical support for the die/substrate bonds and a moisture barrier to protect the bumps.
Side view of the same VIA/Centaur chip
The parts you don’t see are the bumps, and they are the most critical part. This type of packaging is called flip chip because the connectors between the die and the substrate are put on top of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimeter on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight. These bumps provide various services for the die, including supplying power, signaling, and attachment to the substrate.
As you can see, the entire package is about the same height as a quarter as well, so the vertical heights are also pretty slim, pun intended. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.
Those are the players in our little drama, now let’s move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places. Occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.
Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don’t game or are smart enough not to run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, these are uneven and changing constantly.
A thermal photo of a typical multi-core CPU
Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more amperage than idle parts, and once again, those parts change over time. Some bumps may pull a lot of power, others may pull very little, and this again changes over time and with different uses. Each bump also has a limited current capacity, too much and they melt or burn out, so there are far more bumps on a chip than are strictly needed to supply the chip with power.
The idea is not to have only as many bumps as you need to carry the TDP of the chip plus a little tolerance, but to make sure no one bump will ever reach the maximum current it can handle. This is done by putting in far more power bumps on the die than are ever needed from an average current point of view. If things are done right, no single bump will ever exceed the maximum current it can deliver. There is a lot of science here, it is not just simple over-provisioning.
The defective Nvidia chips use a type of bump called high lead, and Nvidia is now transitioning to a type called eutectic. Eutectic materials have two important properties, they have a low melting point and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not, some are gold/tin, others are lead based, it depends on what characteristics you want, and how much you want to pay. It is a property, not a formula.
Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high lead bump on eutectic pad route.
High lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.
The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn’t worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally causes them to burn out quicker.
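The link between current and lifespan is commonly modeled with Black's equation, which says that the median time to failure falls off with current density raised to a power and with temperature. Here is a minimal sketch of that relationship, computing lifetime relative to a reference bump. The exponent `n`, the activation energy `ea_ev` and the reference temperature are illustrative assumptions, not measured values for any real bump or any Nvidia part.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(current_ratio, temp_c, ref_temp_c=85.0, n=2.0, ea_ev=0.8):
    """Lifetime relative to a reference bump, following Black's equation.

    current_ratio: bump current divided by the reference bump's current.
    n and ea_ev are assumed, illustrative values, not data for a real bump.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    current_term = current_ratio ** (-n)
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV * (1.0 / t - 1.0 / t_ref))
    return current_term * thermal_term


# Same temperature, double the current: a quarter of the life when n = 2.
print(relative_lifetime(2.0, 85.0))
# Same current, 20C hotter: shorter life as well.
print(relative_lifetime(1.0, 105.0))
```

With these assumed parameters, doubling the current through a bump cuts its expected life to a quarter, and running hotter makes it worse still.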
On the good side, eutectic bumps are generally more flexible than high lead. This means they are a bit more forgiving of stress. Some forces that would fracture a high lead bump may be absorbed by a eutectic one without problems.
Bumps overall are a multi-dimensional tradeoff between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.
From bump properties, we move on to thermal expansion of materials, and that is another piece to the puzzle. Most materials expand as they warm up. If you have ever seen a mechanic trying to free a stuck bolt, they usually heat the nut with a blowtorch, this expands the nut and loosens it. The same thing happens with the die and substrate. When you turn on a chip, it heats and expands a little. This expansion is not much, but it is measurable. The substrate also heats and expands.
The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating patterns discussion above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while in use.
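To put rough numbers on the mismatch, linear expansion is just length times the coefficient of thermal expansion times the temperature change. The sketch below applies that to a bump sitting some distance from the center of the die. The coefficients (about 2.6 ppm/C for silicon, 17 ppm/C for an organic substrate) are typical textbook figures used here as assumptions, and the temperature rises are made up. The function name is ours.

```python
def expansion_mismatch_um(distance_mm, die_delta_t, substrate_delta_t,
                          cte_die_ppm=2.6, cte_substrate_ppm=17.0):
    """Difference in free expansion (micrometers) between substrate and die
    at a point distance_mm from the die center.

    The CTE defaults are assumed typical values, not figures for a
    specific package.
    """
    distance_um = distance_mm * 1000.0
    die_growth = distance_um * cte_die_ppm * 1e-6 * die_delta_t
    substrate_growth = distance_um * cte_substrate_ppm * 1e-6 * substrate_delta_t
    return substrate_growth - die_growth


# Made-up case: die warms by 60C, substrate under it by 40C.
for distance_mm in (0.0, 3.5, 7.0):
    print(distance_mm, expansion_mismatch_um(distance_mm, 60.0, 40.0))
```

The mismatch is zero at the center and grows with distance, so under these assumptions the bumps near the corners of the die are the ones that get pulled around the most.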
The result? The bumps take a lot of stress, and it changes from second to second. This can be very accurately simulated, and you can engineer bump placement at points of lower thermal expansion and therefore lower stress. If you lose a power bump here and there, the chip will very likely survive, but if you lose a signal bump, game over. This is why bump placement is very important.
Designing what bumps go where is a very complex process, and is done basically when the chip is laid out, near the end of the development process. You don’t do it on a whim, you don’t make pretty patterns because they are cool, you do it scientifically to minimize the potential for damage.
Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.
Once again, if you did your engineering right, this won’t happen in any time frame that matters to mere humans. If it takes 10 years of on and off switching to make it happen, once a day power cycling won’t matter in our lifetimes. Chip makers tend to engineer for timelines like the 10 year horizon, so they are pretty safe in assuming their designs will live for 5 years of casual use.
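Thermal cycling fatigue of solder joints is often estimated with a Coffin-Manson style relation, where the number of cycles to failure drops with the temperature swing raised to a power. The sketch below scales a made-up reference point (a joint that survives 10 years of once a day cycling at a 40C swing) to other conditions. The exponent of 2 and every number here are assumptions for illustration, not test data.

```python
def cycles_to_failure(delta_t_c, ref_delta_t_c, ref_cycles, exponent=2.0):
    """Scale a reference cycle count to a different temperature swing.

    exponent is an assumed, illustrative fatigue exponent.
    """
    return ref_cycles * (ref_delta_t_c / delta_t_c) ** exponent


def years_of_life(cycles, cycles_per_day):
    """Convert a cycle count into years at a given cycling rate."""
    return cycles / (cycles_per_day * 365.0)


REF_CYCLES = 3650  # made-up: 10 years of once a day cycling at a 40C swing

doubled_swing = cycles_to_failure(80.0, 40.0, REF_CYCLES)
print(years_of_life(REF_CYCLES, 1))      # reference joint, once a day
print(years_of_life(doubled_swing, 1))   # twice the swing, once a day
print(years_of_life(doubled_swing, 5))   # twice the swing, five cycles a day
```

Under these assumptions, doubling the temperature swing takes the joint from 10 years to 2.5, and cycling it five times a day instead of once brings that down to half a year.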
If you recall, high lead bumps are stiffer than eutectic and more prone to stress fractures. The high lead to eutectic substrate bond is also weaker than a eutectic to eutectic bond. What is happening to Nvidia’s chips is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.
One other bit to bring into the mix is underfill. If things were as simple as heat leads to cracking, no chips would work for any length of time. Underfill not only protects the bumps from moisture and contamination, but it also provides mechanical support as well. It is designed to take some of the stress that the bumps take, making them live longer.
Underfill can range from rock hard to soft and squishy, it depends on your application. The harder the underfill, the more mechanical support it provides, and the less stress the bumps take. Simple enough.
That brings us to another material, the polyimide layer. When chips went to a low-K dielectric material, which is not the same as the high-K gate material, it proved a problem with packaging, bumps and underfill. The solution was to add a polyimide layer, sometimes called a stress layer, covering the bottom of the chip. This prevents contamination and mechanical damage.
Pick an underfill that is too soft, it won’t provide enough mechanical support, and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer interviewed for this article, if you used too hard of an underfill, the chip “wouldn’t survive the first heat cycle”. The magic is in the middle, you have to pick a bowl of porridge, I mean underfill, that is strong enough to provide the support you need, but not strong enough to rip layers off your chip. Like I said, package engineering is not for the faint of heart.
Tg and Nvidia underfill
That brings us to the billion dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.
Take a hypothetical simple CPU that has an integer pipeline and a floating point pipeline. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily and the floating point bumps will not be doing much of anything at all. When floating point load gets heavy, the opposite happens.
The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won’t get all that close to their maximums. To use completely made up numbers, take a bump that has a peak capacity of 1000mA, and for longevity you don’t want to exceed 800mA, basically a 20% safety margin.
If the chip TDP divided by the number of bumps, i.e. the average current per bump, is 200mA, then there are likely many bumps drawing 100mA and a few heavily loaded areas that draw 600mA. This draw moves around with the work the chip is doing. Some may never break 100mA, others may be at 600mA for their entire lives. All are well below the 800mA longevity limit, much less the 1000mA max.
The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration gets. Let’s pick a hypothetical eutectic bump that has a peak capacity of 500mA and the same 20% safety margin, 400mA for long life. If Nvidia wants to swap in eutectic bumps for the high lead they are using, there is a slight problem: the heavily loaded bumps drawing 600mA are well over the current capacity of the new bumps.
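The made up numbers above are easy to check mechanically. Here is a small sketch that takes a list of per-bump currents and reports which bumps exceed the long life limit for each bump type. The load list is invented to match the example (average 200mA, one 600mA hot spot), and the dictionary layout and function name are ours, not anything from a real packaging tool.

```python
# Hypothetical bump types using the made up figures from the text.
HIGH_LEAD = {"peak_ma": 1000, "derating": 0.8}  # 800mA long life limit
EUTECTIC = {"peak_ma": 500, "derating": 0.8}    # 400mA long life limit


def overloaded_bumps(loads_ma, bump):
    """Return indexes of bumps whose current exceeds the long life limit."""
    limit_ma = bump["peak_ma"] * bump["derating"]
    return [i for i, load in enumerate(loads_ma) if load > limit_ma]


# Invented per-bump currents: mostly 100mA, one 600mA hot spot, average 200mA.
loads_ma = [100, 100, 100, 600, 100, 200]

print(sum(loads_ma) / len(loads_ma))           # average current per bump
print(overloaded_bumps(loads_ma, HIGH_LEAD))   # no bump over 800mA
print(overloaded_bumps(loads_ma, EUTECTIC))    # the 600mA bump is over 400mA
```

The average is identical in both cases, which is the point: the average says nothing about whether the worst bump survives.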
If the chip actually powers up without letting smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn’t fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.
What do you do? You can either cut the power used by the GPU way way down, that is, clock it at a point where no one would ever buy it, or rearrange where the bumps are placed. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can’t be done and validated within the time the chip is on sale.
The other option is basically just as bad, you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and can have other detrimental effects to power draw and clocking.
All of these things can be dealt with if you see this coming when you start designing the GPU. It is pretty painfully obvious that Nvidia didn’t, otherwise they wouldn’t have used high lead bumps and gotten into the hole that they are in now. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts. S|A