Editor’s Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information. We are mainly reprinting some of the oft-referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.
This article has had some of the original links removed, and was published on Monday, September 1, 2008 at 9:46PM.
GETTING BACK TO the underfill, it is probably the key element of Nvidia’s problem. There is one more property of underfill called the glass transition temperature, Tg for short. Tg is not the melting point; it is the temperature at which a material goes soft and loses most of its structural rigidity. Think of jello. The underfill that Nvidia used, Namics 8439-1, is what’s called a low Tg material, and the Hitachi 3730 has a higher Tg.
To be fair to Nvidia, about the time when the G84 and G86s were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while, and were ‘known’. The last thing you want to do is put a high risk part on a new, market-untested material, so it looks like they went with the safe choice, low Tg.
If Nvidia did their homework right, the Tg of the material should never be hit, the chip should always run below that temp, and the underfill should provide the mechanical support needed to keep the high lead bumps from fracturing. This is why you engineer, test, retest, simulate, pray a lot, and pick your materials very carefully.
Namics 8439-1 underfill temp vs strength curve
Here is the Tg curve for Namics 8439-1. Let me be the first to say there appears to be nothing, I repeat, nothing wrong with this material; it does exactly what it says it does. It starts to lose strength at about 60C and by a little over 80C it has 100 times less rigidity. What temps do GPUs run at again? What is the Tj (junction temperature) for them? Ooops. A big, hundreds-of-millions-of-dollars ooopsie right there.
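To put rough numbers on that curve, here is a minimal Python sketch. It assumes stiffness is flat up to about 60C, falls in a straight line on a log scale to one hundredth of its starting value by 80C, and stays there. Only those two endpoints come from the description of the curve; the shape in between, the name relative_stiffness, and its parameters are illustrative assumptions, not Namics data.

```python
def relative_stiffness(temp_c, onset_c=60.0, floor_c=80.0, drop_factor=100.0):
    """Stiffness as a fraction of its cold value (hypothetical curve shape).

    onset_c and floor_c approximate the 60C and 80C points quoted in the
    article; the log-linear interpolation between them is an assumption.
    """
    if temp_c <= onset_c:
        return 1.0
    if temp_c >= floor_c:
        return 1.0 / drop_factor
    # Fraction of the way through the softening window, from 0 to 1.
    fraction = (temp_c - onset_c) / (floor_c - onset_c)
    # Straight line on a log scale: each equal step in temperature
    # removes the same multiple of stiffness.
    return drop_factor ** (-fraction)


if __name__ == "__main__":
    for temp in (50, 60, 65, 70, 75, 80, 90):
        print(f"{temp}C: {relative_stiffness(temp):.3f} of cold stiffness")
```

Under these assumptions the underfill is already down to roughly a tenth of its cold stiffness halfway through the window, which is why a chip that only briefly touches the upper end of that range is still in trouble.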
So, the failure chain happens like this. Nvidia for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the chip floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.
The next choice was the underfill materials, and again, they chose the known low Tg part that had far lower tolerances than the newer-to-the-market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit that Tg, it is almost like the Namics material isn’t there, and the stress transfers to the bumps while they are hot and weak.
Fanbois will cry that their $.23 temp sensor is reading much lower temps than that, so there is no way this could be an issue. Well, the temp sensors on many cards are not on die, much less between the die and the substrate. They are also cheap and notoriously inaccurate. To top it off, they only measure average temp across the chip, not hot and cold spots. If you look at the IR photo above, you can see that if you move the sensor from the right side to the left, you will get vastly differing readings. In this case, a real current chip, it will vary by as much as 30C depending on placement of the temperature sensor.
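A quick Python sketch shows why an average reading hides the problem. The temperature grid below is entirely made up for illustration (it is not taken from the IR photo), chosen only so that the spread across the die is 30C; the names DIE_TEMPS_C and summarize are likewise hypothetical.

```python
from statistics import mean

# Hypothetical die temperature map in degrees C. These are invented
# values for illustration, not measurements from any real chip.
DIE_TEMPS_C = [
    [60, 62, 65],
    [62, 70, 85],
    [63, 75, 90],
]


def summarize(temps):
    """Return the average, hottest cell, and hot-to-cold spread of a grid."""
    cells = [t for row in temps for t in row]
    return {
        "average": mean(cells),
        "hottest": max(cells),
        "spread": max(cells) - min(cells),
    }


if __name__ == "__main__":
    summary = summarize(DIE_TEMPS_C)
    print(f"average: {summary['average']:.1f}C")
    print(f"hottest: {summary['hottest']}C")
    print(f"spread:  {summary['spread']}C")
```

With these invented numbers the average sits comfortably below 80C while the hottest corner is well past it, so a sensor reporting the average would say everything is fine.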
Many people also don’t realize that it is easier for heat to travel down through the pins (they are mini heat pipes) than it is to cross the three thermal barriers (die -> thermal paste -> heat spreader -> thermal paste -> heatsink) to the heatsink. That means those little bumps take a huge thermal pounding, and are usually hotter than the surface of the heat spreader.
To make matters worse, the bumps that are under the hot spot get hotter still. Piling on the pain, they carry the most current, and the hotter they get, the more resistance they usually have and the more heat they generate, which turns into a feedback loop creating even more heat.
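That feedback loop can be sketched in a few lines of Python. This is a toy model, not a bump simulation: it assumes resistance rises linearly with temperature and that the bump heats up in proportion to the power it dissipates. Every number in it (current, base resistance, temperature coefficient, thermal resistance) is a made-up placeholder, and settle_bump_temperature is a hypothetical name.

```python
def settle_bump_temperature(
    current_a,
    base_temp_c=80.0,      # assumed temperature of the bump's surroundings
    r0_ohm=0.005,          # assumed bump resistance at ref_temp_c
    alpha_per_c=0.004,     # assumed resistance rise per degree C
    theta_c_per_w=200.0,   # assumed self-heating per watt dissipated
    ref_temp_c=20.0,
    steps=200,
):
    """Iterate temperature -> resistance -> heat -> temperature until it settles."""
    temp = base_temp_c
    for _ in range(steps):
        # Hotter metal has more resistance.
        resistance = r0_ohm * (1 + alpha_per_c * (temp - ref_temp_c))
        # More resistance at the same current means more heat (I squared R).
        power_w = current_a ** 2 * resistance
        # More heat means a hotter bump, which feeds the next pass.
        temp = base_temp_c + theta_c_per_w * power_w
    return temp


if __name__ == "__main__":
    for amps in (0.5, 1.0, 2.0):
        print(f"{amps} A -> {settle_bump_temperature(amps):.2f}C")
```

With these placeholder values the loop settles quickly, slightly above where it would sit if resistance were fixed. The point is the direction of the effect: the bumps carrying the most current end up the hottest, and the loop only pushes them further that way.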
Could it get worse? Of course it could. Remember thermal stress? The expansion is highest at the point, wait for it, that is hottest. That would be under the hot spots, and it puts the most stress on the bumps that are weakest.
This is why you have to pick your underfill very carefully: it has to relieve as much stress as it can from the bumps. Too little stiffness and they go snap, and the chip dies. Too much and you pull the polyimide layer off and the chip dies. Basically, you go as stiff as you dare, then test the hell out of it under the conditions your simulations tell you will be present. Test test test test or dies die.
When the underfill glassifies, it means you are at the hottest point on the die, and the bumps that it is protecting are under the most heat, pulling the most current, and under the most thermal stress. If the underfill essentially turns to jello, it is very bad. If you compound that by using bumps that bond poorly to the substrate, it makes things worse. If those bumps are also stiffer than the other option, it is worse yet.
Let’s go down the checklist for Nvidia. High thermal load? Check. Unforgiving high lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right. If it were as simple as the underfill glassifying, the parts would have never made it to market, but it is much more complex than that. The problem with thermal stress is that it is somewhat additive: it weakens parts long before they actually break unless it is quite extreme.
An example of extreme thermal stress would be to take a glass cup, preferably non-tempered, and put it in the oven on max. Pull it out and drop it in a bucket of ice water, and voila, instant thermal stress demonstration. Wear eye protection. The thermal stress that the bumps experience is much more like the fork example earlier: it gets weaker and weaker with each bend, until snap, black screen.
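The ‘somewhat additive’ behavior can be sketched with a linear damage sum, the textbook simplification usually called Miner’s rule: each thermal cycle uses up a fraction of the part’s life, and the part is treated as failed once the fractions add up to one. This is a swapped-in rule of thumb for illustration, not anything Nvidia or its packaging partners are known to have used, and the cycle counts below are invented.

```python
def accumulated_damage(history):
    """Sum the fraction of life used at each stress level.

    history is a list of (cycles_seen, cycles_to_failure) pairs, one pair
    per kind of thermal cycle. All values here are hypothetical.
    """
    return sum(seen / to_failure for seen, to_failure in history)


def has_failed(history):
    """Linear damage rule: the part is considered dead at a total of 1.0."""
    return accumulated_damage(history) >= 1.0


if __name__ == "__main__":
    # Invented example: many mild cycles plus fewer harsh hot-spot cycles.
    mild = (500, 10_000)   # 500 cycles seen, would survive 10,000 of these
    harsh = (300, 600)     # 300 cycles seen, would survive only 600 of these
    print(f"damage so far: {accumulated_damage([mild, harsh]):.2f}")
    print(f"failed yet?    {has_failed([mild, harsh])}")
```

In this made-up case the part still works, but more than half of its life is already gone, and almost all of that came from the harsh cycles. That is the fork being bent: nothing looks wrong until the last bend.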
If you recall, the Nvidia parts are breaking at the bump to substrate connection. This is the weakest point in the chain, and it is where they made the worst possible materials choice. It is not really a surprise that it failed. It is simply piss-poor engineering.
So, what can be done by Nvidia at this point? Well, changing to high Tg underfills is a start, as is changing to eutectic bumps. The high Tg underfill option has come down in risk substantially since the G84 and G86 parts were introduced, so that is a no-brainer, and guess what Nvidia did for the G86? And the G92 as well.
The problem of changing bump types is far thornier, but Nvidia is doing that as well. From the intelligence we have been able to gather, Nvidia has not made any power distribution changes to the parts. There seems to be no power grid, no power plane, nothing to protect the eutectic bumps from electromigration.
This is emblematic of the ‘pants are on fire’ school of engineering, and reports from inside Nvidia confirm that they are in full panic mode over this snafu. With short time horizons to fix a massive batch of defective parts, reliability engineering usually draws the short stick. S|A