Nvidia chips show underfill problems

Bumpgate: Electron microscopes reveal a lot

Editor's Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information.  We are mainly reprinting some of the oft referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.

This article has had some of the original links removed, and was published on Wednesday, December 17, 2008 at 10:37AM.

WHEN WE TOLD you about the ‘bad bumps’ in the Apple Macbook Pro 15″ models the other day, we expected it to end there. As luck would have it, Nvidia pointed us to a much deeper problem that not only affects the Macbook Pro, but likely affects every other high Temperature of Glassification (Tg) underfill chip they make.

Technical Background:

To understand this article, you really need to understand the problem, so please read the technical three part series (Parts 1, 2 and 3) explaining what the problem is and where it occurs. The problem this time lies in Nvidia’s half-hearted response: changing only the underfill. Nvidia also issued this written statement: “The GeForce 9600 GPU in the MacBook Pro does not have bad bumps. The material set (combination of underfill and bump) that is being used is similar to the material set that has been shipped in 100’s of millions of chipsets by the world’s largest semiconductor company (Intel).”  They said so both at the bottom of the initial Macbook article and in a Cnet article here.

In that article, Nvidia’s Mike Hara says, “Intel has shipped hundreds of millions of chipsets that use the same material-set combo. We’re using virtually the same materials that Intel uses in its chipsets.” Note the word virtually. The problem, other than his analogy being purposely misleading and not addressing the issue, is that virtually in this case means they missed a key coating. It is NOT the same as what Intel, AMD, ATI and everyone else we could talk to use in their chips. Unfortunately for Nvidia, the coating they missed is critical to the life of the chip.

Before we break out the electron microscope again, we feel the need to point out some of the things that Nvidia purposely did not talk about in their ‘explanation’ of the ‘fix’. It is sad to have to point this out, but underfill does not crack, bumps do. The bumps that cracked did so because of a chain of technical factors interacting, explained in the three part story above.

Nvidia changed one of the steps in the chain, and, seemingly, none of the others. This may change the frequency of the bumps cracking, either in a good or bad way, or it may not. It may also introduce a new and much more serious failure mode, and that is what we believe Nvidia may face.

Underfill is basically a glue that surrounds the bumps, keeps them free of contamination and moisture, and provides some mechanical support for the chip. There are two key properties of underfill: Tg and stiffness. Tg is the Temperature of Glassification, the temperature at which it loses all stiffness. Instead of thinking about it melting, think of it turning to jello. Stiffness is how hard it is before it melts or turns to jello.

One unusual property of underfill is that the Tg is tied to the stiffness. If you want it to glassify at a higher temperature, it will be stiffer to start with. Lower Tg, softer initial underfill. When designing and making a chip, you have to balance between making it so soft that it effectively does nothing and so hard that it rips the chip apart on first power up. If you do things right, you make it as stiff as you can, but not too stiff. If the underfill is too soft, it doesn’t provide enough structure to relieve the strain on the bumps; too hard and it will damage the underside of the chip itself.
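To make that trade-off concrete, here is a minimal sketch of the design window described above. The stiffness values and window bounds are purely hypothetical placeholders, not real material data; the point is only that a candidate underfill has to clear a lower bound (enough support to relieve the bumps) without crossing an upper bound (too much strain pushed into the die).

```python
# Illustrative sketch only: the stiffness numbers and window bounds below are
# hypothetical placeholders, not real material data. It models the trade-off
# described above: a higher-Tg underfill is stiffer, which relieves more
# strain from the bumps but pushes more strain into the die itself.

def underfill_in_window(stiffness_gpa, min_for_bumps=8.0, max_for_die=12.0):
    """True if a candidate underfill stiffness (GPa) sits inside the
    (hypothetical) design window: stiff enough to relieve the bumps,
    soft enough not to tear the chip's layers apart."""
    return min_for_bumps <= stiffness_gpa <= max_for_die

for e_gpa in (5.0, 9.5, 14.0):
    verdict = "inside the window" if underfill_in_window(e_gpa) else "outside the window"
    print(f"underfill stiffness {e_gpa:>4.1f} GPa -> {verdict}")
```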

Dielectric and passivation layers:

Let’s move back to how a chip is made. You all know about silicon wafers: a 300mm silicon disc that you essentially draw pretty patterns on. Modern chips have a layer of transistors under multiple layers of metal, all drawn on the silicon. You can see some of this in the micrographs below.

Modern chips have many metal layers; eight is pretty common for devices like CPUs and GPUs. To prevent the layers from shorting each other out, a layer of insulation is grown between them, called the dielectric layer. The resulting chip is a relatively thick hunk of silicon with a 16-layer or so sandwich on top that goes metal/dielectric/metal/dielectric and so on. It ends up looking like a Roman aqueduct in a cross-sectional view.
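For those who like to see the stack-up written out, here is a trivial sketch that builds the alternating sandwich just described. The layer names and the eight-metal-layer count come from the paragraph above; layer thicknesses vary by process, so none are modeled.

```python
# Trivial sketch of the stack-up just described: eight metal layers with a
# dielectric layer grown between each, drawn on top of the transistors and
# the silicon. Thicknesses vary by process, so none are modeled here.
METAL_LAYERS = 8

stack = ["silicon substrate", "transistor layer"]
for i in range(1, METAL_LAYERS + 1):
    stack.append(f"metal {i}")
    stack.append(f"dielectric {i}")

for layer in stack:
    print(layer)
print(f"layers in the metal/dielectric sandwich: {2 * METAL_LAYERS}")
```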

A cross-section of an Intel 90nm chip

In a very simplified explanation, the more insulating you make the dielectric layer, the faster the chip can work. This means low-K materials like Black Diamond are really useful, but they are also very fragile. You have eight of these layers, and they have holes punched through them to allow communication between the layers. The stack isn’t all that strong to begin with, and the holes don’t help. On top of the sandwich you have a coating called the passivation layer, usually silicon nitride (SiN), which is basically a hard ceramic coating that protects things.

Remember, these devices are called flip-chips because when they come out of the fab, they are flipped over, and the bumps go on what was the top. This is then covered with underfill and soldered to the substrate, the green fiberglass thing that most people think of as a ‘chip’. The top is now the bottom, and the underfill touches both the substrate and the SiN layer.

Because the SiN is pretty stiff, any mechanical strain on it is transferred into the layers of the chip itself pretty directly. If there is too much strain, the layers of the chip peel apart. That is called catastrophic inter-layer delamination, and it kills chips even deader than cracked bumps.

The obvious fix would be to change the dielectric to a stronger material that can take the stress. Unfortunately, the dielectric layer isn’t an option you can readily swap out on an already designed chip. Different material choices for the dielectric have cascading effects on the chip design and manufacturing process, and there aren’t that many viable materials to begin with. What you end up with is a limit on the stiffness of the underfill. This is why Nvidia didn’t just crank up the underfill Tg a year ago: doing so has very serious consequences, most of them fatal to the device, and the dielectric material options are very limited.

A good analogy is a light bulb and a steel plate: light bulbs are fragile, steel plates are not. If you hit a light bulb with a hammer, you get lots of little pieces, but a steel plate will shrug it off. If you carefully put a steel plate on top of a light bulb and hit it with a hammer, you will not damage the plate, but the bulb will shatter just as if you had hit it directly. This is very similar to how the strain gets transferred, and the chip is basically a multi-layer light bulb and steel plate sandwich.

Polyimide layers:

Luckily for chipmakers, there is a third option that allows you to design a chip with a fairly stiff underfill and not tear things apart. It is called a polyimide (PI) layer, a relatively thick coating (we are still talking µm here) that you put on top of the last dielectric layer. The PI layer is kind of rubbery. It absorbs some of the mechanical strain so the dielectric layers don’t have to, and it distributes the rest over a wider area.

In essence, the PI layer simply protects the chip more, which allows designers to use a stiffer underfill without tearing things apart. Notice I said stiffer, not solid steel. If designers go too far on the underfill, it will transfer too much strain and the chip will still die. The PI layer gives designers a bit more leeway, taking more stress off the bumps, but chip designers still have to choose very carefully and test the results to an amazingly high degree.
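As a toy illustration of that argument, the sketch below models the strain path with and without a PI layer. The 50 percent absorption fraction and the strain numbers are invented for illustration only; the real mechanics are far more complicated. The sketch exists just to show why a rubbery intermediate layer lets a stiffer underfill coexist with a fragile low-K dielectric.

```python
# Toy model, not a mechanical simulation. It only illustrates the article's
# argument: a rubbery PI layer absorbs part of the strain from a stiff
# underfill and spreads the rest, so less of it reaches the fragile low-K
# dielectric. The 0.5 absorption fraction is invented for illustration.

def strain_into_dielectric(underfill_strain, has_pi_layer, pi_absorption=0.5):
    """Return the (relative) strain the low-K dielectric ends up seeing."""
    if has_pi_layer:
        return underfill_strain * (1.0 - pi_absorption)
    # With only a thin, stiff SiN coat, strain passes through nearly intact.
    return underfill_strain

for underfill_strain in (0.6, 1.0):
    no_pi = strain_into_dielectric(underfill_strain, has_pi_layer=False)
    with_pi = strain_into_dielectric(underfill_strain, has_pi_layer=True)
    print(f"underfill strain {underfill_strain:.1f}: "
          f"no PI -> {no_pi:.2f}, with PI -> {with_pi:.2f}")
```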

In the Cnet article, Hara said they changed the underfill, and we will assume that they meant stiffened it, not made it softer. Softening it would only increase the problems they’ve had with bump cracking, and while we may not hold Nvidia engineering in all that high regard, they aren’t abjectly stupid. So, Nvidia changed the underfill to a more ‘robust’ version, and didn’t change anything else. We actually believe this story, mainly based on the parts we have dissected.

All is well, right? Ride off into the sunset to the coffee shop with your new Macbook happily working, Nvidia chips not dying in large numbers. There is only one tiny problem with that ending.

The problem in pictures:

Remember when we said that Nvidia engineering wasn’t abjectly stupid? Scratch that. Remember when we said we were going to break out the electron microscope? It’s time. Remember the part about the PI layer being necessary for stiffer underfills? Guess what?

Test chip with SiN layer

What you are seeing is the top of the bump, where it contacts the chip. The round light gray part on the bottom is the bump, and the darker gray on top is the silicon itself. The spotty stuff above the top yellow line is the transistor and dielectric layer sandwich (the aqueduct), and the dark gray area on the right is the underfill.

This chip, a materials test part, has no PI layer, just a SiN coating. You can see that the SiN coating is not even 2µm thick; it is the dark line that crops the top of the bump and ends at the pad on the chip.

For those of you who have been paying attention, you may notice some clumping on the chip; the bump is eutectic, not high lead. The clumping is the result of repeated heat cycles: this is a thermal test chip, not a production part, used for heat cycle testing.

Test chip with SiN and PI layers

This next one has the same major components, but you will notice the coating is much thicker, five or more times thicker, almost 10µm. That is because it has not only the SiN layer but also a PI layer to absorb mechanical stress. This chip is also a test vehicle, with eutectic bumps and a higher Tg underfill. We can conclude from this that a typical PI layer is 5µm or more thick, while a SiN layer is visibly thinner. Things may change depending on the fab, the materials used and the intended use, but the rough ratio of thicknesses won’t change much.

Nvidia 9600 picture

Last up, we have a close-up of the bump from the Macbook Pro’s G96/9600 GPU. It is a high lead bump with, according to Nvidia, a higher Tg underfill. This means the SiN layer should be under 2µm thick. Check, it is. Then the PI layer should add another 5+µm or so. Che…. Hey, wait a minute, there is no PI layer! No, really, it is not there.
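If you want the rule of thumb the micrographs suggest written down, here it is as a small sketch. The thresholds come from the test chips above (a bare SiN coat under roughly 2µm, SiN plus PI around 5-10µm); the sample thicknesses are approximate readings for illustration, not calibrated measurements.

```python
# Rule of thumb taken from the micrographs above: a bare SiN passivation
# measures under roughly 2 um, while SiN plus a polyimide layer comes to
# roughly 5-10 um. The sample thicknesses below are approximate readings
# for illustration, not calibrated measurements.

def coating_reading(thickness_um):
    if thickness_um < 2.0:
        return "SiN only, no PI layer visible"
    if thickness_um >= 5.0:
        return "SiN plus a PI layer"
    return "ambiguous, needs a closer look"

samples = [
    ("SiN-only test chip", 1.8),
    ("SiN + PI test chip", 9.5),
    ("G96/9600 from the Macbook Pro", 1.8),
]
for name, thickness in samples:
    print(f"{name}: ~{thickness} um -> {coating_reading(thickness)}")
```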

Yeah, you are thinking right: Nvidia simply forgot the one critical layer needed to make their much vaunted, and on the surface correct, high Tg underfill work. To that, all we can say is that it does indeed seem so. If anyone has a better explanation, and several packaging engineers we interviewed did not, feel free to chime in.

What this looks like is that Nvidia traded a bump cracking problem for an inter-layer delamination problem. Both lead to a term that semiconductor people call catastrophic failure, something you don’t need an engineering degree to understand.

According to multiple packaging people interviewed for this story, all of whom wish to remain anonymous, this is a much worse problem than the bump cracking. Phrases like “abject stupidity” and “how the f[no]k did they miss that” were tossed around, but still, Nvidia seems to have designed their chips without the PI layer.

In these conversations, the packaging engineers put forward several scenarios to explain such design choices. None of them posits that it won’t be a problem; they all say it will. They were simply grasping at straws to explain how NV missed this one.

The first scenario theorizes that Nvidia had a bunch of high lead wafers sitting in inventory. When they first learned about the problem, they stopped bumping the chips because they knew where the problem lay, just not why. When the engineers got the go ahead to restart the line with high Tg underfill, they had to use up a few months worth of wafers. Because a PI layer can’t be applied after the wafer is fabbed, they were stuck, so they crossed their fingers and hoped someone like me wouldn’t notice. We noticed, and if everything we hear is true, Macbook owners and a lot of others will notice as well.

The next one is slightly more plausible, that they didn’t have time to properly test those design and packaging choices. A heat cycle test of packaging material takes about 3 months to do, and you can’t really rush it. If the first new parts started rolling out of the fab on July 1, 2008, the first day of Q3, that means they had to go in on the first day of Q2.

Subtract another 3 months prior for time to thermal stress test the solution, and they would have had to start around the first day of Q1/08, basically they flipped the switch with a New Year’s hangover. If this was discovered in the fall of 2007, maybe even late summer, there was only 1Q to figure out what the problem was, research alternatives, and make test structures. There wasn’t time for a second round of tests unless they knew about the problem far in advance of what HP and Dell admitted to in 2007.
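To make the back-calculation explicit, here is a quick sketch of the timeline arithmetic. The roughly-one-quarter durations are the article’s estimates for wafer fab lead time and for a heat cycle qualification run, not figures from Nvidia.

```python
# Back-of-the-envelope timeline sketch. The ~3-month (one quarter) durations
# are the article's estimates for wafer fab lead time and for a heat cycle
# test of the packaging fix, not figures from Nvidia.
from datetime import date, timedelta

QUARTER = timedelta(days=91)                       # roughly three months

parts_out_of_fab = date(2008, 7, 1)                # first day of Q3 2008
wafers_into_fab = parts_out_of_fab - QUARTER       # ~first day of Q2 2008
stress_test_start = wafers_into_fab - QUARTER      # ~first day of Q1 2008

print("wafers had to enter the fab by: ", wafers_into_fab)
print("stress testing had to start by: ", stress_test_start)
```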

The most likely way this would have played out is that Nvidia tested the structures and nothing worked out well. They gritted their teeth and took the most promising option, no PI. The other scenario is that they didn’t figure it out early and were rushed to come out with a ‘fix’ because Jen-Hsun had to file an 8-K and let the public know. Not having an answer and a fix in hand would not be compatible with executive egos, so they had an answer, they just couldn’t say definitively that it would work.

In either case, the length of testing is probably what bit them. It is a long and intricate process to stress test chips like this correctly. Nvidia has shown with the initial bad bumps problem that they botched this across multiple generations, so why should we give them the benefit of the doubt this time? The more interesting question is when they knew what.

Next up, we have the long shot scenario, that Nvidia packaging engineers, if they actually have them rather than outsourcing everything, simply missed an entire branch of science. They all took a class on semiconductor engineering, and slept through that day, together. And didn’t read the book.

One last thing to toss into the mix: cost. The PI layer is expensive; it adds about $50 to the cost of a wafer. Wafers from TSMC on a high end process cost about $3,000-5,000 depending on a lot of details. Adding the PI layer increases the cost of the silicon by a noticeable amount, and it adds to the defect rate.

For cards that sell to big OEMs for $30 or so, silicon can’t be more than a few dollars of the total cost of the card. Adding $.25 to the chip is a big deal; it can mean the difference between profit and loss for the entire run. One engineer suggested that Nvidia might have shot down the PI layer on cost grounds, but we don’t buy that. They aren’t that desperate, are they?
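For the curious, here is the napkin math behind that $.25 figure. The $50 per wafer and the $30 OEM card price come from the paragraphs above; the 200 good dies per wafer is a hypothetical yield figure chosen only to show how a per-wafer cost turns into roughly a quarter per chip.

```python
# Napkin math behind the $.25 figure. The $50 per wafer and the $30 OEM card
# price come from the article; the 200 good dies per wafer is a hypothetical
# yield figure used only to show how a per-wafer cost becomes a per-chip cost.
PI_COST_PER_WAFER = 50.00     # dollars, per the article
GOOD_DIES_PER_WAFER = 200     # assumption, for illustration only
CARD_PRICE_TO_OEM = 30.00     # dollars, per the article

added_cost_per_chip = PI_COST_PER_WAFER / GOOD_DIES_PER_WAFER
print(f"added cost per chip: ${added_cost_per_chip:.2f}")
print(f"share of a ${CARD_PRICE_TO_OEM:.0f} OEM card price: "
      f"{added_cost_per_chip / CARD_PRICE_TO_OEM:.1%}")
```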

Analysis:

What does this mean? Unlike what Nvidia has been implying, we have never stated that the ‘bad bumps’ in the Macbook Pro 15″ would cause a failure. We simply stated that they are using the same material that caused failures in the older Macbooks, several HP and Dell lines, and likely many more that have not been admitted to publicly. The consumer has a right to know this about the products they are buying, and Nvidia steadfastly refuses to tell them.

This time, we see a potentially much more serious problem, and no doubt it will be explained away with pseudo-science and sound bites. Tame journalists and bloggers won’t bother to question the science, won’t understand it, and will take the easy pre-packaged explanation Nvidia will likely hand them. No problem will ever be admitted to, and the problems that Macbook and other computer owners encounter will be something else, a one-off, trust them. Really. Apple did.

Once again, this is not saying that they will fail, or that the one you have will fail. We are simply stating that, of all the packaging experts we talked to, none could come up with a scenario where this was not a massive problem. Once again, time will tell.

Rebuttal:

In the best of half-hearted PR speak, the Nvidia rebuttal (see the Cnet link above) claims the initial investigation of the ‘bad bumps’ was “already flawed”. They won’t say how, but they will toss that out in an attempt to tarnish the evidence. They also won’t say what parts are affected, so there is no way to tell for sure. If our investigation is so wrong, why cover it up?

As for all high lead bumps being bad, that is simply not true; not once was that stated. We stated that, given a chain of engineering failures, bad choices and inadequate testing, these parts are failing. There is a long chain of events that causes the failures; read the three part technical explanation above for more.

Nvidia is claiming that they changed the underfill material, and had Dawn sprinkle a little green fairy dust on them, and that makes everything OK. Every engineer interviewed disagrees. It is clear that they missed a critical step in making these chips, so changing a single step in the chain will very possibly make matters worse.

If you look at what the higher Tg underfill does, it moves strain off the bumps, and puts it on the SiN layer, which transfers it to the fragile dielectric layer. Nowhere does Hara say that they attempted to reduce the strain that causes the failures in the first place, much less accomplished the goal. In fact, he admits the opposite, unless we misinterpret the statement, “What we did was, we just simply went to a more robust underfill.” This is a band-aid, applied by a fairy, and sprinkled with pixie dust. Sadly, it does not appear to be a thoroughly engineered fix.

When Nvidia says, “The material set (combination of underfill and bump) that is being used is similar to the material set that has been shipped in 100’s of millions of chipsets by the world’s largest semiconductor company (Intel).”, they are right, it is similar. Similar is NOT the same, and the devil is truly in the details. Hara is right that every semiconductor manufacturer that uses a high Tg underfill uses a similar recipe, but every single one of them interviewed uses a PI layer in conjunction with the high Tg underfill. Period.

The man behind the curtain:

Last up, Nvidia is strongly hinting, as in this Gizmodo article, that there are some nefarious forces behind this, and that electron microscopes are hard to come by. The implication is that we couldn’t pull The Big Picture Book of Science out of a paper bag with a map, flashlight and guide dog.

It may be true that we are not up on the latest techniques in electron microscopy, but my years of college going from chemical engineering, to chemistry, to biology, to genetics weren’t a total waste. Reading the output from a spectrograph isn’t that tough when you have been holed up in a lab doing it on related devices for years.

That brings up the crack about electron microscope scarcity. They really aren’t that uncommon; he probably just doesn’t know where to look. This author lives quite close to the University of Minnesota, and the last time he attended actual college courses there, many years ago, there were tons of them kicking around, some better than others.

Every major semi house has at least one, likely many; they are indispensable research tools. How many does Nvidia own? We don’t know, but stories like this don’t seem to imply that they are all that uncommon. In fact, this author has seen dozens on tours of companies around the valley. In defense of Hara, he is an IR person; the SEMs at Nvidia are probably on a floor without executive washrooms.

Hara blames Nvidia’s competitors for being behind the story, and that is quite plausible on the surface. I mean, Nvidia are cuddly, nice and honest, who wouldn’t like them? Then again, Nvidia openly declared war on Intel, they go out of their way to antagonize AMD, treat the press like dirt, and play partners off each other. A better question would be, at this point, who actually likes them? If you answered Joel Turnipseed, the guy in Iowa who lost all short term memory in a car accident in 2004, you got the only one.

One other thing that Hara doesn’t appear to realize is that there are a few dozen teardown houses within an hour’s drive of his office. Companies like Nvidia use them all the time when they want plausible deniability, a ‘second opinion’, or to dodge some trade secrets laws. In fact, most semi companies use them regularly.

Some of them are public, others less so. A quick search for ‘chip reverse engineering’ should net you a dozen or so in very little time. To quote a friend from a large semi house, “The good ones don’t have names”.

What they do have, however, is a lot of expensive equipment, like the electron microscopes that are so craftily hidden at Nvidia HQ. They also know how to use them, well. One last thing: their business is quite ‘peaky’. When a new chip comes out, they may tear it down, or tear down a few, and make a report. These reports sell for a lot of money, and that tides them over until a new part is released. In between, some of them sit around bored, throwing darts at pictures of their former employers, while some are busy 24/7. It simply depends.

What it comes down to in the end is that there is simply no shortage of companies, large and small, public and shadowy, that do teardown work. It really isn’t all that hard. There is also no shortage of companies that hate Nvidia; when a company sets out to piss everyone off, it often succeeds. The list of capable organizations with motives is not short, in fact it is very long.

Then again, it was my idea to begin with. When a company responds to an easy direct question with dodgy doublespeak, or answers another seemingly related question, alarm bells go off. Having solid information about the chips before you ask the question immensely aids in analyzing the PR/IR output. The bells went off this time, and the digging started. Several mad scientists liked the idea, and agreed to help out as time permitted. It took two months, but the results were worth it.

Conclusion:

In the end, what you have once again seems to be a massive design and engineering failure. This could, but not necessarily will, lead to inter-layer delamination failures. The Macbook Pro 15″ GPU undoubtedly has the problem, and it is very likely that every Nvidia chip with high lead bumps and high Tg underfill does as well. We are still analyzing the eutectic bump parts, and will follow that with a report if we get anything conclusive.

Nvidia is still stonewalling on the first problem, and likely won’t admit to this one unless they are forced by law to file an 8-K once again. Remember, the last admission was not voluntary. Once again, we will state the obvious: Nvidia needs to come clean over this, admit what models are affected by the bump cracking, what computers the chips went into, and what chips are affected by this latest missing layer. Then the customer can decide. S|A

Note: Apple was once again called twice and informed that there was a potential problem prior to publication. Instead of calling us back to tell us that they knew about the issue and had dealt with it, or that they would stand by their customers, they simply ignored us once again. Because of this, we award them the Steve Jobs Memorial Turtleneck for Pride and Arrogance (SJMTPA) for turning a potentially good situation to mud. Own goal guys, 0 for 6!

Author’s Note: To date, I have not seen or heard any numbers on the MacBooks with bad bump 9600s in them. It will be interesting to see whether the warranty claims differ between the MacBooks with high lead bumps and the ones with eutectic bumps. Anyone want to send me a copy of that info? S|A
