Why Nvidia's Chips are defective

Bumpgate: Part One - A long and complex story

by Charlie Demerjian

July 11, 2010

Editor's Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information. We are mainly reprinting some of the oft-referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.

This article has had some of the original links removed, and was published on Monday, September 1, 2008 at 8:00PM.

NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad and giving people reasons why the problem is contained. Unfortunately, Nvidia's disingenuous half-truths don't stand up to an analysis of why this problem is happening.

The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been titled, "More than you ever wanted to know about bumping, and then some: How not to do things". We will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.

The defective parts appear to consist of the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40% early life failures. This is obviously not acceptable.

The end result of the failures is that bumps crack between the bump and the substrate on a chip, not on the bump to die side. When this happens to a signal bump, game over for the GPU or MCP. What is a bump, die and substrate? Why is it happening? That is a long and technical story.

A VIA/Centaur CN/Nano chip. Note the silicon die in the center.

First, let's start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter, but also much thinner. The most important part is the black square at the center, that is the die, or the silicon chip itself. The green fiberglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on the top to the pins on the bottom, and serves as an attachment point for the die and various passive components. The passive components are the little silver things around the edges.

The die itself looks a little smooth and rounded at the edges, but in reality it is very very angular. It has four 90 degree corners, this one is almost square, but some, the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill, it looks like glue seeping from the edges, and serves as mechanical support for the die/substrate bonds and a moisture barrier to protect the bumps.

Side view of the same VIA/Centaur chip

The parts you don't see are the bumps, and they are the most critical part. This type of packaging is called flip chip because the connectors between the die and the substrate are put on top of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimeter on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight. These bumps provide various services for the die, including supplying power, signaling, and attachment to the substrate.

As you can see, the entire package is about the same height as a quarter as well, so the vertical heights are also pretty slim, pun intended. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.

Those are the players in our little drama, now let's move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places. Occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.

Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don't game or are smart enough not to run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, these are uneven and changing constantly.

A thermal photo of a typical multi-core CPU

Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more amperage than idle parts, and once again, those parts change over time. Some bumps may pull a lot of power, others may pull very little, and this again changes over time and with different uses. Each bump also has a limited current capacity, too much and they melt or burn out, so there are far more bumps on a chip than are strictly needed to supply the chip with power.

The idea is not to have only as many bumps as you need to carry the TDP of the chip plus a little tolerance, but to make sure no one bump will ever reach the maximum current it can handle. This is done by putting in far more power bumps on the die than are ever needed from an average current point of view. If things are done right, no single bump will ever exceed the maximum current it can deliver. There is a lot of science here, it is not just simple over-provisioning.
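The gap between sizing for average current and sizing for the worst case can be sketched in a few lines. The block names, the currents and the per-bump limit below are invented for illustration only, they are not figures from any real chip, and real packages are sized with full simulation rather than simple division.

```python
import math


def bumps_for_average(total_current_ma, limit_ma):
    """Bumps needed if current spread perfectly evenly, which it never does."""
    return math.ceil(total_current_ma / limit_ma)


def bumps_per_region(region_peak_ma, limit_ma):
    """Bumps each block needs so no single bump exceeds the limit at that block's peak."""
    return {name: math.ceil(peak / limit_ma) for name, peak in region_peak_ma.items()}


# Hypothetical peak draw per functional block, in mA (assumed values).
REGION_PEAK_MA = {"shaders": 40000, "memory_io": 15000, "video": 5000}
LIMIT_MA = 800            # hypothetical safe current per bump
TYPICAL_TOTAL_MA = 30000  # hypothetical typical total draw; blocks rarely peak together

if __name__ == "__main__":
    print("sized for average:", bumps_for_average(TYPICAL_TOTAL_MA, LIMIT_MA))
    per_region = bumps_per_region(REGION_PEAK_MA, LIMIT_MA)
    print("sized per block peak:", per_region, "total", sum(per_region.values()))
```

With these made-up inputs, sizing each block for its own peak asks for roughly twice as many power bumps as the naive average would, which is the point of the paragraph above.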

The Nvidia defective chips use a type of bump called high lead, and are now transitioning to a type called eutectic. Eutectic materials have two important properties, they have a low melting point and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not, some are gold/tin, others are lead based, it depends on what characteristics you want, and how much you want to pay. It is a property, not a formula.

Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high lead bump on eutectic pad route.

High lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.

The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn't worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally causes them to burn out quicker.
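The relationship between current and lifespan is commonly modelled with Black's equation, which this article does not cite, so treat what follows as a rough sketch rather than anything a packaging house would sign off on. The exponent n, the activation energy and the temperatures are assumed illustrative values, and the function only returns a lifetime ratio, not an absolute number of hours.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(current_ratio, n=2.0, ea_ev=0.8, t_old_k=358.0, t_new_k=358.0):
    """Lifetime multiplier from Black's equation, MTTF ~ J**-n * exp(Ea / (k*T)).

    current_ratio is new current density divided by old current density.
    n, ea_ev and the temperatures are assumptions chosen for illustration.
    """
    current_term = current_ratio ** (-n)
    thermal_term = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_new_k - 1.0 / t_old_k)
    )
    return current_term * thermal_term


if __name__ == "__main__":
    # Doubling the current through a bump at the same temperature.
    print("2x current:", relative_lifetime(2.0))
    # Same current, but the bump runs 10 K hotter.
    print("+10 K:", relative_lifetime(1.0, t_new_k=368.0))
```

Under these assumptions, doubling the current cuts the expected life to a quarter, and running hotter shortens it further, which is why overvolted parts burn out sooner.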

On the good side, eutectic bumps are generally more flexible than high lead. This means they are a bit more forgiving of stress. Some forces that would fracture a high lead bump may be absorbed by a eutectic one without problems.

Bumps overall are a multi-dimensional tradeoff between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.

From bump properties, we move on to thermal expansion of materials, and that is another piece to the puzzle. Most materials expand as they warm up. If you have ever seen a mechanic trying to free a stuck bolt, they usually heat the nut with a blowtorch, this expands the nut and loosens it. The same thing happens with the die and substrate. When you turn on a chip, it heats and expands a little. This expansion is not much, but it is measurable. The substrate also heats and expands.

The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating patterns discussion above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while in use.
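To put a rough size on that mismatch, linear expansion is just coefficient times length times temperature change. The coefficients below are ballpark textbook values assumed for illustration (silicon around 2.6 ppm per kelvin, an organic substrate around 17 ppm per kelvin), the die size and temperature rise are made up, and the sketch pretends the whole part sits at one uniform temperature, which the paragraph above says it does not.

```python
# Assumed ballpark coefficients of thermal expansion, in 1/K.
CTE_SILICON = 2.6e-6
CTE_SUBSTRATE = 17e-6


def expansion_um(cte_per_k, length_mm, delta_t_k):
    """Linear growth in micrometres: dL = alpha * L * dT."""
    return cte_per_k * length_mm * delta_t_k * 1000.0


def mismatch_um(length_mm, delta_t_k):
    """How much further the substrate grows than the die over the same span."""
    return (expansion_um(CTE_SUBSTRATE, length_mm, delta_t_k)
            - expansion_um(CTE_SILICON, length_mm, delta_t_k))


if __name__ == "__main__":
    # Hypothetical 12 mm die edge warming by 60 K.
    print("die grows        :", expansion_um(CTE_SILICON, 12.0, 60.0), "um")
    print("substrate grows  :", expansion_um(CTE_SUBSTRATE, 12.0, 60.0), "um")
    print("mismatch on bumps:", mismatch_um(12.0, 60.0), "um")
```

A difference of around ten micrometres sounds like nothing, but it has to be absorbed by solder balls that are themselves only tens of micrometres across.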

The result? The bumps take a lot of stress, and it changes from second to second. This can be very accurately simulated, and you can engineer bump placement at points of lower thermal expansion and therefore lower stress. If you lose a power bump here and there, the chip will very likely survive, but if you lose a signal bump, game over. This is why bump placement is very important.

Designing what bumps go where is a very complex process, and is done basically when the chip is laid out, near the end of the development process. You don't do it on a whim, you don't make pretty patterns because they are cool, you do it scientifically to minimize the potential for damage.

Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.

Once again, if you did your engineering right, this won't happen in any time frame that matters to mere humans. If it takes 10 years of on and off switching to make it happen, once a day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the 10 year horizon, so they are pretty safe in assuming their designs will live for 5 years of casual use.
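The arithmetic behind that kind of lifetime budget can be sketched with a simplified Coffin-Manson style power law, where cycles to failure scale with the temperature swing raised to a negative exponent. This model is not taken from the article, and the reference cycle count, temperature swings and exponent are all hypothetical numbers picked to make the sums easy to follow.

```python
def cycles_to_failure(ref_cycles, ref_delta_t, delta_t, exponent=2.0):
    """Scale a reference cycle life to a different temperature swing.

    Simplified power law: N ~ (delta_T) ** -exponent. The exponent is an assumption.
    """
    return ref_cycles * (ref_delta_t / delta_t) ** exponent


def years_to_failure(ref_cycles, ref_delta_t, delta_t, exponent, cycles_per_day):
    """Convert the scaled cycle life into calendar years of use."""
    cycles = cycles_to_failure(ref_cycles, ref_delta_t, delta_t, exponent)
    return cycles / (cycles_per_day * 365.0)


if __name__ == "__main__":
    # Hypothetical design point: 3650 cycles (10 years, once a day) at a 60 K swing.
    print("as designed          :", years_to_failure(3650, 60.0, 60.0, 2.0, 1), "years")
    # Same part running hotter, so each cycle swings 80 K.
    print("hotter, 1 cycle/day  :", years_to_failure(3650, 60.0, 80.0, 2.0, 1), "years")
    # Hotter and cycled ten times a day, as a laptop that sleeps and wakes might be.
    print("hotter, 10 cycles/day:", years_to_failure(3650, 60.0, 80.0, 2.0, 10), "years")
```

The exact figures mean nothing, but the shape does: a bigger temperature swing and more frequent cycling each eat into a margin that looked comfortable on paper.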

If you recall, high lead bumps are stiffer than eutectic and more prone to stress fractures. The high lead to eutectic substrate bond is also weaker than a eutectic to eutectic bond. What is happening to Nvidia's chips is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.

One other bit to bring into the mix is underfill. If things were as simple as heat leads to cracking, no chips would work for any length of time. Underfill not only protects the bumps from moisture and contamination, but it also provides mechanical support as well. It is designed to take some of the stress that the bumps take, making them live longer.

Underfill can range from rock hard to soft and squishy, it depends on your application. The harder the underfill, the more mechanical support it provides, and the less stress the bumps take. Simple enough.

That brings us to another material, the Polyimide layer. When chips went to a low-K dielectric material, which is not the same as the high-K gate material, it proved a problem with packaging, bumps and underfill. The solution was to put a polyimide layer, sometimes called a stress layer, to cover the bottom of the chip. This prevents contamination and mechanical damage.

Pick an underfill that is too soft, it won't provide enough mechanical support, and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer interviewed for this article, if you used too hard of an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle, you have to pick a bowl of porridge, I mean underfill, that is strong enough to provide the support you need, but not strong enough to rip layers off your chip. Like I said, package engineering is not for the faint of heart.

Tg and Nvidia underfill

That brings us to the billion dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.

Take a hypothetical simple CPU that has an integer and a floating point pipeline. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily and the floating point bumps will not be doing much of anything at all. When floating point load gets heavy, the opposite happens.

The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won't get all that close to their maximums. To use completely made up numbers, take a bump that has a peak capacity of 1000mA, and for longevity you don't want to exceed 800mA, basically a 20% safety margin.

If the chip TDP divided by the number of bumps, i.e. the average current per bump, is 200mA, then there are likely many bumps drawing 100mA and a few heavily loaded areas that draw 600mA. This draw moves around with the work the chip is doing. Some may never break 100mA, others may be at 600mA for their entire lives. All are well below the 800mA longevity limit, much less the 1000mA max.

The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration gets. Let's pick a hypothetical eutectic bump that has a peak capacity of 500mA and the same 20% safety margin, 400mA for long life. If Nvidia wants to swap in eutectic bumps for the high lead they are using, there is a slight problem, they are well over the current capacity of the new bumps.
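The made-up numbers above are easy to check in code. The load list below is one hypothetical way to arrive at the 200mA average (eight lightly loaded bumps and two hot ones), and the function simply flags any bump above the derated limit. It is a sketch of the argument, not of how any vendor actually validates a package.

```python
def overloaded(loads_ma, peak_ma, margin=0.20):
    """Return the loads that exceed the derated limit, peak * (1 - margin)."""
    limit = peak_ma * (1.0 - margin)
    return [load for load in loads_ma if load > limit]


# Hypothetical per-bump draw in mA: eight quiet bumps, two in a hot spot.
LOADS_MA = [100] * 8 + [600] * 2

HIGH_LEAD_PEAK_MA = 1000  # made-up figure from the text
EUTECTIC_PEAK_MA = 500    # made-up figure from the text

if __name__ == "__main__":
    print("average per bump   :", sum(LOADS_MA) / len(LOADS_MA), "mA")
    print("high lead overloads:", overloaded(LOADS_MA, HIGH_LEAD_PEAK_MA))
    print("eutectic overloads :", overloaded(LOADS_MA, EUTECTIC_PEAK_MA))
```

Same chip, same average, same bump positions: the high lead bumps all pass, while the two hot spot bumps blow through not just the eutectic 400mA longevity limit but the 500mA peak as well.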

If the chip actually powers up without letting smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.

What do you do? You can either cut the power used by the GPU way way down, that is, clock it at a point where no one would ever buy it, or rearrange where the bumps are placed. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can't be done and validated within the time the chip is on sale.

The other option is basically just as bad, you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and can have other detrimental effects to power draw and clocking.

All of these things can be dealt with if you see this coming when you start designing the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high lead bumps and gotten in the hole that they are in now. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts. S|A


19 Comments

  1. Darius July 12, 2010, 3:31 a.m.

    With the launch of GF104 your credibility is gone for good now Charlie. You knew it the moment GF104 turned out to have redesigned SMs with 48 Cuda cores and superscalar design.

    So much for all your talk about GF104 beeing cut down GF100 :D


  2. Fudo July 12, 2010, 4:32 a.m.

    This is an obvious BLACK PROPAGANDA to ruin the GTX 460 launch. Why rehash old news?

    Charlie, what in the hell motivated you to rehash this article? Because if you just kept on reposting old news, you're much much worse than how nvidia is renaming the 8800GT all the way to GTS250.


  3. OregonTrail July 12, 2010, 6:26 a.m.

    It's pretty simple. If Charlie doesn't have anything bad to say about Nvidia (which is rare), nothing will be said at all. You'll never see an article covering the GTX460. Wait..... scratch that. If you DO see an article covering the GTX460 (GF104), rest assured it will be because Charlie found a fault with it (imagined or otherwise).
    Unfortunately, Charlie won't go away. It's like ancient Rome and the Coliseum. People gathered in flocks to see somebody disemboweled and cheered about it.
    Charlie's articles share the same fundamental characteristics. Everyone loves a good roast. Especially ATI fans when it comes to Nvidia's demise.
    Shape up Charlie. As of right now, and has been the case for quite some time now, you kind of suck as a blogger, let alone any sort of journalist. Especially lashing out like a little child re-posting an ancient article with a comment update disclaimer at the top. We know why you posted this, as do you.

    OT


  4. Nick July 12, 2010, 6:40 a.m.

    Once again, great article Charlie, hopefully NVIDIA learned from there mistakes, because many in the gaming industry are fed up with there mindless arrogance. It’s bad enough they are attacking the PC gamer by being anti-open standards.
    It’s clearly evident NVIDIA does not want people to know the issues plaguing there graphics cards. Though there current beat up stock price tells a different tail.

    Once again, keep letting people know the TRUTH and hopefully perhaps one day NVIDIA will swallow its mindless arrogance and start supporting us gamers. Until that day it’s ATI all the way.


  5. Phudi Yawa July 12, 2010, 7:31 a.m.

    Using old news as a smear campaign? Lame.


  6. distinctively July 12, 2010, 7:42 a.m.

    I love the old articles being re-published. Its a good reminder to everyone about why nVidia is never trusted around here. I'm sick of having to diagnose computers with "bumpgate disease" and tell people "good luck getting compensation from nVidia for your typically defective part.


  7. wuttz July 12, 2010, 8:29 a.m.

    fudo,
    copy pasting comments is like nvidia copy-pasting design uarch from GF100 to GF104.


  8. Dead-Time July 12, 2010, 8:44 a.m.

    I love it! piss those Nvidia fanbois off!


  9. NormanBates July 12, 2010, 8:58 a.m.

    sorry, charlie, but this was very bad timing: you can't save the fact that you missed with your GF104 expectations (just as I did) by re-branding your great stories of times past

    the reason I visited semiaccurate today is that I expected some actual criticism of GF104, based on some minor details that other journalists may have missed; or, if there are no unseen flaws in the new chip, just some praise for the minor miracle that this would imply; finding this instead wasn't a nice surprise

    the bare minimum would have been some analisys in terms such as these:

    performance comparison, based on geometric average of anandtech's results at 1920x1080:
    ati 5830 = 91
    nvidia 460-768 = 100
    nvidia 460-1GB = 108
    ati 5850 = 119
    ati 5870 = 139

    the die size is very similar, and thermals seem to be fine, so maybe yields are comparable too; in that case, the manufacturing cost of 5870-5850-5830 should be close to that of the new 460

    that means nvidia is still under water in terms of performance-per-mm2, but they achieved an impressive improvement with GF104

    so AMD could just lower the prices of 5830 (to under 170) and 5850 (to under 250) to crush nvidia and keep it starving until southern islands come around

    it wouldn't be the first time, but I would be surprised if they didn't do that: if they don't, they risk loosing traction in their uphill battle to win mindshare

    not surprisingly, tom's monthly GPU recommendations have included very few nvidia cards this spring... but still most people think nvidia when they think gaming graphics; that is changing slowly, but if there are swings on leader post, I'd expect very minor further mindshare gains for AMD's front


  10. NormanBates July 12, 2010, 9:09 a.m.

    additional info: prices for cards in stock at www.alternate.de

    460-768 = 205eur
    5770 1gb = 150eur
    5830 1gb = 200eur
    5850 1gb = 294eur
    5870 1gb = 399eur


  11. Phaxmohdem July 12, 2010, 9:18 a.m.

    @Norman - Here's some analysis I didn't see anybody talking about in the reviews out there :)

    http://www.semiaccurate.com/forums/sh...


  12. rich wargo July 12, 2010, 9:25 a.m.

    Thanks for re-running these articles, Charlie. Reminds me why I'll not buy any nVidia crap until they mend their evil ways.

    Fudo? Go suck a dozen raw eggs. You have far less credibility than the fence post out back. At least it does something useful. Admit it, you're just a shill for nVidia; you have NO objectivity.


  13. Olivon July 12, 2010, 11:38 a.m.

    No article GF104 is broken ? Surprising ...


  14. Dahakon July 12, 2010, 2:44 p.m.

    It seems to me that there is quite some uncomfortable feeling with quite some people here concerning bumpgate. So much, that they have trouble swallowing the whole history once again, as if it is contagious. Trashing the reprint, which in the end leads to a new article with honest questions about GF100, is just the same as not wanting to be remembered to the past, not wanting to learn from it, no matter how valuable the lesson may be. If it seems that nVidia as well does not want to learn from hard lessons from the past, it is anyone's right to know that, so that they can make their own well educated decision when it comes to purchasing expensive hardware.
    And to those that still believe Bumpgate is a fairytale, go ask the people that baked their 8800GTX/Ultra's in their kitchen-oven, just to make it work again (for whatever short period of time, if at all).


  15. Stalker July 12, 2010, 5:11 p.m.

    Olivon July 12, 2010, 11:38 a.m.:
    "No article GF104 is broken ? Surprising ..."

    Um... that one was posted more than half a year ago. Do your homework.

    @NormanBates, alternate is hardly the place to compare prices... try schottenland.
    Also, what did you expect in a market without competition? Wasn't nvidia doing the same thing int heir past success period? Yes, they did, and people payed without moaning how expensive that GTX was. Besides, prices will be adjusted once demand for 460 starts to pick up, and mind you, ATI has a lot of margin space to cut from and still be profitable. Nvidia, not so much.


  16. Super XP July 12, 2010, 8:44 p.m.

    Very nice indeed Charlie, there's absolutely nothing wrong with telling people the truth. Those arrogant pricks at NVIDIA needed this to happen. I think it’s time NVIDIA fires the CEO and hires somebody with proper credentials.
    Those hammering Charlie need to grow the “F” up. Nobody can’t be right 100% all the time, but he did a dam good job revealing the truth about NVIDIA’s failed Fermi. I’m sticking to the HD 5800 series, I like my graphics cards to not burn up on me while slaughtering zombies in L4D2.


  17. Mike Leeman July 13, 2010, 11:27 a.m.

    Getting down to the grit of things is great. I like the fact you tell it as it should be told. People don’t like buying products that are defective by design. I don’t know lately, but it seems as though NVIDIA’s been having a lot of issues not only with there graphics cards but any chip they’ve designed as of late. Good for ATI, but bad for competition especially with NVIDIA lying through there teeth and trying to cover up the truth about how bad there products are failing.
    Good catch Charlie, I look forward to your next article based on NVIDIA’s treachery.


  18. Heatesssun July 13, 2010, 12:09 p.m.

    Wow this stuff just keeps getting stranger and stranger. Even in the face of a UNANIMOUS verdict by reviewers that the 460 is the best value ever at $200 we get the same old story. Truly sad. No truth, no understanding, no appreciation for anything, simply one mans belief that has no meaning in the real world.


  19. Kat July 14, 2010, 8:37 p.m.

    The 460 is unrelated. The 480 sucks back a lot of power for poor performance, problem chip. Sure NVIDIA will eventually fix the problem, but until they do, facts are facts period. Charlie is displaying facts about NVIDIA chips in general, they are problematic. Heck even NV’s stock price tanking says it all.


Comments are un-moderated except for automatic spam-reduction services, these services are not related to liposuction or any other dieting method. Hitting the [POST] button here is the legal equivalent to self-publishing. This means that you are liable and therefore RESPONSIBLE for all consequences of what you are writing and publishing. S|A is not and will not be held liable for your publications using our platform. We will happily turn over your IP address to any legal authority with a valid search warrant.
