Editor’s Note: From time to time, SemiAccurate will be republishing some older articles by its authors, some with additional commentary, updates and information. We are mainly reprinting some of the oft-referenced articles that originally appeared on the Inquirer. Some will have added content, but all will be re-edited from the originals as per contractual obligations. You may see some slight differences between the two versions.
This article has had some of the original links removed, and was published on Tuesday, September 2, 2008 at 10:03AM.
SOURCES CLOSE TO Dell say they knew about the problem a year ago, which would put awareness at Summer/Fall 2007, and HP is on record as being aware in November 2007. That means there has been about a year to characterize the problem, design a solution and test it. Multiple sources involved with package engineering tell us that this is not nearly enough time to do a proper test regime, much less long term reliability studies.
This new package and materials set does not appear to have been nearly as carefully vetted as it should have been. It may work, then again, it may not. If the lack of power distribution changes is accurate, we may very well be reading about Nvidia Defective Chipsgate II in a couple of years. [Please see Editor’s Note at bottom]
How widespread is the problem? We told you about G84 and G86s as well as G92 and G94s. From the materials side, it appears that all non-R and non-F lot numbered parts made on the 65nm and 55nm processes are defective. The flaw is a downright idiotic set of materials mix choices coupled with poor chip design and inadequate testing. It is a case of errors compounding errors. They are all defective.
If this is the case, why aren’t we seeing more defective desktop parts? That one is easy: thermal stress. It has two components that lead to a bump fracturing. The first is the amount of stress, that is, the hot-to-cold temperature delta, and the second is the number of times the part is powered up and down, aka the number of heat cycles. Glass cups in the oven would be the amount of stress, the bent fork would be the number of cycles.
If you remember back to the Nvidia 8-K where they announced the problem, Nvidia said, “….and customer use patterns are contributing factors”. By customer usage patterns, they are referring mainly to thermal cycles, but you could also credit them with meaning high temperatures while the GPU is being exercised in gaming and the like.
Desktop systems are turned on once a day or so. Some people leave them on for weeks at a time, others may turn them on and off a few times in a day. The average desktop probably has about one heat cycle a day.
Laptops on the other hand are woken up and put to sleep many times a day. If you take a typical student who wakes up, checks his/her email, goes to three classes and takes notes, goes to a coffee shop for a bit, goes home, watches a video or two, then goes to sleep, it is not hard to make a case for 10+ power cycles a day. Every wake up/sleep or hibernate cycle is a heat cycle, so dozens are not out of the question.
The more cycles you put on it, and the more severe they are, the quicker these defective parts will die. A good way to look at it is to assign each stressor a value in the lifespan of a critical bump, and give the bump a total amount of stress it can take before it cracks. Let’s call this number 100AU for Arbitrary Units. If a power on cycle is worth 4 AU, and a hardcore gaming session with the GPU overclocked to within 1MHz of crashing is worth 15, you can figure out when it should die. Remember, these are made up numbers, the point here is the concept of different stressors contributing to the overall failure.
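To make the bookkeeping concrete, here is a minimal sketch of that stress budget in Python. The 100 AU budget and the 4 AU and 15 AU costs are the made up numbers from the paragraph above; the function name, the usage profiles and the assumption that stress simply adds up linearly are ours, purely for illustration.

```python
# Toy stress-budget model using the article's made-up Arbitrary Units (AU).
# Assumption: stress accumulates linearly, which real solder fatigue does not.
BUMP_BUDGET_AU = 100  # total stress a critical bump can take before it cracks

# Cost of each stressor in AU (illustrative values from the text).
STRESS_COST_AU = {
    "power_cycle": 4,
    "overclocked_gaming_session": 15,
}


def days_until_failure(daily_events, budget=BUMP_BUDGET_AU, costs=STRESS_COST_AU):
    """Return the day on which accumulated stress first reaches the budget.

    daily_events maps a stressor name to how many times it happens per day.
    Returns None if there is no daily stress at all.
    """
    daily_stress = sum(costs[name] * count for name, count in daily_events.items())
    if daily_stress <= 0:
        return None
    # Ceiling division: the bump cracks on the day the budget is reached.
    return -(-budget // daily_stress)


# Hypothetical usage profiles, loosely based on the desktop and student examples.
profiles = {
    "desktop": {"power_cycle": 1},
    "laptop": {"power_cycle": 10},
    "gaming laptop": {"power_cycle": 10, "overclocked_gaming_session": 1},
}

for label, usage in profiles.items():
    print(label, days_until_failure(usage))
```

The absolute day counts mean nothing because the units are invented. What matters is the ratio: in this toy model the laptop profile burns through the same budget roughly eight times faster than the desktop one.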
When Dell, HP and others announced a BIOS ‘fix’ for Nvidia’s chip problems, the reason it was so humorous was that all they were doing was lowering the amount of thermal stress on the chips when the fan would not normally be on. When the fan is going full tilt without the ‘fix’, the new ‘updated thermal profiles’ won’t make a difference. When the fans are normally off or on low, the profiles will essentially lessen the stress from a 4 to a 3. It is just there to allow the laptop to live through the warranty period so the companies don’t have to pay for the fix. After that, if the defective chips burn out, it isn’t their problem. The ‘fix’ doesn’t fix anything at all.
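For a rough sense of what dropping a cycle from 4 to 3 buys, here is a small sketch using the same made up 100 AU budget. It assumes, generously, that every single heat cycle happens with the fan off or on low so the new profile always applies; the function name is ours.

```python
import math

BUMP_BUDGET_AU = 100  # made-up stress budget in Arbitrary Units


def cycles_to_failure(stress_per_cycle, budget=BUMP_BUDGET_AU):
    """Number of heat cycles needed to reach the stress budget."""
    return math.ceil(budget / stress_per_cycle)


before = cycles_to_failure(4)  # original thermal profile
after = cycles_to_failure(3)   # 'updated thermal profile', best case

print(before, after, after - before)
```

Under these assumptions the bump survives about a third more cycles before it cracks. The failure is postponed, not prevented, and any cycles spent with the fan already at full tilt get no benefit at all.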
In the end, it comes down to Nvidia screwing up badly on package engineering and testing, then trying as best they can to bury the problem while passing the buck. It appears that every Nvidia 65nm and 55nm part with high lead bumps and/or low Tg underfill is defective, it is just a question of how defective they are, and when they will die.
As far as we are able to tell, despite Nvidia’s vague statements to the contrary, there are no materials defects. Every material they used lived up to the claimed specs, and every material they used would have done the job if kept within the advertised parameters. Nvidia’s engineering failures put undue stress on the materials, and several failures compounded to make two generations of defective chips.
When they started talking about this, Nvidia failed crisis management 101, and the coverup shows they don’t care about consumers, just their pocketbooks. They are doing exactly the wrong thing for the wrong reasons, and the lawyers circling with class action paperwork in hand are going to eat them alive.
The last time you had such a huge batch of defective GPUs, the company that did it swore up and down, just like Nvidia, that there was no problem despite forums filled with evidence to the contrary. A few weeks later, they turned around and admitted there was a problem, and took a $1.1 billion charge, placating customers and fending off lawsuits. You know that as the Xbox 360 Red Ring of Death.
I wonder why Nvidia can’t be that smart? S|A
Editor’s note: We apologize that we did not get this one spot on. It was not, in retrospect, called Nvidia Defective Chipsgate but, as widely sourced on the intertubes, “Bumpgate.” In the future we shall call it Bumpgate. Part II, however, will be available shortly.