Seems IBM stopping shipment of these buggy PCs forced Intel to fix the problem. I cannot imagine today's IBM doing something like this.
comex [3 hidden]5 mins ago
No, F00F was a different bug from a few years later.
pixl97 [3 hidden]5 mins ago
Yep, locked the PC up dead.
rbanffy [3 hidden]5 mins ago
> I cannot imagine today's IBM doing something like this.
Today's IBM doesn't ship Intel servers, or Intel anything. They sold that part of their business to Lenovo.
wglb [3 hidden]5 mins ago
However, I was quite concerned at the time when they said, quite incorrectly, that there was no risk to their users from this bug.
charcircuit [3 hidden]5 mins ago
What is the risk? It seems very small to me.
wglb [3 hidden]5 mins ago
The risk is that some divisions will be badly off. If it is a chain of calculations, e.g. a stress analysis or a spreadsheet running a chain of calculations for a financial report, the errors could compound.
charcircuit [3 hidden]5 mins ago
Divisions being off isn't the end of the world. Even without the bug, a division can be off due to the use of fixed-precision floats.
Stress analysis and financial reports are more likely to be wrong due to other sources of error than a division being slightly off. If you really wanted exact numbers, you wouldn't be using fixed-precision floats anyway.
inetknght [3 hidden]5 mins ago
> If you really wanted exact numbers you wouldn't be using fixed precision floats anyways.
Let the adults play with things that need to work exactly as documented (such as IEEE 754 floating point representations) and therefore can be relied upon when required. You can go back to building your little unreliable toys that nobody uses.
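To make the "works exactly as documented" point concrete, here is a minimal C sketch; the bit pattern in the comment is the standard IEEE 754 double encoding of 0.1, the same on every conforming implementation:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        /* IEEE 754 specifies every rounding exactly: 0.1 is not
           representable, but the double you get for it is a single,
           documented bit pattern, reproducible everywhere. */
        double d = 0.1;
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);
        printf("0x%016llX\n", (unsigned long long)bits);  /* 0x3FB999999999999A */
        return 0;
    }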
charcircuit [3 hidden]5 mins ago
There is no need to belittle me while not providing a practical example where the average consumer can be harmed by this bug.
genewitch [3 hidden]5 mins ago
Humans are potentially harmed when developers use an unsigned int as a counter and it rolls over to zero. Or a byte, in the case of medical radiation machines.
I guarantee that if you had access to a full NNTP text dump from this era, you'd find some "harm".
Intel is dead, long live Intel.
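As a minimal illustration of the counter wraparound described above (a toy C sketch, not from any actual medical device code):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        /* An unsigned counter silently wraps to zero past its maximum,
           so any guard comparing against it can pass when it shouldn't. */
        unsigned int counter = UINT_MAX;
        counter++;                   /* well-defined wraparound to 0 */
        printf("%u\n", counter);     /* prints 0 */
        return 0;
    }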
charcircuit [3 hidden]5 mins ago
>when developers use an unsigned int as a counter and it rolls to zero
Yet, people wouldn't expect to return their CPU if this happened. The entire technology stack of a computer is filled with bugs, yet people are able to use them to great utility every day.
fragmede [3 hidden]5 mins ago
Therac-25?
charcircuit [3 hidden]5 mins ago
The average consumer with a Pentium didn't have it in a radiation therapy machine.
rbanffy [3 hidden]5 mins ago
But those who used a Therac 25 wouldn't be happy.
wglb [3 hidden]5 mins ago
From the Wikipedia article:
Abrash spent hours tracking down exact conditions needed to produce the bug, which would result in parts of a game level appearing unexpectedly when viewed from certain camera angles.
charcircuit [3 hidden]5 mins ago
Yet they thought the one-frame flash was insignificant enough to ship the game with it instead of spending time to work around the bad division. But thank you for providing an example.
hansvm [3 hidden]5 mins ago
Alright, then quantum chemistry simulations. It's very common in the field to have algorithms with known error bounds given a certain floating-point size, and to choose a size amenable to the scale of simulation you intend to attempt. If some of your computations effectively come out at half the precision, the results are hosed.
charcircuit [3 hidden]5 mins ago
Most consumers are not doing quantum chemistry simulations.
eru [3 hidden]5 mins ago
You don't need precise numbers to figure out whether your bridge will stand. What you need is a calculation designed to be robust to the errors incurred in measurement and computation.
The standard for floats guarantees you specific and precise error bounds that you can use to do an error analysis for your whole calculation. Most likely, whatever engineering software you use to check your bridge design will already have this error analysis baked in.
If you introduce some arbitrary other errors, you'd have to redo your error analysis from scratch. And it might not even be tractable, depending on the errors introduced. (The standard floating-point error guarantees are designed to behave reasonably well and predictably when combined into a larger calculation.)
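A rough sketch of what that baked-in error analysis rests on (a standard first-order textbook bound, not from any particular engineering package):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Unit roundoff for IEEE 754 double: u = 2^-53. Each correctly
           rounded operation contributes at most u relative error. */
        double u = ldexp(1.0, -53);
        int n = 1000;  /* length of a chain of multiply/divide steps */

        /* First-order worst-case relative error bound for the chain. */
        double bound = n * u;
        printf("guaranteed relative error <= %.3e\n", bound);

        /* The worst FDIV results were wrong around the fourth or fifth
           significant digit: many orders of magnitude above this bound,
           which is what invalidates the analysis. */
        return 0;
    }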
insufferable_tw [3 hidden]5 mins ago
This is a perfect example of "normalization of deviance".
brookst [3 hidden]5 mins ago
aka six sigma?
MBCook [3 hidden]5 mins ago
Yeah, but when the bug triggered you only got something like eight digits' worth of floating-point precision.
The article says IBM expected normal users to hit it every few days.
charcircuit [3 hidden]5 mins ago
Hitting the bug doesn't mean that it would cause a practical issue for the user.
colechristensen [3 hidden]5 mins ago
You just have no idea what you're talking about. People get killed when things go wrong, and this "oh well other problems are probably worse" attitude is dangerous.
There's no such thing as exact numbers, but there is such a thing as reliable models. The errors introduced by calculating with numerical methods are studied and well understood; a processor not following exactly the rules it's supposed to is an enormous problem.
Here's a little introduction to condition numbers and how they're used to understand floating point error introduced in calculations:
https://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/le...
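The key definition from notes like those, condensed (standard material; the cancellation example is a generic one, not taken from the linked notes):

    % Relative condition number of f at x:
    \kappa_f(x) = \left| \frac{x\, f'(x)}{f(x)} \right|

    % Cancellation is ill-conditioned: for f(x) = x - 1,
    % \kappa_f(x) = |x / (x - 1)|, which blows up as x -> 1.

    % Division itself is well conditioned: f(x) = a/x gives
    % \kappa_f(x) = 1, so FDIV's danger was the raw result being
    % wrong by far more than the guaranteed bound, not amplification.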
The FDIV bug is not theoretical. It existed, and no one died from it. People love to come up with theories about how the bug could have caused terrible things to happen, but in practice it didn't. The next run of the processor had the fix and the world moved on.
colechristensen [3 hidden]5 mins ago
1. Intel wasn’t very popular for scientific computing in 1994
2. No one was stupid enough to make life critical calculations on Intel after it was discovered and widely publicized
You, on the other hand, are suggesting it was no big deal and acting like people doing important work should have just ignored the bug. The reason bugs like this didn’t kill people in a large disaster is that folks with your disposition weren’t in charge of making decisions that would have led to that.
They did a recall that cost Intel roughly a billion dollars in today's money. It wasn't just ignored.
bendercorn [3 hidden]5 mins ago
Should have mentioned that Intel marketing and PR launched the User Test Program to get early PPro systems into the hands of advanced users and make sure there were no lurking FDIVs. Nicely was the first recipient. So was John Williams the composer.
rbanffy [3 hidden]5 mins ago
John Williams from Star Wars or John Williams from Sky?
FWIW I bought (one of) Kevin Peek's guitar amps in 1982 in Kalamunda (W. Australia). It was just an ad in the classifieds; got out there and it was a damn near world-class music studio in a one-room music cabin in the bush.
Intel's marketing has been responsible for a large number of despicable decisions over the years, but the most despicable by far, in my view, was when they segmented their CPU products into Pentium and Pentium Pro.
Later they dropped the "Pro" naming scheme, and the successors of the Pentium Pro have been branded "Xeon" to this day.
IBM had been wise to incorporate memory error detection as a standard feature of every IBM PC, so it was also standard on all IBM PC clones.
By the early nineties, when memory chips in dual-in-line packages were replaced by memory modules, you could buy modules with error detection, but there were also slightly cheaper modules without it, so a computer owner could choose either. I am not a gambler, so I always used modules with error detection.
That changed in 1994, when Intel decided to split their CPUs into Pentium for "consumers" and Pentium Pro for "professional users" willing to spend much more on a workstation or server.
That is when Intel decided that, to push customers toward the overpriced "Pro" CPUs, memory error detection had to be removed from the "consumer" parts.
While in 1993 the first generation of Pentium computers still had memory error detection, in the second generation of 1994 (with the Triton chipsets) it was removed, in preparation for the launch of the Pentium Pro the following year.
We will never know the value of the financial losses inflicted worldwide on naive computer users by this Intel decision.
Fortunately for Intel, and unfortunately for us, software bugs have always been so frequent that computer users have been conditioned to assume that whenever a computer crashes or data corruption is discovered, the cause must be some software bug, and that it is difficult or impossible to determine the exact culprit.
Despite this common assumption, many of these incidents may be caused by hardware memory errors, and besides the noticed incidents there may be many other cases of data corruption that were never discovered.
Intel's claim that removing memory error detection was done for the benefit of customers, to reduce the price of computers, is of course false. After it became impossible to have memory error detection in "consumer" PCs, there was no price reduction in motherboards or memory modules; prices stayed the same, vendors pocketed the extra profit, and so they enthusiastically supported Intel's initiative.
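For context, a minimal sketch of the single-bit parity scheme those memory modules implemented (a simplified model of the idea, not the actual DRAM circuitry; the byte values are made up):

    #include <stdio.h>
    #include <stdint.h>

    /* Even parity over 8 data bits: the stored parity bit makes the
       total number of 1 bits even. Any single flipped bit breaks this. */
    static uint8_t parity_bit(uint8_t data) {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++)
            p ^= (data >> i) & 1;
        return p;  /* 1 if data has an odd number of 1 bits */
    }

    int main(void) {
        uint8_t stored = 0x5A;            /* byte written to memory     */
        uint8_t check  = parity_bit(stored);

        uint8_t read = stored ^ 0x08;     /* simulate a single-bit flip */
        if (parity_bit(read) != check)
            printf("parity error detected on read\n");
        return 0;
    }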
Coe's ratio: https://people.cs.vt.edu/~naren/Courses/CS3414/assignments/p... (3rd page)
Linux code: https://github.com/torvalds/linux/blob/b0cb56cbbdb4754918c28...
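The classic userspace version of that kernel check looks roughly like this (the constants are Tim Coe's well-known test values; a sketch, not the kernel's exact assembly, and on any fixed FPU it just prints ok):

    #include <stdio.h>

    int main(void) {
        /* On a correct FPU the residual is exactly 0.0; on a flawed
           Pentium it is nonzero (famously, 256) because x/y comes
           back misrounded. */
        double x = 4195835.0;
        double y = 3145727.0;
        double residual = x - (x / y) * y;

        printf(residual == 0.0 ? "FPU ok\n" : "FDIV bug present\n");
        return 0;
    }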