Some unlikely 2021 predictions

Posted Jan 10, 2021 13:44 UTC (Sun) by excors (subscriber, #95769)
In reply to: Some unlikely 2021 predictions by NYKevin
Parent article: Some unlikely 2021 predictions

> I'm not a hardware expert, but my simplified and (probably) wrong understanding is as follows: You can't economically turn out perfect silicon every time. Instead, the production process has a certain percentage of errors, and you try to lower this percentage as much as possible, but eventually you start running into diminishing returns.

Also not an expert, but I think you're conflating two different classes of issue:

1) Errors in the design. Often bugs in the HDL code that defines the chip's behaviour and that were not detected by static analysis or by testing in simulation; sometimes bugs that are outside of the logic you were testing, e.g. a crypto block that leaks secrets to an attacker who's precisely monitoring the power lines.

If you discover those issues while running tests on real silicon, it takes many months and millions of dollars to fix the design and produce a new revision of the silicon. It's usually possible, and far cheaper and quicker, to develop a workaround in software (or microcode, firmware, etc.), so CPUs typically come with lists of dozens or hundreds of errata that the relevant software developers need to be aware of. In extreme cases the workaround might be to disable a major hardware feature (like TSX in Haswell), though usually the workarounds are much less painful.

If you're producing a new revision of the silicon anyway (e.g. to fix an unworkaroundable bug, or for a new chip with new features), but you have a good software workaround for a particular erratum, you might still choose not to fix that erratum. 'Fixing' the design risks introducing new bugs (which may delay the new revision for months and cost millions more), so it's safer to leave it alone, and the erratum can persist for a long time.

2) Random defects in the fabrication of each chip, because you're pushing the manufacturing technology to the limit and it's a messy physical process and the atoms don't all go where you want them to. In that case you can turn off (i.e. configure the hardware/microcode/firmware/software/etc to not use) the bad parts of the silicon, or run at a lower frequency / higher voltage that doesn't trigger the fault. You get chips with a lot of variation in capabilities and power efficiency, then the marketing people work out how to divide them up neatly and sell them at different prices.
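To make the second case concrete, here's a rough sketch of how that sorting might look. Everything in it is made up for illustration: the Die fields, the tier names, and the core-count and frequency thresholds are all hypothetical, and real binning also considers voltage, leakage, cache defects and so on.

```python
from dataclasses import dataclass


@dataclass
class Die:
    # Hypothetical per-chip test results
    good_cores: int       # cores that passed testing; the rest get fused off
    max_stable_mhz: int   # highest frequency at which the chip passed testing


# Hypothetical product tiers, ordered from most to least demanding:
# (name, minimum working cores, minimum stable frequency in MHz)
TIERS = [
    ("premium", 8, 4000),
    ("mid", 6, 3500),
    ("budget", 4, 3000),
]


def bin_die(die: Die) -> str:
    """Return the first (most demanding) tier whose requirements the die meets."""
    for name, min_cores, min_mhz in TIERS:
        if die.good_cores >= min_cores and die.max_stable_mhz >= min_mhz:
            return name
    # Too many defects or too slow for any tier
    return "scrap"
```

A die with one dead core but a high stable frequency would land in "mid" with the spare cores disabled, rather than being thrown away, which is where the economic benefit comes from.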

All of those practices will continue because they're sensible economic tradeoffs, and because computers would be unaffordable if customers insisted on perfect hardware.