Hyrum's Law for Downtime
Posted May 3, 2022 13:48 UTC (Tue) by atnot (guest, #124910) In reply to: DeVault: Announcing the Hare programming language by nix
Parent article: DeVault: Announcing the Hare programming language
Let's say, for example (inspired by real events), that you have some super reliable database cluster. It's so simple and reliable that it doesn't fail a single request for two years. Over that time, a lot of applications get written that consume that service. Because it is so reliable, the developers never notice that they have introduced bugs in their timeout, retry, backoff or failover logic.
Then one day, there's a hiccup on one instance of the cluster and it hangs on some requests for a few seconds. A small fraction of the application processes hang, or crash and get restarted. Because the database has been so reliable, the applications have come to depend on it being available at startup, so the restarted processes start crashing in a loop. The database instance gets overwhelmed and goes unresponsive for a few more seconds. That repeats the process, crashing more and more application services, until eventually none are left in a running state. All of them are constantly hammering the database trying to start up, taking it down completely. The outage cascades through all of the downstream dependents and takes days to fully resolve.
When people accidentally rely too heavily on something being available, even the smallest transient failures start having serious consequences. Those consequences often cause far more damage and user-visible downtime than deliberately causing a few seconds of downtime a month would have.
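
To make the "hammering the database trying to start up" part concrete, here is a rough sketch of the kind of startup logic that tends to be missing in that story: capped exponential backoff with random jitter, so restarting processes spread their reconnect attempts out instead of all retrying at once. Everything here is made up for illustration. `connect_with_backoff`, `backoff_delay`, and the default numbers are assumptions, not anything from the actual incident, and `connect` stands in for whatever your database client uses to open a connection.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random.random):
    """Return a jittered delay in seconds for a zero-based attempt number.

    The ceiling doubles with every attempt up to `cap`; the actual delay
    is a random fraction of that ceiling ("full jitter"), so a crowd of
    restarting clients does not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def connect_with_backoff(connect, attempts=8, sleep=time.sleep, **delay_opts):
    """Call the hypothetical `connect()` until it succeeds.

    Only ConnectionError is retried. After the last allowed attempt the
    error is re-raised so the caller can decide what to do, rather than
    crash-looping straight back into the database.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(backoff_delay(attempt, **delay_opts))
```

The point of passing `sleep` and `rng` in as parameters is that this path can be exercised in tests with a fake failing `connect`, which is exactly the code nobody notices is broken while the database never fails.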
