Averting excessive oopses

Posted Nov 20, 2022 21:07 UTC (Sun) by NYKevin (subscriber, #129325)
In reply to: Averting excessive oopses by adobriyan
Parent article: Averting excessive oopses

It depends on how their system is set up. If you use a cattle-not-pets mentality, then you probably have panic_on_oops enabled and some higher-level control plane (such as k8s) is responsible for rescheduling containers onto other available nodes. In that case, this change would not affect you at all, because you're effectively already running with an oops_limit of 1.
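To make that equivalence concrete, here is a minimal Python sketch of how the two sysctls combine. It follows the kernel's documented behaviour for kernel.oops_limit: 0 disables the count, and 1 behaves like panic_on_oops=1. The function name is made up for illustration; it is not a kernel or library API.

```python
from typing import Optional


def effective_oops_limit(panic_on_oops: int, oops_limit: int) -> Optional[int]:
    """Return how many oopses the kernel tolerates before panicking.

    None means "never panics because of the oops count".
    Hypothetical helper, for illustration only.
    """
    if panic_on_oops:
        # Panic on the very first oops, whatever oops_limit says.
        return 1
    if oops_limit == 0:
        # A limit of 0 disables the count check entirely.
        return None
    return oops_limit
```

With panic_on_oops set, the result is 1 regardless of oops_limit, which is why a panic-and-reschedule setup should see no difference from this change.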

Averting excessive oopses

Posted Nov 29, 2022 0:41 UTC (Tue) by jccleaver (guest, #127418)

Fleet not cattle. If your boxes have state, data locality, and preferred workloads, then they develop individual quirks. Doesn't mean they have to be babied like your pal Spot, but presuming all wear and tear has been identical forever seems ill-advised.

Averting excessive oopses

Posted Nov 29, 2022 11:27 UTC (Tue) by farnz (subscriber, #17727)

With a fleet of cattle, as opposed to a fleet of pets, you don't presume that all wear and tear is identical - you have monitoring that measures the state of each machine individually, and handles it. E.g. your monitoring detects that machine 10,994,421 is frequently throwing ECC corrected errors, and it migrates all jobs off that machine and sends it off for repair. Or you detect filesystem corruption, and migrate everything off then send that machine to repair.

The key is that automation handles everything if you have cattle, not pets. Instead of knowing that http10294 is a unique snowflake with its own particular problems and fixes, you have your automation look at the state of all hardware, and identify and deal with the problems of the machines en masse.
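As a rough sketch of that kind of per-machine triage (in Python, with everything invented for illustration: the MachineHealth fields, the ECC_THRESHOLD value and the triage function are assumptions, not any real fleet-management API):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MachineHealth:
    name: str
    ecc_corrected_per_day: int  # assumed metric exported by monitoring
    fs_corruption: bool         # assumed flag set by a filesystem check


# Assumed threshold; a real fleet would tune this per hardware type.
ECC_THRESHOLD = 100


def triage(fleet: List[MachineHealth]) -> List[Tuple[str, str]]:
    """Return (machine, reason) pairs for machines to drain and repair."""
    actions = []
    for m in fleet:
        if m.fs_corruption:
            actions.append((m.name, "filesystem corruption"))
        elif m.ecc_corrected_per_day > ECC_THRESHOLD:
            actions.append((m.name, "frequent ECC corrected errors"))
    return actions
```

The point is only that each machine is judged on its own measured state, and the same rules apply to every machine; nobody has to remember which box is the quirky one.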

Averting excessive oopses

Posted Nov 29, 2022 16:14 UTC (Tue) by Wol (subscriber, #4433)

> With a fleet of cattle, as opposed to a fleet of pets,

Shouldn't that be a hurd?

(G, D&R)

Cheers,
Wol

Averting excessive oopses

Posted Nov 29, 2022 22:47 UTC (Tue) by jccleaver (guest, #127418)

> The key is that automation handles everything if you have cattle, not pets.

At the risk of mixing metaphors further: you can automate pets as well. Plenty of shops use robust config management but don't (for a variety of reasons) presume that indefinite horizontal scaling is the proper solution to all of life's problems. They can have a lot of systems running where, for each type of "cattle", n <= 4; combined with data locality, master/standby, or prod/test, that is very much in the pet range, whatever number your tweaks go to.

Not everyone is Google Scale, and designing for Google Scale -- or creating enough complexity so that you can self-fulfillingly justify designing for Google Scale -- constitutes owning only the world's largest k8s hammer.

Averting excessive oopses

Posted Nov 30, 2022 12:12 UTC (Wed) by farnz (subscriber, #17727)

Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.

The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation, through to the "cattle" where you have so many servers to look after that doing admin by hand is simply too time-consuming, and thus you must automate everything so that any server that leaves the "expected" state is either repaired by automation or kicked out as "definite hardware fault" by automation, with clear instructions on how to fix or replace the faulty hardware.

Where you need to be on that line depends on how much time you have to do admin. If you have 40 hours a week to look after a single server that runs a single service, then pets are absolutely fine. If you have an hour a month to look after 10,000 servers, then you need to treat them as cattle. If you have 8 hours a week and 10 servers, then you're probably somewhere in the middle - you want to have the baseline dealt with by automation (things like checking for hardware failures, making sure the underlying OS is up to date etc), but you can treat the services on top of the hardware (the things you care about like a database setup, a web server, your organisation's secret special code etc) as pets.

And it's a good rule of thumb to assume that anything you do more than 3 times should be automated. For a small shop, this means that the things that recur have you treating the servers as cattle (why care about manually checking whether both disks are present in the RAID-1, or that btrfs checksums match data, when the server can do that automatically and notify you when bad things happen? Why manually choose all the options in the installer, when kickstart or equivalents can do that for you?) but once-offs (there's been a schema change from 2021-09 to 2021-10 deploy, needs manual application) are done by hand, treating the affected servers as pets.
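For the RAID-1 example, a check along these lines is the sort of thing worth automating. This is a minimal Python sketch that assumes Linux md software RAID and the usual /proc/mdstat layout, where an array's status line carries a "[total/active]" member count such as [2/2]. The degraded_arrays function is a made-up name, and it only recognises arrays named mdN.

```python
import re
from typing import List

# Matches the "[total/active]" member count on an array's status line.
_MEMBERS = re.compile(r"\[(\d+)/(\d+)\]")


def degraded_arrays(mdstat_text: str) -> List[str]:
    """Return names of md arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ...".
            current = header.group(1)
            continue
        counts = _MEMBERS.search(line)
        if current and counts:
            total, active = int(counts.group(1)), int(counts.group(2))
            if active < total:
                degraded.append(current)
            # Only the first count line after a header belongs to the array.
            current = None
    return degraded
```

Run from cron or a timer against the contents of /proc/mdstat, with a notification when the returned list is non-empty, this replaces the manual "are both disks still there?" check.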

Averting excessive oopses

Posted Nov 30, 2022 16:30 UTC (Wed) by Wol (subscriber, #4433)

> Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.

> The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation,

Even with a single, "pet" computer, automating admin where possible just makes sense. My /usr/local/bin has a few scripts (I ought to write more) that do assorted things like creating lvm volumes to assist my weekly backups/emerges, and there's plenty of stuff I ought to do regularly, that scripts would help with.

Cheers,
Wol