Sounds like double ungood day for Linode
Posted Jul 28, 2025 11:12 UTC (Mon)
by intelfx (subscriber, #130118)
Parent article: LWN is back
Posted Jul 28, 2025 11:17 UTC (Mon)
by corbet (editor, #1)
[Link] (21 responses)
You're certainly not the only one wanting to learn something about what happened ... there is the ongoing incident report, but it is less than illuminating. It kind of reminds me of a delayed flight that is always going to board after another 15 minutes.
Posted Jul 28, 2025 11:39 UTC (Mon)
by mote (guest, #173576)
[Link] (19 responses)
Linode employees are attempting to recover the entire DC runtime from a hard cold machine start, no small feat - I've worked at a company like this that had a similar outage in the late 2000s due to a truck hitting the building (yeah); once the cooling goes, you *have* to shut down servers or electronic components start to melt themselves.
Posted Jul 28, 2025 13:14 UTC (Mon)
by iabervon (subscriber, #722)
[Link] (8 responses)
Of course, losing power generally means you don't have the servers up longer than the cooling, because there's no point being able to power machines that are just going to overheat. The surprising situation I remember hearing about is when the building loses its water supply, at which point the computers aren't lacking anything they need but the cooling won't work. (If I recall correctly, maintenance work on one of the building's water supplies accidentally damaged the building's other water supply, which is obviously physically nearby.)
Posted Jul 28, 2025 13:46 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (7 responses)
A previous employer of mine with their own DCs gradually worked towards the point where we could as a matter of routine shut down an entire DC, and bring it back up from cold.
Turns out that when you actively try to build a runbook that lets your operations team shut down the DC and start it back up from scratch, you find an awful lot of unintentionally circular dependencies - and all of the circular dependencies you find are a result of locally sensible reasoning that didn't take into account the complexity of the DC as a whole.
And I suspect that if you could review the chain of decisions that led to "door badge system depends on DC being up and running", every single one would make sense in isolation; it's only when you test "what happens when we take the DC down remotely? Can we get in?" that you discover that these decisions have chained together to imply "when the DC is down, door badge system is also down".
Posted Jul 28, 2025 14:19 UTC (Mon)
by paulj (subscriber, #341)
[Link] (6 responses)
Posted Jul 28, 2025 16:05 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (5 responses)
For example, the door badge system might be configured to automatically sync access permissions from the internal directories (so that you don't have to manually update DC access permissions as people join and leave), and to fail closed if the internal directories, including all mirrors in other DCs, are unavailable (on the basis that you don't want an attacker to be able to cut the OOB Internet link and get in). Then you allow the badge system to use the DC's Internet connection (because it's more reliable than the OOB link, so you get fewer problems where it can't sync), and later you terminate the OOB Internet link for the badge system entirely (on the basis that the DC has redundant fibres and routers, so it won't go down, whereas the OOB link is a consumer link without redundancy). And then you update the config management system so that, for all but legacy systems (like the DC routers), a change autoreverts if you don't confirm within 5 minutes that it's good, so people develop a habit of testing rather than carefully confirming that they've not made a mistake.
All these decisions sound reasonable in isolation, but they chain together into a world where a mistake with the DC's router configurations results in the door badge system locking you out.
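To make that chain concrete, here is a minimal sketch in Python (invented names and logic, not any real badge controller's code) of how the three decisions combine once the DC network goes away:

    def badge_grants_entry(dc_network_up: bool, oob_link_present: bool,
                           directory_up: bool) -> bool:
        # Decision 3: the OOB link was decommissioned, so the directory is
        # reachable only over the DC network.
        directory_reachable = directory_up and (dc_network_up or oob_link_present)
        # Decision 2: fail closed when no directory (or mirror) is reachable.
        if not directory_reachable:
            return False
        # Decision 1: otherwise trust the permissions synced from the directory;
        # assume this person is on the access list.
        return True

    # A router misconfiguration takes the DC network down; nothing else is broken:
    print(badge_grants_entry(dc_network_up=False, oob_link_present=False,
                             directory_up=True))   # -> False: everyone is locked out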
Posted Jul 28, 2025 21:02 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (4 responses)
The interesting part about cyclic dependencies is that you have customers, who need to be notified when their servers go belly-up for some reason (including when the router they're behind blows a fuse). Thus you need dependency tracking anyway. Which presumably should alert you when there are any cycles in that graph. Which should prevent this from happening. Famous last words …
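As a minimal sketch of that cycle check (component names invented for illustration): if you already keep a dependency graph for notifications, Python's standard graphlib will refuse to produce a cold-start order when the graph contains a cycle:

    from graphlib import TopologicalSorter, CycleError

    # node -> set of things it depends on
    deps = {
        "door-badge-system":  {"internal-directory"},
        "internal-directory": {"dc-network"},
        "dc-network":         {"dc-routers"},
        "dc-routers":         {"config-management"},
        "config-management":  {"dc-network"},   # oops: a cycle
    }

    try:
        order = list(TopologicalSorter(deps).static_order())
        print("no cycles; cold-start order:", " -> ".join(order))
    except CycleError as e:
        print("circular dependency:", " -> ".join(e.args[1]))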
LLMs looking over your recovery plan
Posted Jul 29, 2025 0:08 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Jul 29, 2025 7:33 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (2 responses)
Posted Jul 29, 2025 11:02 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
It's obviously not going to know about problems like "Bob has the key, but it is lost on his 100-key keyring", but it can (hopefully) notice things like "it says to get the key, but not where the key lives" - details that may be implicit knowledge in the author's eyes. Instructions like these really should get a thorough once-over from someone *not* intimately familiar with the process, to help rid them of such implicit assumptions.
Posted Jul 29, 2025 11:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
It can also do things like decide that certain job titles imply particular people (not reliably, but often enough to be useful), and then highlight that your recovery plans depend on Chris being present at site, because Chris is the Site Manager, the Health and Safety Lead, and the Chief Electrician, and your plans depend on the Site Manager, the Health and Safety Lead, the Chief Electrician, or the Deputy Chief Electrician being on site.
While the LLM made a mistake here (since there are two people you rely on, not one), it still highlighted that a document saying that 1 of 4 roles is needed to recover has, in fact, become a dependency on 1 of 2 people.
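A rough sketch of that role-to-person resolution (all names and role assignments invented for illustration):

    role_holder = {
        "Site Manager":             "Chris",
        "Health and Safety Lead":   "Chris",
        "Chief Electrician":        "Chris",
        "Deputy Chief Electrician": "Sam",
    }

    # The plan says any one of these four roles can authorize recovery ...
    required_roles = ["Site Manager", "Health and Safety Lead",
                      "Chief Electrician", "Deputy Chief Electrician"]

    # ... but resolving roles to humans shows how few people that really is.
    people = sorted({role_holder[role] for role in required_roles})
    print(f"{len(required_roles)} roles resolve to {len(people)} people: {people}")
    # -> 4 roles resolve to 2 people: ['Chris', 'Sam']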
Posted Jul 28, 2025 13:50 UTC (Mon)
by jzb (editor, #7867)
[Link]
Yeah, losing HVAC/cooling is no fun. I worked for a Denver-based hosting company in 2003-2004 in the NOC overnights—the control panel for HVAC went on the fritz and nothing could be done to bring it back up. At 2 a.m. or so on a Sunday morning. It was a complete and utter nightmare. I had to call in a bunch of folks and we ran around shutting things down and notifying customers. I was somewhat gratified that the senior folks had no more success than I had had trying to bring the HVAC back up... IIRC some part had shorted out or whatever on the control panel. (It was a long time ago...)
Posted Jul 28, 2025 23:05 UTC (Mon)
by bracher (subscriber, #4039)
[Link] (3 responses)
Naturally both grids eventually _did_ go down, and the generators kicked in, all according to plan. And a few minutes later all of the generators died. It turned out that they had procedures for _many_ things, including testing the failover to the generators on a monthly basis. The one thing they did _not_ have a procedure for was refueling the generators. So when the generators were eventually needed, they had only a few minutes of fuel left.
Posted Jul 29, 2025 6:42 UTC (Tue)
by MortenSickel (subscriber, #3238)
[Link] (2 responses)
Just a couple of weeks before, we had done a system test and had the yearly service. It turned out there was some contamination in the cooling liquid that clogged the radiator, which had not been detected during the service... The company doing the service got a few questions to answer, but as mentioned elsewhere here, the good thing is that we learned how to get the DC up from nothing.
Posted Jul 29, 2025 7:56 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
The generator in this incident had a five-minute header tank, which was refilled by a pump from the main tank. Guess what the pump was plugged into ... the mains ...
Cheers,
Wol
Posted Jul 29, 2025 14:08 UTC (Tue)
by archaic (subscriber, #111970)
[Link]
Posted Jul 29, 2025 17:44 UTC (Tue)
by yodermk (subscriber, #3803)
[Link]
Haha, I was there working support that night. Fun times!
Posted Jul 30, 2025 9:19 UTC (Wed)
by anselm (subscriber, #2796)
[Link]
Way back at university we once had a week or so of unscheduled downtime because legionella bacteria had been discovered in the main campus server room AC and they had to shut everything down in order to get the cooling system disinfected. This was at the height of summer, of course, with temperatures in the 30°C+ range outside.
It sucked especially because, apart from the NFS servers that everybody's files were sitting on (even if they had their own workstation on their desk), the shutdown included the machine everyone was reading their e-mail on. (This was in the early 1990s, when e-mail was still much more of a thing than it is today but POP3/IMAP hadn't caught on, so you would telnet to the e-mail server and use the mail command there, or MH if you were a serious e-mail nerd.) Some sysadmins spent sleepless nights jury-rigging a replacement system in some cool and airy location elsewhere on the premises just so people could get at their e-mail.
Corporate separation
Posted Aug 7, 2025 22:36 UTC (Thu)
by bartoc (guest, #124262)
[Link] (2 responses)
I think it's because C-corporation taxation is quite punishing compared to other ownership structures, particularly for capital gains realization.
Posted Aug 11, 2025 10:02 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
I've noted a common pattern across the world (including the EU) where a publicly traded holding company owns two sub-companies: one owns capital assets like land and buildings, which it leases to the other sub-company that operates them.
The purpose is not taxation-related in the cases I've seen; rather, it's that the sub-companies have separately audited accounts, and thus you can be confident (as an investor in the holding company) that you're not seeing results "goosed" by accounting tricks to move losses between capital and operational sides. Instead, you've got an audited set of accounts for the capital side, and a separate set for the operational side, which ensures that you cannot be tricked by accounting games.
Posted Aug 11, 2025 14:03 UTC (Mon)
by corbet (editor, #1)
[Link]
That separation also insulates one company from a failure of the other; if the company using the building goes bankrupt, the building itself is unaffected (beyond needing to find a new tenant).
Posted Jul 29, 2025 10:35 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link]
I was at a complex London terminus. They told us that no trains from one set of platforms were running, due to a tree falling in the storm - which makes sense: tree on line, can't move trains, no service, checks out.
But then every few minutes they would announce "one" exception which was running, the next train for those platforms, and they continued doing this for over an hour. "No trains" "Except the 1145" "No trains" "Except the 1152" "No trains" "Except the 1156" and so on. Rather than saying "OK, the trains are running normally, but...", or actually not running trains because there's a tree in the way, they had decided to gaslight their customers by insisting there wouldn't be any trains, and then nonetheless running every single train on schedule anyway.
I assume there's some mismatch between the strategy for running a day's trains (tree blocks line, OK, no trains) and the tactics for running individual services (we can just avoid the closed section, for every single train), and it wasn't anybody's job to ensure the outcome made coherent sense.
The result, however, was that the normal departure boards didn't work - there aren't any trains, they've just said so - so you couldn't tell when your train would run or where from. But if you just gave up and left the station, your train ran as scheduled anyway and you'd missed it; you needed to know where it ought to be, and then get aboard when it inevitably ran after all. Very silly.