Sounds like double ungood day for Linode
Posted Jul 28, 2025 11:12 UTC (Mon)
by intelfx (subscriber, #130118)
Parent article: LWN is back
Posted Jul 28, 2025 11:17 UTC (Mon)
by corbet (editor, #1)
[Link] (21 responses)
You're certainly not the only one wanting to learn something about what happened ... there is the ongoing incident report, but it is less than illuminating. It kind of reminds me of a delayed flight that is always going to board after another 15 minutes.
Posted Jul 28, 2025 11:39 UTC (Mon)
by mote (guest, #173576)
[Link] (19 responses)
Linode employees are attempting to recover the entire DC runtime from a hard cold machine start, no small feat - I've worked at a company like this that had a similar outage in the late 2000s due to a truck hitting the building (yeah); once the cooling goes, you *have* to shut down servers or electronic components start to melt themselves.
Posted Jul 28, 2025 13:14 UTC (Mon)
by iabervon (subscriber, #722)
[Link] (8 responses)
Of course, losing power generally means you don't have the servers up longer than the cooling, because there's no point being able to power machines that are just going to overheat. The surprising situation I remember hearing about is when the building loses its water supply, at which point the computers aren't lacking anything they need but the cooling won't work. (If I recall correctly, maintenance work on one of the building's water supplies accidentally damaged the building's other water supply, which is obviously physically nearby.)
Posted Jul 28, 2025 13:46 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (7 responses)
A previous employer of mine with their own DCs gradually worked towards the point where we could as a matter of routine shut down an entire DC, and bring it back up from cold.
Turns out that when you actively try to build a runbook that lets your operations team shut down the DC and start it back up from scratch, you find an awful lot of unintentionally circular dependencies - and all of the circular dependencies you find are a result of locally sensible reasoning that didn't take into account the complexity of the DC as a whole.
And I suspect that if you could review the chain of decisions that led to "door badge system depends on DC being up and running", every single one would make sense in isolation; it's only when you test "what happens when we take the DC down remotely? Can we get in?" that you discover that these decisions have chained together to imply "when the DC is down, door badge system is also down".
Posted Jul 28, 2025 14:19 UTC (Mon)
by paulj (subscriber, #341)
[Link] (6 responses)
Posted Jul 28, 2025 16:05 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (5 responses)
For example, the door badge system might be configured to automatically sync access permissions from the internal directories (so that you don't have to manually update DC access permissions as people join and leave), and to fail closed if the internal directories, including all mirrors in other DCs, are unavailable (on the basis that you don't want an attacker to be able to cut the OOB Internet link and get in). Then you allow the badge system to use the DC's Internet connection (because it's more reliable than the OOB link, so you get fewer problems where it can't sync), and later you terminate the OOB Internet link for the badge system entirely (on the basis that the DC has redundant fibres and routers, so it won't go down, whereas the OOB link is a consumer link without redundancy). And then you update the config management system so that, for all but legacy systems (like the DC routers), a change autoreverts if you don't confirm within 5 minutes that it's good, so people develop a habit of testing rather than carefully confirming that they've not made a mistake.
All these decisions sound reasonable in isolation, but they chain together into a world where a mistake with the DC's router configurations results in the door badge system locking you out.
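To make that chain concrete, here is a minimal sketch in Python (invented names and logic, not any real badge controller's code) of how the three decisions combine once the DC network goes away:

    def badge_grants_entry(dc_network_up: bool, oob_link_present: bool,
                           directory_up: bool) -> bool:
        # Decision 3: the OOB link was decommissioned, so the directory is
        # reachable only over the DC network.
        directory_reachable = directory_up and (dc_network_up or oob_link_present)
        # Decision 2: fail closed when no directory (or mirror) is reachable.
        if not directory_reachable:
            return False
        # Decision 1: otherwise trust the permissions synced from the directory;
        # assume this person is on the access list.
        return True

    # A router misconfiguration takes the DC network down; nothing else is broken:
    print(badge_grants_entry(dc_network_up=False, oob_link_present=False,
                             directory_up=True))   # -> False: everyone is locked out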
Posted Jul 28, 2025 21:02 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (4 responses)
The interesting part about cyclic dependencies is that you have customers, who need to be notified when their servers go belly-up for some reason (including when the router they're behind blows a fuse). Thus you need dependency tracking anyway. Which presumably should alert you when there are any cycles in that graph. Which should prevent this from happening. Famous last words …
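As a minimal sketch of that cycle check (component names invented for illustration): if you already keep a dependency graph for notifications, Python's standard graphlib will refuse to produce a cold-start order when the graph contains a cycle:

    from graphlib import TopologicalSorter, CycleError

    # node -> set of things it depends on
    deps = {
        "door-badge-system":  {"internal-directory"},
        "internal-directory": {"dc-network"},
        "dc-network":         {"dc-routers"},
        "dc-routers":         {"config-management"},
        "config-management":  {"dc-network"},   # oops: a cycle
    }

    try:
        order = list(TopologicalSorter(deps).static_order())
        print("no cycles; cold-start order:", " -> ".join(order))
    except CycleError as e:
        print("circular dependency:", " -> ".join(e.args[1]))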
LLMs looking over your recovery plan
Posted Jul 29, 2025 0:08 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Jul 29, 2025 7:33 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (2 responses)
Posted Jul 29, 2025 11:02 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
It's obviously not going to know about problems like "Bob has the key, but it is lost on his 100-key keyring", but it can (hopefully) notice things like "it says to get the key, but not where the key lives" - details that may be implicit knowledge in the author's eyes. Instructions like these really should get a thorough once-over from someone *not* intimately familiar with the process, to help rid them of such implicit assumptions.
Posted Jul 29, 2025 11:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
It can also do things like decide that certain job titles imply particular people (not reliably, but often enough to be useful), and then highlight that your recovery plans depend on Chris being present at site, because Chris is the Site Manager, the Health and Safety Lead, and the Chief Electrician, and your plans depend on the Site Manager, the Health and Safety Lead, the Chief Electrician, or the Deputy Chief Electrician being on site.
While the LLM made a mistake here (since there are two people you rely on, not one), it still highlighted that a document saying that 1 of 4 roles is needed to recover has, in fact, become a dependency on 1 of 2 people.
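A rough sketch of that role-to-person resolution (all names and role assignments invented for illustration):

    role_holder = {
        "Site Manager":             "Chris",
        "Health and Safety Lead":   "Chris",
        "Chief Electrician":        "Chris",
        "Deputy Chief Electrician": "Sam",
    }

    # The plan says any one of these four roles can authorize recovery ...
    required_roles = ["Site Manager", "Health and Safety Lead",
                      "Chief Electrician", "Deputy Chief Electrician"]

    # ... but resolving roles to humans shows how few people that really is.
    people = sorted({role_holder[role] for role in required_roles})
    print(f"{len(required_roles)} roles resolve to {len(people)} people: {people}")
    # -> 4 roles resolve to 2 people: ['Chris', 'Sam']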
Posted Jul 28, 2025 13:50 UTC (Mon)
by jzb (editor, #7867)
[Link]
Yeah, losing HVAC/cooling is no fun. I worked for a Denver-based hosting company in 2003-2004 in the NOC overnights—the control panel for HVAC went on the fritz and nothing could be done to bring it back up. At 2 a.m. or so on a Sunday morning. It was a complete and utter nightmare. I had to call in a bunch of folks and we ran around shutting things down and notifying customers. I was somewhat gratified that the senior folks had no more success than I had had trying to bring the HVAC back up... IIRC some part had shorted out or whatever on the control panel. (It was a long time ago...)
Posted Jul 28, 2025 23:05 UTC (Mon)
by bracher (subscriber, #4039)
[Link] (3 responses)
Naturally both grids eventually _did_ go down, and the generators kicked in, all according to plan. And a few minutes later all of the generators died. It turned out that they had procedures for _many_ things, including testing the failover to the generators on a monthly basis. The one thing they did _not_ have a procedure for was refueling the generators. So when the generators were eventually needed, they had only a few minutes of fuel left.
Posted Jul 29, 2025 6:42 UTC (Tue)
by MortenSickel (subscriber, #3238)
[Link] (2 responses)
Just a couple of weeks before, we had done a system test and had the yearly service. It turned out there was some contamination in the cooling liquid that clogged the radiator, which had not been detected during the service... The company doing the service got a few questions to answer, but as mentioned elsewhere here, the good thing is that we learned how to get the DC up from nothing.
Posted Jul 29, 2025 7:56 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
The generator in this incident had a five-minute header tank, which was refilled by a pump from the main tank. Guess what the pump was plugged into ... the mains ...
Cheers,
Wol
Posted Jul 29, 2025 14:08 UTC (Tue)
by archaic (subscriber, #111970)
[Link]
Posted Jul 29, 2025 17:44 UTC (Tue)
by yodermk (subscriber, #3803)
[Link]
Haha, I was there working support that night. Fun times!
Posted Jul 30, 2025 9:19 UTC (Wed)
by anselm (subscriber, #2796)
[Link]
Way back at university we once had a week or so of unscheduled downtime because legionella bacteria had been discovered in the main campus server room AC and they had to shut everything down in order to get the cooling system disinfected. This was at the height of summer, of course, with temperatures in the 30°C+ range outside.
It sucked especially because, apart from the NFS servers that everybody's files were sitting on (even if they had their own workstation on their desk), the shutdown included the machine everyone was reading their e-mail on. (This was in the early 1990s, when e-mail was still much more of a thing than it is today but POP3/IMAP hadn't caught on, so you would telnet to the e-mail server and use the mail command there, or MH if you were a serious e-mail nerd.) Some sysadmins spent sleepless nights jury-rigging a replacement system in some cool and airy location elsewhere on the premises just so people could get at their e-mail.
Corporate separation
Posted Aug 7, 2025 22:36 UTC (Thu)
by bartoc (guest, #124262)
[Link] (2 responses)
I think it's because C-corporation taxation is quite punishing compared to other ownership structures, particularly for capital gains realization.
Posted Aug 11, 2025 10:02 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
I've noted a common pattern across the world (including the EU) where a publicly traded holding company owns two sub-companies: one owns capital assets like land and buildings, which it leases to the other sub-company that operates them.
The purpose is not taxation-related in the cases I've seen; rather, it's that the sub-companies have separately audited accounts, and thus you can be confident (as an investor in the holding company) that you're not seeing results "goosed" by accounting tricks to move losses between capital and operational sides. Instead, you've got an audited set of accounts for the capital side, and a separate set for the operational side, which ensures that you cannot be tricked by accounting games.
Posted Aug 11, 2025 14:03 UTC (Mon)
by corbet (editor, #1)
[Link]
That separation also insulates one company from a failure of the other; if the company using the building goes bankrupt, the building itself is unaffected (beyond needing to find a new tenant).
Posted Jul 29, 2025 10:35 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link]
I was at a complex London terminus. They told us that no trains from one set of platforms were running, due to a tree falling in the storm - which makes sense: tree on line, can't move trains, no service, checks out.
But then every few minutes they would announce "one" exception which was running, the next train for those platforms, and they continued doing this for over an hour. "No trains" "Except the 1145" "No trains" "Except the 1152" "No trains" "Except the 1156" and so on. Rather than saying "OK, the trains are running normally, but...", or actually not running trains because there's a tree in the way, they had decided to gaslight their customers by insisting there wouldn't be any trains, and then nonetheless running every single train on schedule anyway.
I assume there's some mismatch between the strategy for running a day's trains (tree blocks line, OK, no trains) and the tactics for running individual services (we can just avoid the closed section, for every single train), and it wasn't anybody's job to ensure the outcome made coherent sense.
The result, however, was that the normal departure boards didn't work - there aren't any trains, they've just said so - so you couldn't tell when your train would run or where from. But if you just gave up and left the station, your train ran as scheduled anyway and you'd missed it; you needed to know where it ought to be, and then get aboard when it inevitably ran after all. Very silly.