LWN: Comments on "Netflix releases open-source crisis-management tool" https://lwn.net/Articles/824739/ This is a special feed containing comments posted to the individual LWN article titled "Netflix releases open-source crisis-management tool". en-us Fri, 05 Sep 2025 16:10:52 +0000 Fri, 05 Sep 2025 16:10:52 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Netflix releases open-source crisis-management tool https://lwn.net/Articles/825340/ https://lwn.net/Articles/825340/ ncm <div class="FormattedComment"> A determiner of whether an organization is serious about reliability is its willingness to engineer its own scheduled and unscheduled equipment- and data center outages.<br> <p> There should be an absolute minimum of possible responses to failure, so that all are exercised frequently. The most likely failure mode of any intended-reliable system is in its failure recovery mechanisms. In a good design, they are the same as are used to bring regular services up and down.<br> <p> <p> </div> Sun, 05 Jul 2020 23:26:27 +0000 Netflix releases open-source crisis-management tool https://lwn.net/Articles/825307/ https://lwn.net/Articles/825307/ k3ninho <div class="FormattedComment"> I understand that coordination tools are less important to small teams but &#x27;one-off&#x27; incidents, especially security, need you to stay calm with a proven playbook, share the facts that drive your decisions and log the actions you&#x27;ve taken to a timeline. That&#x27;s why I was disappointed to read this:<br> <p> <font class="QuotedText">&gt;When trying to move quickly to address a one-off security incident, not having to learn a new tool to do so is a positive. This makes sense for an organization with a staff the size of Netflix, where teams may not all work directly together. Notably it could also be overkill for smaller organizations, where there is more day-to-day interaction between all of the personnel. 
</font><br> <p> The cost/benefit for smaller organisations is sometimes a sticking point, but I like turning incident-handling scripts into playbooks so you lock in how to return a system to a working state -- an approach which is a huge gain whatever the size of your organisation. The win is for a reason that side-steps cost/benefit: it takes the pressure off thinking creatively so you can attend to making a correct diagnosis and applying fixes you&#x27;ve already proven to the systems that need them (and often confusion over terminal windows means you mess with the wrong machines). It&#x27;s great for smaller organisations as much as incident management tools are needed to reach between the different siloes of skill in larger organisations. <br> <p> K3n.<br> </div> Sun, 05 Jul 2020 13:40:46 +0000 Netflix releases open-source crisis-management tool https://lwn.net/Articles/825223/ https://lwn.net/Articles/825223/ NYKevin <div class="FormattedComment"> (In this comment, &quot;SRE&quot; means &quot;site reliability engineering&quot; - it&#x27;s a specific discipline focused on engineering production systems to make them more reliable with less manual human labor.)<br> <p> <font class="QuotedText">&gt; So far, the project doesn&#x27;t appear to have formed a strong community outside of the original developers at Netflix, and it is difficult to understand exactly how many outside contributors there have been.</font><br> <p> That is a crying shame. In my 9-5, our proprietary equivalent has been extremely valuable. If anyone is looking for a quick way to &quot;get better at SRE,&quot; incident management is not a bad place to start. In particular, automatically creating postmortems, and importing relevant data into the template, makes them far more likely to actually get written. This, in turn, makes it easier for the organization to create and track action items, and to prioritize them. 
Those who refuse to learn from past outages are doomed to repeat them.<br> <p> Ultimately, however, SRE is more than a box of tools. Your organization has to actively value reliability, and it must be willing to sacrifice a little development velocity in the short run (but not in the long run - the whole point of SRE is to maximize both reliability and velocity or &quot;move fast and don&#x27;t break things&quot;). If an organization&#x27;s hierarchical structure is too rigid, or too sales-driven, no amount of tooling will save you.<br> </div> Fri, 03 Jul 2020 17:35:35 +0000