There are going to be a lot of articles about how people coped with “Superstorm Sandy” over the next days, weeks and months. “Post-mortem of October 22, 2012 AWS degradation” is from the Netflix Tech Blog.
Probably the most important part of the Netflix piece comes at the end, under the heading “Embracing Mistakes”:
Every time we have an outage, we make sure that we have an incident review. The purpose of these reviews is not to place blame, but to learn what we can do better. After each incident we put together a detailed timeline and then ask ourselves, “What could we have done better? How could we lessen the impact next time? How could we have detected the problem sooner?” We then take those answers and try to solve classes of problems instead of just the most recent problem. This is how we develop our best practices.
Of course, the Netflix team’s efforts weren’t a matter of life or death, but the lessons to be learned from their preparation and response are the kind that could make the difference for organizations whose critical services are.
If you’re involved in supporting or designing critical infrastructure, take some time to read through the post. I think you might find it both informative and inspiring (hint: there’s a happy ending).