This is Part 3 of Load Impact’s Velocity NY Preview Series. Load Impact is chatting with some of the cutting-edge developers and executives who will be speaking at Velocity NY Oct. 12-14.
“It was impossible to get regular work done because we were running around putting out fires all day.”
Does that sound familiar?
When it comes to your website, app, API, SaaS product or infrastructure, a minor problem can turn into a major crisis very quickly, and that can hurt your reputation with customers and cost you time and money.
That’s why Blackrock 3 Partners, a team made up of firefighters and technology professionals, is coming to Velocity NY to teach you the finer points of incident management.
In their tutorial, Incident Management for DevOps, Rob Schnepp, Ron Vidal and Chris Hawley will demonstrate the parallels between putting out a five-alarm fire in an apartment building and responding to a data breach.
“There’s a lot of interest in how the fire service does business because we look organized and it works,” said Schnepp. “But there’s a mystique about it because not everyone understands how organized and structured it really is.”
Blackrock 3 uses terms like “Peacetime vs. Wartime” communication and operations, “war games in production” and other phrases traditionally used by the military.
That’s not because a crashed server is equivalent to a person being seriously injured in battle; it’s because handling adverse conditions is a skill that can be learned, practiced and fine-tuned.
The team at Blackrock 3 stresses that software companies can create an ecosystem to respond to emergencies, minimize impact and learn from those experiences. That includes setting strategies for immediate response, practicing how to start correcting problems in the middle of the crisis and designating an “incident commander.”
To do that, Blackrock 3 often turns to its “war games in production” strategy with clients, which can surprise some of them.
“There are times where we go in to work with a company and plan to break stuff on purpose,” said Vidal. “Sometimes people are taken aback by that at first, but how else can you prepare for the randomness of the world unless you really have to solve a problem under some level of pressure?”
After an incident has been controlled and resolved, Blackrock 3 puts a heavy focus on thorough after-action reviews, commonly known as “post mortems.” Emergency services even have a structured plan for these reviews, which is another practice Blackrock 3 is bringing to its partners.
“Post mortems almost always focus on the technology aspect of a problem,” said Schnepp. “They rarely evaluate the human response and how to make that better.”
Blackrock 3 suggests striving for honest, blame-free after-action reviews that analyze people’s thought processes and logic during a crisis, as well as how future training can improve responses.
While people normally wouldn’t think the fire department or other emergency services have much in common with technology companies on the surface, Schnepp and Vidal said startup founders, CTOs and everyone they’ve worked with “gets it” from the beginning.
“The same management tactics people use on oil spills can work in the tech business,” said Schnepp. “It’s not a magical formula, but the results are magical.”
Check out Blackrock 3’s Book
The team’s vast experience responding to a wide range of catastrophic events not only led them to form Blackrock 3, but also, more recently, to write the book Incident Management for Operations, published by O’Reilly Media.