I recently discovered a podcast I’ve been finding really interesting. It’s called Signals to Danger, and in each episode, railwayman Daniel Fox covers a different accident from British railway history, looking at how it happened and how the lessons learned were applied to the system going forward. As I’ve listened through the episodes, I’ve found myself thinking about the parallels between railway safety and maintaining a resilient software project. So please indulge me in perhaps the nerdiest piece I’ll ever write.
Lesson 1: Take post-mortems seriously
In each episode of Signals to Danger, Dan refers extensively to the official report produced after the accident. This is as true of tragedies with triple-digit death tolls as it is of lucky near-misses. These reports never take anything for granted. They never jump to conclusions. They rely only on the evidence at hand.
Just as important as understanding why an accident happened is understanding what can be done to reduce the risk of a repeat. The official reports after a railway incident invariably feature recommendations for safety improvements. The British railway network grew up a little like an early-stage software project in a startup: haphazardly and without many safeguards. Unfortunately, the stakes were a lot higher. But each failure led directly to changes, from better working practices to technological solutions.