Insight into Site Reliability Engineering with Niall Murphy

In a recent podcast episode, I was lucky enough to speak with Niall Murphy about the role of Site Reliability Engineering (SRE). Niall contributed to the seminal SRE book and has many years of experience in the field, so it was an honour to get the opportunity to chat with him.

Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organisations.

The underlying incentives of Development and Operations can seem to be at odds with each other. One party (Dev) wishes to make changes and add new features, whilst the other (Ops) works to ensure that the product or service does not break. The catch is that every change to the product increases the chance of something breaking.

As a result, many forms of gatekeeping (launch reviews, deep-dives and checklists) have been put in place to ‘help’ mitigate the friction between the two parties, but these by no means solve the underlying problem. It was very interesting to hear Niall share his experience with these problems and explain how the role and philosophy of SRE help remedy them. During the episode, we delved into some of the key components of SRE, from the value of having an error budget to the realisation that striving for 100% uptime is actually detrimental to the product itself!
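
To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes the common convention that the budget is simply one minus the availability target, measured over a 30-day window. The function names and figures are hypothetical illustrations, not something taken from the episode.

```python
# Illustrative sketch only: the names, the 30-day window and the example
# figures are assumptions made for this post, not taken from the episode.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability that an availability target permits."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 1.0):
        budget = error_budget_minutes(slo)
        print(f"target {slo:.3%}: {budget:.1f} minutes of budget per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of budget per 30 days, which Dev and Ops can agree to spend on risky changes. A 100% target leaves a budget of zero, so any change that carries risk is effectively ruled out, which is one way of seeing why chasing 100% uptime can hurt the product.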

You can listen to the episode in its entirety below, or by subscribing to the podcast.