Tag incident-management

 Bookmark

Re-framing how we think about production incidents

This is a great post by Shubheksha about the right way to talk about production issues.

A blameless culture makes it easier for new or junior engineers to get started on production systems, and makes everyone more comfortable working on things when they know the blame won't be pointed at them.

I've found that, at work, diagnosing issues in our staging environment has been a great experience. It's been good practice for dealing with production-like issues in a non-production environment, as it gives me time to breathe, experiment and learn, as well as a much greater understanding of the end-to-end system.

Recommended read: Re-framing how we think about production incidents https://shubheksha.com/posts/2019/04/re-framing-how-we-think-about-production-incidents/

 Bookmark

How we respond to incidents

As I've said before, I'm a big fan of how Monzo handles their production incidents, because their process is polished and transparent.

Recommended read: How we respond to incidents https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents

 Bookmark

https://monzo.com/blog/2019/06/20/why-bank-transfers-failed-on-30th-may-2019/

This is a really interesting read from Monzo about a recent incident they had. I really enjoy reading their incident management writeups because they show a tonne of detail, yet are stakeholder-friendly.

It's always interesting to see how other banks deal with issues like this, and what they would do to make things better next time.

Recommended read: https://monzo.com/blog/2019/06/20/why-bank-transfers-failed-on-30th-may-2019/