Shared by kislayverma

Why on-call pain is a sociotechnical problem | LeadDev

[Highlight] It is engineering’s job to own their code in production. It is management’s job to make sure it doesn’t suck. This is a two-way handshake

[Highlight] On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

[Highlight] Responsibility for your code is increasingly non-optional

[Highlight] Our systems are becoming exponentially more complex, and feedback loops are tightening.

[Highlight] The point is that tightening these feedback loops is how we make systems better

[Highlight] Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth; how often is their team alerted outside of working hours?

[Highlight] Everything you support, and every alert that can fire, should have a team that owns it.

[Highlight] If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation

[Highlight] Alert when users are impacted, not before

[Highlight] You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less.

[Highlight] You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning

[Highlight] Software is owned by teams, not by individuals

[Highlight] If you have three to four people on call, that’s too much of your life spent lugging around a laptop.

[Highlight] five people is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months

[Highlight] Making on-call costs tangible
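
The rotation-size highlights above come down to simple arithmetic, which a short sketch can make concrete. The following is only an illustration under stated assumptions that are not in the article: a single primary on-call person at a time, equal-length shifts of one week by default, an evenly shared rotation, and a 52-week year. The function name `on_call_share` is hypothetical.

```python
def on_call_share(team_size: int, shift_weeks: int = 1) -> dict:
    """Estimate how much of each person's time is spent on call.

    Assumes (for illustration only) one primary on-call person at a time,
    equal-length shifts, and a rotation shared evenly across the team.
    """
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")

    # One full cycle is the time until the same person is on call again.
    cycle_weeks = team_size * shift_weeks

    return {
        "cycle_weeks": cycle_weeks,
        # Share of calendar time each person spends on call.
        "fraction_on_call": shift_weeks / cycle_weeks,
        # Approximate on-call weeks per person in a 52-week year.
        "weeks_per_year": 52 * shift_weeks / cycle_weeks,
    }


if __name__ == "__main__":
    for size in (4, 5, 8):
        print(size, on_call_share(size))
```

Under these assumptions, a team of eight gives each person one week on call in every eight, which matches the "one week out of every two months" figure in the highlight, while a team of four puts each person on call a quarter of the time.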
