Efficient On-Call Practices For SREs

Schedule With Respect (Better Rotations)

I’ve been in positions where I was on-call 14–15 days out of the month. As you can imagine, that’s not a great quality of life. The problem wasn’t really being on call two weeks out of the month; I typically have my laptop or iPad with me anyway, so carrying it around wasn’t a big deal. The problem was that the same issues kept recurring and management didn’t want to take the time to fix them, even though doing so would have reduced tech debt.

Define Escalation Paths

One of my first jobs out of school was as a Support Engineer. The responsibilities were broad, and one of them was being on-call. Since I had just come out of school, I barely knew how to work a backup server, let alone fix an application or system that went down after hours. The worst part was that the escalation path led to VPs and C-levels (it was a small company, but not that small). As you can imagine, they weren’t pleased to have issues escalated to them (which raises the question: why be an escalation point?).

  • Who’s the team lead or code owner for that application or service?
  • Who are the most senior engineers? They should be the last escalation point, for when no one else can solve the issue.
  • Should entry-level engineers actually be on-call?
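
One way to make the answers to these questions explicit is to write the escalation path down as an ordered policy. The following is a minimal sketch, not tied to any particular paging tool; the role names, the timeouts, and the `who_to_page` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    role: str          # who gets paged at this step (hypothetical role names)
    wait_minutes: int  # how long to wait for an acknowledgement before moving on


# Hypothetical policy: the on-call engineer first, then the service owner,
# and senior engineers only as the last resort. No VPs or C-levels.
ESCALATION_PATH = [
    Tier("primary on-call engineer", 15),
    Tier("team lead / code owner", 15),
    Tier("senior engineer", 30),
]


def who_to_page(minutes_unacknowledged: int, path=ESCALATION_PATH) -> str:
    """Return the role to page after an alert has gone unacknowledged this long."""
    elapsed = 0
    for tier in path:
        elapsed += tier.wait_minutes
        if minutes_unacknowledged < elapsed:
            return tier.role
    # Past every timeout: stay on the last tier rather than inventing a new one.
    return path[-1].role
```

With this policy, `who_to_page(5)` returns the primary on-call engineer and `who_to_page(20)` returns the team lead or code owner. Whether an entry-level engineer belongs in the first tier is a staffing decision that the list makes visible rather than answers.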

Handle It With Automation

Even though everyone throws the word automation at everything, two cornerstones of tech still lack it: networking and on-call.
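
As one possible illustration of what automating on-call could look like, the sketch below tries a known fix before waking anyone up, and flags alerts that keep recurring so the root cause gets fixed instead of being restarted away. The `handle_alert` function, the `MAX_SILENT_FIXES` threshold, and the returned action strings are assumptions made for this example, not part of any real tool.

```python
from collections import Counter
from typing import Callable, Dict

# Hypothetical threshold: after this many automated fixes for the same alert,
# also open a ticket so the underlying problem gets addressed.
MAX_SILENT_FIXES = 3


def handle_alert(
    alert: str,
    runbooks: Dict[str, Callable[[], bool]],
    fix_counts: Counter,
) -> str:
    """Try the automated fix for an alert and return the action taken."""
    fix = runbooks.get(alert)
    # No known automated fix, or the fix failed: a human has to look.
    if fix is None or not fix():
        return "page on-call"
    fix_counts[alert] += 1
    # The fix worked, but the alert keeps coming back: surface the tech debt.
    if fix_counts[alert] > MAX_SILENT_FIXES:
        return "auto-fixed, ticket opened for root cause"
    return "auto-fixed"
```

Each entry in `runbooks` is a function that attempts a fix and returns `True` on success, for example a wrapper around a service restart. Counting the automated fixes keeps the automation from hiding a recurring issue.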

Managers — Keep A Cool Head

Two of the biggest problems bad managers have are:

  • They panic when things get serious
  • They try to go on witch hunts and point fingers

Pay Overtime

Money doesn’t buy back time, and it definitely doesn’t make waking up at 2:00 AM any easier, but it softens the blow a little. As an SRE, you know you’ll have to be on-call at some point. The question is: are there better on-call options? The answer is yes.

  • Employees are more inclined to actually fix an issue instead of doing the bare minimum to get back to bed.
  • Employees want to put in the effort to fix the issue because they have a cash incentive.
  • Management becomes far more deliberate about who is on-call and what they get alerted for, because every page now has a cost. That makes the employee’s life much better, and it saves the organization money to fix the underlying code instead of just restarting a service, which ultimately decreases tech debt.

Michael Levan

Leader in Kubernetes consulting, research, and content creation ┇AWS Community Builder (Dev Tools Category)┇ HashiCorp Ambassador