How I Cut My On-Call Stress in Half

For a long time, the worst part of my week was the moment I picked up the on-call phone.

Not the pages themselves. The waiting for them. That low hum of dread that follows you to dinner, into bed, onto the one weekend you finally took off. You're never fully off. Your nervous system knows it.

I cut that stress roughly in half. Not by being braver, and not by working harder during incidents. By changing a handful of unglamorous things. Here's exactly what they were.

Quick Answer

On-call stress comes less from incidents and more from uncertainty, noise, and being alone with it. You cut it by killing low-value alerts, writing runbooks so 3am-you doesn't have to think, making the on-call load visible to the people who can fix it, and treating recovery time as part of the job, not a luxury. None of it is heroics. All of it is boring, and the boring stuff is what works.

The dread was the real problem, not the pages

I tracked every page for a month. I was paged far less than my anxiety implied. Most weeks brought a couple of real incidents and a pile of noise.

So why did it feel constant? Because the anticipation never stopped. Stress isn't really about the incident. It's about the unpredictability of the incident, plus the feeling that when it comes, you'll be alone, half-asleep, and unsure what to do.

Once I understood that, the fix got clearer. I didn't need to handle incidents better. I needed to attack the uncertainty around them. Three levers: fewer surprises, less thinking required when surprised, and not being alone. Learning to walk calmly toward operational pain instead of running from it is a big part of the brutal truth about becoming a senior developer.

Image: A person looking stressed in front of a laptop screen at night. Photo by John Schnobrich on Unsplash.

I declared war on noisy alerts

This was the single biggest win, by a mile.

When I actually audited our alerts, most of them were garbage. Not wrong exactly — just useless. Things that fired and then resolved themselves before I'd even rubbed my eyes. Things that paged a human for a problem no human could act on. A disk warning at 71% that had been at 71% for a year.

Every one of those alerts was a withdrawal from my trust and my sleep. And worse, the noise trained me to ignore pages, which is how the real one slips through.
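
If you want to run the same kind of audit, here is a minimal sketch in Python. It assumes you can export your page history as simple records. The Page fields, the five-minute flap threshold, and the alert names in the example are all hypothetical placeholders, not the format of any particular paging tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    alert: str                 # alert name (hypothetical examples below)
    minutes_to_resolve: float  # time from firing to resolution
    human_action: bool         # did a person actually have to do something?


def audit(pages: List[Page], flap_minutes: float = 5.0) -> Dict[str, Dict[str, int]]:
    """Count, per alert, how many pages needed no action or resolved themselves quickly."""
    summary: Dict[str, Dict[str, int]] = {}
    for p in pages:
        s = summary.setdefault(p.alert, {"total": 0, "no_action": 0, "self_resolved": 0})
        s["total"] += 1
        if not p.human_action:
            s["no_action"] += 1
            # Assumed definition of a flap: no human action and a fast recovery.
            if p.minutes_to_resolve <= flap_minutes:
                s["self_resolved"] += 1
    return summary


def noise_ratio(pages: List[Page]) -> float:
    """Fraction of all pages that required no human action."""
    if not pages:
        return 0.0
    return sum(1 for p in pages if not p.human_action) / len(pages)


history = [
    Page("cpu-spike-flap", 2.0, False),
    Page("cpu-spike-flap", 3.5, False),
    Page("disk-71-percent", 60.0, False),
    Page("checkout-errors", 45.0, True),
]
print(audit(history))
print(f"{noise_ratio(history):.0%} of pages required no action")
```

The per-alert counts tell you which alerts to demote first, and the overall ratio is the number to bring to your lead.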

So I ran every alert through one brutal question:

If this fires at 3am, is there an action a human must take right now? If not, it is not a page.

Everything that failed that test got demoted. Some became dashboard metrics. Some became next-morning tickets. Some I deleted entirely. A few got auto-remediated with a small script, so the system fixed itself and told me in the morning.

My page volume dropped by more than half. And the pages that remained actually meant something, which paradoxically made each one less stressful — because I trusted that if the phone buzzed, it was real.
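
For the auto-remediated alerts, the shape I'd suggest looks roughly like the sketch below. It is not my exact script: the AutoRemediator class, the attempt limit, and the alert name are illustrative assumptions. The idea is an explicit allowlist of safe fixes, a cap on retries so a failing fix can't loop all night, and a morning report of everything it did.

```python
import time
from typing import Callable, Dict, List


class AutoRemediator:
    """Run a known-safe fix for allowlisted alerts; page a human for everything else."""

    def __init__(
        self,
        actions: Dict[str, Callable[[], bool]],  # alert name -> fix that returns True on success
        max_attempts: int = 2,                   # assumed cap per alert per window
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.actions = actions
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        self.attempts: Dict[str, List[float]] = {}
        self.morning_report: List[str] = []

    def handle(self, alert: str) -> str:
        """Return "fixed" if the alert was auto-remediated, otherwise "page"."""
        action = self.actions.get(alert)
        if action is None:
            return "page"  # not on the allowlist: a human decides

        now = self.clock()
        recent = [t for t in self.attempts.get(alert, []) if now - t < self.window_seconds]
        if len(recent) >= self.max_attempts:
            self.morning_report.append(f"{alert}: auto-fix limit reached, paged a human")
            return "page"

        recent.append(now)
        self.attempts[alert] = recent
        try:
            ok = action()
        except Exception as exc:  # a broken fix must never hide the alert
            self.morning_report.append(f"{alert}: auto-fix raised {exc!r}, paged a human")
            return "page"

        if ok:
            self.morning_report.append(f"{alert}: auto-fixed")
            return "fixed"
        self.morning_report.append(f"{alert}: auto-fix failed, paged a human")
        return "page"


def restart_worker() -> bool:
    # Placeholder for a real, non-destructive fix such as restarting a known-flaky worker.
    return True


remediator = AutoRemediator({"queue-worker-stalled": restart_worker})
print(remediator.handle("queue-worker-stalled"))
print(remediator.handle("database-unreachable"))
print(remediator.morning_report)
```

Anything not on the allowlist, anything that keeps recurring, and any fix that fails still reaches the pager, so the script removes trivial wake-ups without hiding real problems.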

Here's the filter I now apply to every alert:

Alert type	Action	Where it goes
Human must act now	Keep as page	On-call phone
Important but can wait	Demote	Morning ticket queue
Informational trend	Demote	Dashboard only
Self-resolving flap	Suppress or auto-fix	Nowhere / script
Nobody knows what it means	Delete	Gone
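
The table above can be written down as a tiny decision function, which is handy if you want to review alerts in bulk. This is a sketch under my own assumptions: the four yes/no questions, their names, and the order they are checked in are my reading of the table, not a standard.

```python
from enum import Enum


class Route(Enum):
    PAGE = "on-call phone"
    TICKET = "morning ticket queue"
    DASHBOARD = "dashboard only"
    SUPPRESS = "suppress or auto-fix"
    DELETE = "delete"


def triage(
    understood: bool,         # does anyone know what this alert means?
    self_resolving: bool,     # does it clear on its own before anyone can act?
    needs_human_now: bool,    # must a person act right now, even at 3am?
    needs_human_later: bool,  # important, but it can wait until morning
) -> Route:
    """Map one alert to a row of the filter table."""
    if not understood:
        return Route.DELETE
    if self_resolving:
        return Route.SUPPRESS
    if needs_human_now:
        return Route.PAGE
    if needs_human_later:
        return Route.TICKET
    return Route.DASHBOARD


# The disk warning that sat at 71% for a year: understood, stable, no action needed.
print(triage(understood=True, self_resolving=False,
             needs_human_now=False, needs_human_later=False).value)
```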

I wrote runbooks for 3am-me, who is an idiot

Awake-me is competent. 3am-me, jolted from deep sleep, is not. 3am-me forgets how to spell, panics, and makes things worse.

The fix is to assume 3am-me is a stranger who knows nothing and write down the steps for them.

For every alert that survived the cull, I wrote a short runbook. Not an essay. A checklist:

What this alert actually means, in one plain sentence.
The first command to run to confirm what's happening.
The most common cause and the fix.
The "if that didn't work" branch.
Who to escalate to, with their actual name.
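
If you'd rather keep runbooks as data than as loose documents, the checklist above maps onto a small record type. This is a sketch only: the Runbook class and its field names are mine, and the alert, command, and contact in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Runbook:
    alert: str
    meaning: str               # one plain sentence
    first_command: str         # the first thing to run to confirm what's happening
    common_cause_and_fix: str
    fallback: str              # the "if that didn't work" branch
    escalate_to: str           # an actual name, not a team alias

    def missing_fields(self) -> List[str]:
        """Names of any fields left blank, so incomplete runbooks are easy to spot."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Format the runbook as a short card that fits on one screen."""
        return "\n".join([
            f"ALERT: {self.alert}",
            f"1. What it means: {self.meaning}",
            f"2. Run first: {self.first_command}",
            f"3. Usual cause and fix: {self.common_cause_and_fix}",
            f"4. If that didn't work: {self.fallback}",
            f"5. Escalate to: {self.escalate_to}",
        ])


# Hypothetical example values.
card = Runbook(
    alert="queue-worker-stalled",
    meaning="The background worker has stopped pulling jobs off the queue.",
    first_command="systemctl status queue-worker",
    common_cause_and_fix="Worker hung on a stuck job; restart the service.",
    fallback="Check whether the queue itself is reachable, then escalate.",
    escalate_to="Priya (owner of the queue service)",
)
print(card.render())
print(card.missing_fields())
```

The missing_fields check matters more than the formatting: a runbook with a blank escalation name is the one that fails you at 3am.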

The relief was instant. The page stopped being "oh no, what is this" and became "ah, this one, I have the card for it." I'd taken the thinking out of the worst possible moment to think.

The bonus: writing the runbook forced me to actually understand each alert in daylight, which meant I fixed the root causes behind several services I'd been blindly restarting for months. That's the same shift I describe in the debugging method that changed how I work: replacing panic with a calm, repeatable procedure. Google's web.dev guidance on reliability makes the same argument: clear, observable signals beat heroics every time.

I made the on-call load impossible to ignore

For a long time, on-call pain was invisible. I'd suffer through a brutal night and the team would never know. The backlog of "this keeps paging us" never got prioritized, because it lived in my private misery, not in anyone's planning.

So I made it loud. Every incident, even small ones, got a quick note in a shared channel. Every recurring page became a ticket with a label, and I brought the count of those pages to planning every single week.

An incident that only the on-call engineer feels will never get fixed. Make the pain shared and the org will help you kill it.

Suddenly the team could see that the same flaky service paged someone four times last month. That's a number a manager can act on. Reliability work stopped being a thing I begged for and became an obvious priority, because the cost was finally visible.
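
Getting that number doesn't need special tooling. Here is a minimal sketch that counts recurring pages per service from a plain log. The log format (service name plus date) and the service names are assumptions for illustration; adapt them to whatever your paging tool exports.

```python
from collections import Counter
from typing import List, Tuple


def recurring_pages(page_log: List[Tuple[str, str]], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (service, page count) pairs for services that paged at least min_count times,
    most frequent first."""
    counts = Counter(service for service, _date in page_log)
    return [(service, n) for service, n in counts.most_common() if n >= min_count]


# Hypothetical month of pages: (service, date).
last_month = [
    ("checkout-api", "2024-03-02"),
    ("search-indexer", "2024-03-05"),
    ("checkout-api", "2024-03-09"),
    ("checkout-api", "2024-03-17"),
    ("checkout-api", "2024-03-28"),
]

for service, n in recurring_pages(last_month):
    print(f"{service} paged someone {n} times last month")
```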

This is the part people skip. You cannot quietly absorb on-call pain and also expect it to improve. The absorbing is exactly what prevents the improving.

Image: A clean data dashboard with charts and metrics. Photo by Luke Chesser on Unsplash.

I started treating recovery as part of the job

The last piece was personal, and I resisted it the longest.

If a page wrecked my sleep, I stopped pretending the next day was normal. I'd take the morning, or start late, or skip the non-essential meetings. Not as a treat. As maintenance. A tired engineer makes the next incident worse.

I also asked for, and got, a few sane norms that I'd recommend to anyone:

A real handoff. Five minutes at the start of each rotation, covering anything currently weird. Walking in blind is half the dread.
A backup. A secondary who can step in if the primary is drowning or asleep on their feet. Just knowing they exist lowers the baseline anxiety.
Comp time after rough nights. If on-call eats your night, it shouldn't also eat your day. This has to be cultural, not heroic.

None of this is soft. A well-rested on-call engineer resolves incidents faster and breaks less. Recovery isn't the opposite of reliability. It's part of it.

I'd add one more habit that quietly lowered my baseline dread: the blameless post-incident note. After anything real, I'd write a short, calm summary — what happened, what I did, what would have made it easier next time. Not to assign fault. To turn a stressful night into a permanent improvement, so the same surprise never costs the next person a full night again. Over a few months, those notes became a library of "we already solved this," and the unknown — the thing that actually drives the dread — kept shrinking.

FAQ

Q: What if my company won't let me delete alerts? Start by demoting, not deleting: move noise off the pager and onto a dashboard. Bring the data on alert volume to your lead. "We page a human X times a week and Y% require no action" is an argument that wins. Numbers move people in ways complaints don't.

Q: Isn't auto-remediation risky? It can be, so start tiny and safe: clearing a cache, restarting a known-flaky worker, rotating a log. Anything destructive stays human. The goal is to remove the trivial 3am wake-ups, not to automate judgment.

Q: How do I write runbooks without huge time investment? Write them lazily, one alert at a time, the morning after each page. You'll have the freshest context right after the incident. Keep each one to a screen. A rough runbook beats a perfect missing one.

Q: My team is too small for a secondary. What then? Even an informal "if you're truly stuck, you're allowed to wake me" agreement with one teammate changes the psychology. The point is removing the feeling of being utterly alone, which doesn't require a big rotation.

Q: Does monitoring automation actually reduce stress or just add tools? It reduces stress only if it reduces noise. More dashboards to watch only makes things worse. Automation that quietly fixes the trivial stuff and pages you only for the real thing is the goal.

The bottom line

I used to think being good at on-call meant being a hero — calm under fire, awake at any hour, grinding through whatever the system threw at me.

It's the opposite. Good on-call is boring. It's quiet pagers, clear runbooks, visible pain, and engineers who are allowed to rest.

The goal of on-call isn't to be a hero at 3am. It's to make 3am uneventful.

I didn't get tougher. I made the system kinder, and that turned out to be the same thing as making it more reliable.

If even one of these changes spares you a single 3am wake-up, it was worth writing. Try the alert cull first, and keep reading the senior-developer notes if you want the rest of the unglamorous reliability playbook.

If on-call is grinding you down, you don't need more grit. You need fewer alerts, better runbooks, and the nerve to make the pain visible. Which of those could you start this week?