Reading time: 5 minutes
Last year, a DevOps engineer friend called me at 2:17 a.m. on a Tuesday, in the middle of his first on-call night. The Slack alert read: “CRITICAL – API Gateway Response Time > 5s – Production.” He stared at a Grafana dashboard with 14 panels and understood only three of them. No one had told him that on-call isn’t about solving problems; it’s about knowing which problem to solve first. Here are three lessons from that night, lessons every cloud engineer should know before their first shift.
TL;DR
- Alert fatigue is the real on-call challenge: Teams receiving more than 100 alerts per week ignore roughly 30% of critical notifications (PagerDuty State of Digital Operations, 2025).
- Runbooks without regular reviews and game days are useless – they describe systems that have changed since the last update.
- A postmortem culture isn’t bureaucratic overhead – it’s the only way to learn systematically from incidents instead of repeating them.
2:17 a.m.: The moment nothing works the way the tutorials said it would
Let’s call my friend Max. He’d been on the team for six months. He’d deployed Kubernetes clusters, written Helm charts, and built CI/CD pipelines. He felt ready. What he didn’t know? Everything he’d learned applied to normal operations. On-call begins precisely where normal operations end.
The alert arrived via PagerDuty, routed to Slack, and simultaneously pushed to his phone. Within four minutes, seven more alerts followed – three CRITICAL, two WARNING, two INFO. Max had no idea which of those seven was the root cause – and which were cascading failures. His Grafana dashboard showed red lines everywhere, but he couldn’t tell whether the CPU spike was causing the latency – or merely a symptom of it.
After 20 minutes – and a panicked Slack call with the senior engineer – the issue was resolved: A pod had hit its memory limit and entered a restart loop. The fix? A single line change. But those 20 minutes of panic taught Max more about cloud operations than six months of routine work ever could.
Lesson 1: Signal vs. Noise – the real work happens before the incident
That night, Max received 11 alerts in 15 minutes. Exactly one was relevant. The other ten were either downstream errors – or alerts triggered by thresholds so low they fired on every minor load fluctuation.
What Max wishes he’d known earlier: Alert tuning isn’t a one-time setup task. It’s an ongoing process. Every alert that doesn’t lead to action is noise. Every alert that should trigger action – but gets missed – is a risk. Improving your signal-to-noise ratio is the most important work you do before your first incident – not building dashboards.
His tip: Ask your team which alerts over the past 30 days actually led to concrete action. Everything else belongs on the review list.
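If you want to put numbers behind that review, a small script can help. The sketch below is purely illustrative: it assumes a read-only PagerDuty REST API v2 key in a hypothetical PD_API_TOKEN environment variable, pulls the past 30 days of incidents, and counts them by title so the team can walk the list and mark which ones actually led to action. If you use a different alerting tool, the idea stays the same; only the API call changes.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
import os

import requests

# Assumed: a read-only PagerDuty REST API v2 key exported as PD_API_TOKEN.
PD_API_TOKEN = os.environ["PD_API_TOKEN"]


def fetch_incidents(days: int = 30) -> list[dict]:
    """Page through all incidents created in the past `days` days."""
    since = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    incidents, offset = [], 0
    while True:
        resp = requests.get(
            "https://api.pagerduty.com/incidents",
            headers={"Authorization": f"Token token={PD_API_TOKEN}"},
            params={"since": since, "limit": 100, "offset": offset},
            timeout=10,
        )
        resp.raise_for_status()
        page = resp.json()
        incidents.extend(page["incidents"])
        if not page.get("more"):
            return incidents
        offset += 100


if __name__ == "__main__":
    # Group by alert title: the noisiest entries are the first review candidates.
    counts = Counter(incident["title"] for incident in fetch_incidents())
    for title, count in counts.most_common(20):
        print(f"{count:4d}  {title}")
```

Counting by title is crude, but it is usually enough to surface the handful of alerts that generate most of the noise, and that is where the review conversation should start.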
Lesson 2: A runbook no one has reviewed isn’t a runbook
At 2:30 a.m., Max opened the runbook. It described an architecture that had been overhauled three months earlier. The microservice labeled “Single Point of Failure” no longer existed. The load balancer cited as the “primary health-check endpoint” had been replaced entirely. He was following documentation that actively misled him.
“An outdated runbook is more dangerous than no runbook at all. Without a runbook, you know you have nothing. With an outdated one, you think you have something – and follow instructions that no longer apply.”
– cloudmagazin editorial assessment
What Max learned: Game days aren’t just fun extras for teams with spare time. They’re the only way to verify that your runbooks work – before you need them in a crisis. His team now runs a simulated incident every six weeks. Since then, the runbook stays current – because each game day exposes exactly where documentation diverges from reality.
Lesson 3: A postmortem culture saves your sanity – and your reputation
The day after the incident, something unexpected happened: a structured postmortem. No finger-pointing. No blame. Just a clear timeline, root cause, contributing factors, and action items with owners and deadlines. The senior engineer used a five-section, one-page template he’d refined over three years.
The postmortem surfaced three blind spots no one had noticed:
First, the memory leak had existed for two weeks – but its associated alert was configured as “low priority.”
Second, the escalation path was unclear: Max didn’t know when he was allowed – or expected – to call the senior engineer instead of trying to resolve it himself.
Third, the onboarding process for new on-call members lacked a shadowing phase.
All three action items were implemented the following week. Since then, every new team member spends one week doing “shadow on-call” before joining the rotation alone. That’s one week per person. It saves months of frustration.
What Max does differently today
Roughly 50 on-call nights later, Max maintains three habits:
He reviews his alerts once per sprint – and deletes any that haven’t triggered meaningful action in the past 30 days.
He updates the runbook after every architecture change – not just after incidents.
He writes a postmortem after every incident – even if it’s “small.” Because small incidents are where you learn the most: they’re often early warnings of larger ones.
If you’re about to take your first on-call shift: You will make mistakes. That’s not the problem. The problem is when your team lacks the structure to learn from them. Ask about game days. Ask about incident-response processes. Ask about shadow on-call. If none of those exist? Build them. That’s the fastest path from junior engineer to the person your team calls at night: not because you know the best solution, but because you know the best process.
Frequently Asked Questions
When should a junior engineer join the on-call rotation?
Only after completing at least one week of shadowing and participating in a full game day. More important than technical knowledge is clarity around the escalation path: When do you escalate? To whom? Via which channel? Without that clarity, no one belongs in the rotation alone.
How many alerts per week are acceptable?
PagerDuty recommends fewer than 40 alerts per week per team as a general benchmark. But the number matters less than the ratio: More than 50% of alerts should lead to action. If that rate falls below 50%, alert tuning is overdue.
Which postmortem format works best?
The simplest one your team will actually use. A proven structure includes: Timeline, Root Cause, Contributing Factors, Impact, and Action Items (with owner and deadline). Keep it to one page max. Google and Atlassian publish their templates freely.
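If your team tracks postmortems as structured data rather than free-form documents, that five-section structure can be sketched in a few lines. The example below is a minimal illustration; the class and field names are ours, not part of any official Google or Atlassian template.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str        # exactly one owner per item
    deadline: date
    done: bool = False


@dataclass
class Postmortem:
    title: str
    timeline: list[str] = field(default_factory=list)  # timestamped events, oldest first
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    impact: str = ""
    action_items: list[ActionItem] = field(default_factory=list)

    def open_items(self) -> list[ActionItem]:
        """Action items that still need follow-up at the next review."""
        return [item for item in self.action_items if not item.done]
```

The point is not the code but the constraint it encodes: every action item has exactly one owner and one deadline, and the open items are trivial to list at the next review.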