Pavol Bincik

Reduce Alerts, Improve Response: Fix On-Call Fatigue

30% of security leaders cite alert fatigue as one of their biggest operational challenges, and most teams make it worse by adding more monitoring when incidents increase. The counterintuitive fix: fewer, smarter alerts will improve your incident response times faster than any on-call rotation restructuring you'll ever do.

This isn't about ignoring problems. It's about signal-to-noise engineering.


Why More Alerts Make You Slower

On-call engineers who receive high volumes of low-quality alerts don't become more vigilant. They become desensitized. This is documented behavior, not a character flaw. When your PagerDuty queue fires 47 times on a Tuesday and 43 of those are known noise, you stop treating alert number 44 with urgency. That's exactly when the real incident happens.

The operational math is brutal: teams with high false-positive rates take longer, not less time, to acknowledge critical incidents. A 2023 survey found that engineers receiving more than 10 non-actionable alerts per shift had a mean time to acknowledge (MTTA) nearly double that of teams with tightly tuned alert thresholds.

On-call burnout follows the same curve. Engineers don't quit because incidents are hard. They quit because they're being woken up at 2am for a disk usage warning that resolves itself by 2:03am.


The Shift From Quantity to Quality

Alert tuning isn't a one-time cleanup task. It's an ongoing engineering discipline with clear principles.

1. Every Alert Must Be Actionable

Before any alert ships to production, it should pass a single-question test: What does the on-call engineer do when this fires?

If the answer is "check if it resolves on its own" or "acknowledge and monitor," that alert should not page anyone. Demote it to a log entry, a dashboard metric, or a weekly digest. Reserve PagerDuty (or equivalent) for alerts where the response is unambiguous and time-sensitive.

Practical threshold: if an alert fires and requires no human action more than 20% of the time, it's a candidate for demotion or elimination.
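The 20% threshold can be checked against exported page history. The following is a minimal sketch, assuming each page has been labeled by hand with whether it required human action; the function name, the record shape, and the sample alert names are hypothetical, not part of any paging tool's API.

```python
from collections import defaultdict


def actionability_report(pages, max_no_action_rate=0.20):
    """Summarize how often each alert fired without needing human action.

    `pages` is an iterable of (alert_name, required_action) pairs, where
    required_action is True if the on-call engineer had to do something.
    """
    fired = defaultdict(int)
    no_action = defaultdict(int)
    for alert_name, required_action in pages:
        fired[alert_name] += 1
        if not required_action:
            no_action[alert_name] += 1

    report = {}
    for alert_name, count in fired.items():
        rate = no_action[alert_name] / count
        report[alert_name] = {
            "fired": count,
            "no_action_rate": rate,
            # Above the threshold: candidate for demotion or elimination.
            "demote": rate > max_no_action_rate,
        }
    return report


# Hypothetical, hand-labeled page history.
history = [
    ("disk_usage_warning", False),
    ("disk_usage_warning", False),
    ("disk_usage_warning", True),
    ("checkout_5xx_rate", True),
    ("checkout_5xx_rate", True),
]
report = actionability_report(history)
for name, stats in report.items():
    print(name, stats)
```

The labeling is the expensive part. The arithmetic only makes the demotion decision hard to argue with.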

2. Add Context at the Source

An alert that reads "High CPU on prod-web-03" forces the on-call engineer to open three dashboards before they understand scope. An alert that reads "prod-web-03 CPU at 94% for 8 minutes, serving 34% of checkout traffic, potential revenue impact" can be triaged in 15 seconds.

Context doesn't mean verbosity. It means including:

  • Current value and threshold breached
  • Duration of the condition
  • Business or user impact if known
  • Runbook link or suggested first step

Teams that implement context-rich alerts report 40% faster MTTA on critical incidents. The cognitive load reduction is real.
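One way to enforce those four fields is to build the alert text from a structured record rather than a free-form string. This is a sketch under assumptions: the `AlertContext` fields, the 90% threshold, and the runbook URL are hypothetical values chosen to mirror the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertContext:
    host: str
    metric: str
    value: float
    threshold: float
    unit: str
    duration_minutes: int
    impact: Optional[str] = None       # business or user impact, if known
    runbook_url: Optional[str] = None  # runbook link or suggested first step


def format_alert(ctx: AlertContext) -> str:
    """Render one line carrying value, threshold, duration, impact, runbook."""
    parts = [
        f"{ctx.host} {ctx.metric} at {ctx.value:g}{ctx.unit} "
        f"(threshold {ctx.threshold:g}{ctx.unit}) "
        f"for {ctx.duration_minutes} minutes"
    ]
    if ctx.impact:
        parts.append(ctx.impact)
    if ctx.runbook_url:
        parts.append(f"runbook: {ctx.runbook_url}")
    return ", ".join(parts)


message = format_alert(
    AlertContext(
        host="prod-web-03",
        metric="CPU",
        value=94,
        threshold=90,  # assumed threshold, for illustration only
        unit="%",
        duration_minutes=8,
        impact="serving 34% of checkout traffic",
        runbook_url="https://example.com/runbooks/high-cpu",  # placeholder
    )
)
print(message)
```

Making impact and runbook optional keeps the formatter usable for alerts where that information is not yet known, while the required fields stop a bare "High CPU" from shipping.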

3. Implement Alert Severity Tiers Rigorously

Most teams have severity levels in theory. Few enforce them in practice.

A workable three-tier model:

Tier | Criteria                                                    | Delivery
P1   | User-facing, revenue-impacting, requires immediate action   | Page on-call immediately
P2   | Degraded performance, non-critical path, trending toward P1 | Slack notification + ticket
P3   | Informational, no immediate action needed                   | Dashboard + daily digest

The discipline is in refusing to let P2s drift upward into pages. If a service isn't user-facing, it doesn't warrant waking someone up.
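Encoding the tiers as a function makes drift visible in code review instead of at 2am. This is a rough sketch of the three-tier model; the boolean inputs are a simplification, and real criteria would likely come from service metadata.

```python
from enum import Enum


class Tier(Enum):
    P1 = "page on-call immediately"
    P2 = "Slack notification + ticket"
    P3 = "dashboard + daily digest"


def classify(user_facing: bool, needs_immediate_action: bool,
             degraded: bool) -> Tier:
    """Assign a tier; only user-facing, urgent conditions may page."""
    if user_facing and needs_immediate_action:
        return Tier.P1
    if degraded:
        # Degraded but not on the critical path: notify, do not page.
        return Tier.P2
    return Tier.P3


# A degraded internal batch service never reaches P1, however urgent it
# feels, because it is not user-facing.
internal_batch = classify(user_facing=False, needs_immediate_action=True,
                          degraded=True)
checkout_down = classify(user_facing=True, needs_immediate_action=True,
                         degraded=True)
print(internal_batch.name, "->", internal_batch.value)
print(checkout_down.name, "->", checkout_down.value)
```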


The Infrastructure Categories You're Probably Over-Alerting On

DNS

DNS failures are largely preventable, and yet most teams have aggressive alerting on symptoms (timeouts, 5xx errors) while running zero proactive monitoring on DNS health itself. This creates a pattern where you get paged on downstream effects, often minutes after users started experiencing problems.

Continuous DNS monitoring with redundant resolvers eliminates most of this reactive noise. When you're monitoring DNS health proactively, you catch misconfigurations and propagation failures before they cascade. Teams using tools like PulseGuard for infrastructure monitoring often discover they were receiving 15-20 downstream alerts for what was a single DNS misconfiguration. With a proper monitoring architecture, that alert storm collapses into one actionable notification.
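The collapse from many downstream alerts to one notification can be sketched as dependency-based suppression, similar in spirit to inhibition rules in tools such as Prometheus Alertmanager. This is not PulseGuard's implementation; the alert names and the `dependencies` mapping are hypothetical.

```python
def collapse_alerts(firing, dependencies):
    """Split firing alerts into root causes and suppressed symptoms.

    `firing` is a list of alert names currently firing.
    `dependencies` maps an alert name to the upstream alert it depends on.
    An alert is suppressed when its upstream alert is also firing.
    """
    firing_set = set(firing)
    roots = []
    suppressed = {}
    for name in firing:
        upstream = dependencies.get(name)
        if upstream in firing_set:
            suppressed.setdefault(upstream, []).append(name)
        else:
            roots.append(name)
    return roots, suppressed


# Hypothetical storm: one DNS failure plus its downstream symptoms,
# and one unrelated alert.
firing = ["dns_resolution_failed", "checkout_timeouts", "api_5xx",
          "disk_full"]
dependencies = {
    "checkout_timeouts": "dns_resolution_failed",
    "api_5xx": "dns_resolution_failed",
}
roots, suppressed = collapse_alerts(firing, dependencies)
print("notify:", roots)
print("suppressed:", suppressed)
```

This only works when a DNS health alert exists to act as the root, which is the argument for monitoring DNS directly.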

SSL Certificates

SSL expiry alerts are a classic alert fatigue contributor. Teams set aggressive early-warning thresholds (90 days, 60 days, 30 days) and end up with recurring alerts that never get actioned because the expiry feels distant. When the real deadline hits, the alert has been mentally filed as background noise.

Better approach: one alert at 30 days, escalating at 7 days. Automate renewal where possible. Cut the intermediate noise entirely.
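The two-step schedule reduces to a small state function. A minimal sketch, assuming the certificate's expiry time has already been obtained elsewhere; the function names and level strings are illustrative.

```python
from datetime import datetime, timezone


def cert_alert_level(not_after, now=None, warn_days=30, escalate_days=7):
    """Return "escalate", "warn", or None for a certificate expiry time."""
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).days
    if days_left <= escalate_days:
        return "escalate"
    if days_left <= warn_days:
        return "warn"
    return None


def should_notify(previous_level, current_level):
    """Notify only when the level changes, so each stage fires once."""
    return current_level is not None and current_level != previous_level


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
far = cert_alert_level(datetime(2024, 3, 1, tzinfo=timezone.utc), now=now)
warn = cert_alert_level(datetime(2024, 1, 31, tzinfo=timezone.utc), now=now)
urgent = cert_alert_level(datetime(2024, 1, 8, tzinfo=timezone.utc), now=now)
print(far, warn, urgent)
```

Storing the previous level per certificate and calling `should_notify` on each check is what prevents a daily reminder from turning back into background noise.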

Uptime Monitoring

Five-minute polling intervals that alert on single-check failures produce significant false-positive rates due to transient network conditions. Require two or three consecutive failures before alerting. This one change can cut false positives by 30-40% for most infrastructure setups.
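The consecutive-failure rule is a small piece of state per monitored endpoint. The sketch below uses a hypothetical class name and a default of three failures; it alerts once per outage and resets on the first successful check.

```python
class ConsecutiveFailureGate:
    """Suppress alerts until N checks in a row have failed."""

    def __init__(self, required_failures=3):
        self.required_failures = required_failures
        self.streak = 0
        self.alerting = False

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_ok:
            # Any success clears the streak and ends the current outage.
            self.streak = 0
            self.alerting = False
            return False
        self.streak += 1
        if self.streak >= self.required_failures and not self.alerting:
            self.alerting = True
            return True
        return False


gate = ConsecutiveFailureGate(required_failures=3)
# One transient blip, then a real outage of four failed checks.
results = [True, False, True, False, False, False, False, True]
decisions = [gate.record(ok) for ok in results]
print(decisions)
```

The trade-off is detection delay: with five-minute polling, three consecutive failures means the alert fires roughly ten minutes after the first failed check, so a shorter polling interval may be worth pairing with this rule for P1 endpoints.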


Status Pages as Incident Response Infrastructure

Alert tuning is about what happens inside your team. Status pages are about what happens outside, and they're an underused tool for reducing alert volume indirectly.

When users have no authoritative source of information during an incident, they file support tickets, send emails, post on social media, and message sales contacts. This creates a secondary workload that lands on engineers who are already triaging the primary incident.

A status page that is updated proactively, every 30 minutes during active incidents and in honest, specific language, eliminates most of this inbound noise. Teams that maintain public status pages report 20-30% reductions in support ticket volume during incidents.

The key word is proactively. A status page that only updates after users have already noticed the problem provides no noise reduction benefit. The goal is to answer the question before it gets asked.

PulseGuard's status page functionality is built around this workflow, making it fast to post updates during an incident when your team has limited bandwidth and high cognitive load.


Practical Takeaways

Start here this week:

  1. Audit your last 30 days of pages. For each alert, record whether it required immediate human action. Anything below 80% actionability rate is a tuning target.

  2. Add one piece of business context to your five noisiest alerts: user impact, revenue path, or affected traffic percentage. Measure MTTA before and after.

  3. Set up proactive DNS monitoring separate from your application health checks. DNS issues should never surface first as application timeouts.

  4. Create or update your public status page and document an internal process for posting updates within 15 minutes of incident declaration.

  5. Review your P1/P2 boundary. If more than 20% of your P1 alerts aren't causing engineers to take immediate action, your tier definitions have drifted.
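For the before-and-after MTTA comparison in step 2, the calculation is a mean over trigger-to-acknowledge intervals. A minimal sketch, assuming incident timestamps can be exported as pairs of datetimes; the sample values are made up for illustration.

```python
from datetime import datetime
from statistics import mean


def mtta_seconds(incidents):
    """Mean time to acknowledge, in seconds.

    `incidents` is an iterable of (triggered_at, acknowledged_at) datetimes.
    """
    deltas = [(ack - trig).total_seconds() for trig, ack in incidents]
    if not deltas:
        raise ValueError("no incidents to measure")
    return mean(deltas)


# Illustrative data only: two incidents before tuning, two after.
before = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 5)),
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 14, 10)),
]
after = [
    (datetime(2024, 2, 1, 2, 0), datetime(2024, 2, 1, 2, 2)),
    (datetime(2024, 2, 4, 9, 0), datetime(2024, 2, 4, 9, 4)),
]
mtta_before = mtta_seconds(before)
mtta_after = mtta_seconds(after)
print(f"MTTA before: {mtta_before:.0f}s, after: {mtta_after:.0f}s")
```

Compare like with like: restrict both windows to the same severity tier, since a change in the mix of P1 and P2 incidents can move the mean on its own.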

Alert fatigue isn't a people problem. It's an instrumentation problem. Fix the instrumentation.