Alert fatigue is how critical issues slip past otherwise sharp teams. The trouble isn’t that alerts are bad; it’s that noise hides signal. Cutting the noise without losing the events that matter takes a specific kind of discipline.
How fatigue actually develops
It’s not one loud alert that does it. It’s 200 alerts a week where 190 are informational or false, and the team starts triaging by acknowledging instead of investigating. When a real alert arrives, it looks exactly like the 190 noise alerts.
Rule 1: alerts are promises
An alert must represent something the receiver needs to act on, now. If it doesn’t, it’s not an alert; it’s a dashboard tile, a ticket, or a report.
If you wouldn’t wake someone up for it, it’s not a page.
Rule 2: delete alerts that haven’t fired meaningfully in 90 days
Every monitor you set up is a promise to yourself to maintain it. If a monitor hasn’t fired meaningfully in 90 days, either the underlying problem went away (delete the monitor) or the threshold is wrong (fix the monitor).
Dead monitors accumulate and eventually the team stops trusting any of them.
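A minimal sketch of the 90-day check, assuming you can export each monitor's name and the last time it fired meaningfully. The `Monitor` class and its `last_meaningful_fire` field are hypothetical names for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Monitor:
    name: str
    # Last time the monitor fired and someone acted on it; None if it never has.
    last_meaningful_fire: Optional[datetime]


def stale_monitors(monitors, now, max_age_days=90):
    """Return monitors with no meaningful firing inside the window."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        m for m in monitors
        if m.last_meaningful_fire is None or m.last_meaningful_fire < cutoff
    ]


# Fixed date so the example is reproducible; use datetime.now() in practice.
now = datetime(2024, 6, 1)
inventory = [
    Monitor("disk-full-prod", datetime(2024, 5, 20)),
    Monitor("legacy-queue-depth", datetime(2023, 11, 2)),
    Monitor("cert-expiry", None),
]

for m in stale_monitors(inventory, now):
    print(f"review: delete or fix threshold -> {m.name}")
```

The function only produces the review list. Whether each entry gets deleted or re-thresholded is still a human call.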
Rule 3: group alerts by root cause
If one network blip generates 50 alerts, that’s one alert with 50 confirmations, not 50 alerts. Group by root cause at the alerting layer so the pager fires once.
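One way to sketch the grouping, assuming every alert already carries a correlation key that identifies its likely root cause. The `root_cause` field and the 5-minute window are assumptions for illustration; real alerting layers derive the key from labels or topology.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str
    root_cause: str   # hypothetical correlation key, e.g. a shared upstream device
    timestamp: float  # seconds since epoch


def group_by_root_cause(alerts, window_seconds=300):
    """Collapse alerts sharing a root-cause key into one incident per time window."""
    incidents = []
    open_incident = {}  # root_cause -> most recent incident for that cause
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_incident.get(alert.root_cause)
        # Start a new incident if none exists or the window has elapsed.
        if current is None or alert.timestamp - current["first_seen"] > window_seconds:
            current = {
                "root_cause": alert.root_cause,
                "first_seen": alert.timestamp,
                "sources": [],
            }
            incidents.append(current)
            open_incident[alert.root_cause] = current
        current["sources"].append(alert.source)
    return incidents


# One network blip seen by 50 hosts, plus one unrelated disk alert.
alerts = [Alert(f"host-{i}", "network:core-switch-1", 1000.0 + i) for i in range(50)]
alerts.append(Alert("prod-db-02", "disk:prod-db-02", 1010.0))

for incident in group_by_root_cause(alerts):
    print(incident["root_cause"], "confirmations:", len(incident["sources"]))
```

The pager fires once per incident, and the 50 sources travel with it as confirmations.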
Rule 4: tier severity honestly
- P1 (page): user-visible impact, revenue loss, or SLA breach imminent.
- P2 (urgent ticket): team-visible issue, not user-visible, but needs attention today.
- P3 (ticket): operational hygiene, fix this week.
- P4 (report): informational, bundled weekly.
Calibrate honestly. If your P1s fire 20 times a week, they’re not P1s.
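The tiers above can be written down as a small decision function, which makes it harder to quietly promote everything to P1. This is a sketch under the assumption that someone can answer the impact questions as yes/no; the parameter names are illustrative.

```python
from enum import Enum


class Tier(Enum):
    P1 = "page"
    P2 = "urgent ticket"
    P3 = "ticket"
    P4 = "report"


def classify(*, user_visible=False, revenue_loss=False, sla_breach_imminent=False,
             needs_attention_today=False, needs_fix_this_week=False):
    """Map impact answers to a tier, checking the most severe conditions first."""
    if user_visible or revenue_loss or sla_breach_imminent:
        return Tier.P1
    if needs_attention_today:
        return Tier.P2
    if needs_fix_this_week:
        return Tier.P3
    return Tier.P4


print(classify(user_visible=True).value)           # page
print(classify(needs_attention_today=True).value)  # urgent ticket
print(classify().value)                            # report
```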
Rule 5: add context to every alert
An alert without context is “prod-db-02 is mad.” An alert with context is “prod-db-02 CPU > 90% for 5m. Recent deploys: none. Recent traffic: +30% over baseline. Related logs: [link].”
The second one resolves in half the time.
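A rough sketch of building the contextual message, assuming the deploy list, traffic delta, and log link are looked up elsewhere and passed in. The `AlertContext` fields and the example URL are placeholders, not a real integration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertContext:
    host: str
    condition: str
    recent_deploys: List[str] = field(default_factory=list)
    traffic_delta_pct: float = 0.0  # change relative to baseline, in percent
    logs_url: str = ""


def render(ctx):
    """Render one alert line that carries its own triage context."""
    deploys = ", ".join(ctx.recent_deploys) if ctx.recent_deploys else "none"
    return (
        f"{ctx.host} {ctx.condition}. "
        f"Recent deploys: {deploys}. "
        f"Recent traffic: {ctx.traffic_delta_pct:+.0f}% over baseline. "
        f"Related logs: {ctx.logs_url}"
    )


ctx = AlertContext(
    host="prod-db-02",
    condition="CPU > 90% for 5m",
    traffic_delta_pct=30.0,
    logs_url="https://logs.example.invalid/prod-db-02",  # placeholder URL
)
print(render(ctx))
```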
Rule 6: review the alert inventory monthly
- Every monitor has an owner
- Every monitor has a documented remediation
- Every monitor’s last-fired timestamp is visible
- Monitors with firing counts incompatible with their severity get re-tiered
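The checklist above can be run as a script over an exported inventory. This sketch assumes a flat record per monitor; the field names, the runbook path, and the per-tier ceilings are all placeholders to replace with your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorRecord:
    name: str
    tier: str                       # "P1" through "P4"
    owner: Optional[str]
    remediation_doc: Optional[str]
    last_fired: Optional[str]       # ISO date, or None if the timestamp is not tracked
    fires_last_30d: int


def audit(records, max_fires_by_tier):
    """Return (monitor name, finding) pairs for the monthly review."""
    findings = []
    for r in records:
        if not r.owner:
            findings.append((r.name, "no owner"))
        if not r.remediation_doc:
            findings.append((r.name, "no documented remediation"))
        if r.last_fired is None:
            findings.append((r.name, "last-fired timestamp not visible"))
        limit = max_fires_by_tier.get(r.tier)
        if limit is not None and r.fires_last_30d > limit:
            findings.append(
                (r.name, f"fired {r.fires_last_30d}x in 30d, re-tier from {r.tier}")
            )
    return findings


# Placeholder ceilings, not recommendations: pick values that fit your rotation.
limits = {"P1": 4, "P2": 20}
records = [
    MonitorRecord("api-5xx-rate", "P1", "team-api", "runbooks/api-5xx", "2024-05-28", 2),
    MonitorRecord("cache-evictions", "P1", None, None, "2024-05-30", 85),
]

for name, finding in audit(records, limits):
    print(f"{name}: {finding}")
```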
What a healthy alert volume looks like
- 3-7 P1s per week per on-call rotation, with clear user impact each
- 15-30 P2s per week, triaged during business hours
- 50-100 P3/P4s per week, auto-deduplicated and summarized
If you’re getting 200 pages a week, the problem isn’t on-call; it’s that you’re using pages as tickets.
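To keep an eye on this, the weekly counts can be compared against the ranges listed above. A small sketch, assuming you already have per-tier counts for one on-call rotation; the function name and dictionary keys are illustrative.

```python
# Weekly ranges taken from the list above: (low, high) per tier group.
HEALTHY_WEEKLY = {"P1": (3, 7), "P2": (15, 30), "P3/P4": (50, 100)}


def volume_report(weekly_counts):
    """Label each tier's weekly count relative to the healthy range."""
    report = {}
    for tier, (low, high) in HEALTHY_WEEKLY.items():
        count = weekly_counts.get(tier, 0)
        if count > high:
            report[tier] = "above range"
        elif count < low:
            # Below range is reported for visibility; it is not necessarily a problem.
            report[tier] = "below range"
        else:
            report[tier] = "in range"
    return report


print(volume_report({"P1": 20, "P2": 22, "P3/P4": 60}))
```

A tier that sits above its range for several weeks is a candidate for re-tiering rather than for more on-call staffing.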
The cultural piece
Reducing alert fatigue is as much a culture problem as a tooling problem. The team has to actually delete monitors. The on-call rotation has to own the quality of the queue it receives. Management has to stop reading “fewer alerts” as “less monitoring”; it’s the opposite.