Automation · 3 min read

From alerts to auto-fix: building self-healing IT systems

Alerts that only notify you about a problem are half a solution. Here's how teams use LynxTrac automations to turn alerts into auto-remediation.

Self-healing systems close the loop — detect, act, verify — without waking a human except when judgment is actually required.

What “self-healing” actually means

It does NOT mean “the system magically fixes itself.” It means:

  1. Known, bounded failure modes have pre-written remediation.
  2. The platform detects the failure, applies the remediation, verifies success.
  3. Humans are paged only if the remediation fails, or if the failure is novel.

The architecture

Four components:

  • Detector. Monitor that fires on a specific condition.
  • Remediator. Script or API call that attempts a fix.
  • Verifier. A second check that confirms the fix worked.
  • Escalator. Pager / ticket if the verifier fails, or if the remediation has already been tried N times in a window.
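The wiring between the four components can be sketched as a single pass of the loop. Everything below (function names, the `attempts` list, the outcome strings) is illustrative, not a LynxTrac API — a minimal sketch of the pattern, with the circuit breaker built in:

```python
import time

def heal_once(detector, remediator, verifier, escalator,
              attempts, max_attempts=3, window_seconds=600, now=None):
    """One pass of detect -> remediate -> verify -> escalate.

    All four callables are plain functions; `attempts` is a list of
    timestamps of recent remediation runs, mutated in place so the
    circuit breaker has memory across calls.
    """
    now = time.time() if now is None else now

    if not detector():
        return "healthy"

    # Circuit breaker: stop auto-fixing if we've tried too often recently.
    attempts[:] = [t for t in attempts if now - t < window_seconds]
    if len(attempts) >= max_attempts:
        escalator(f"remediation tried {len(attempts)}x in {window_seconds}s; paging")
        return "escalated"

    attempts.append(now)
    remediator()

    if verifier():
        return "remediated"
    escalator("remediation ran but verification failed; paging")
    return "escalated"
```

A scheduler (cron, a monitor hook, a long-running daemon) would call this on every detector interval; the caller decides what "escalate" means — page, ticket, or both.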

Worked example: service auto-restart

  • Detector: named service is not running for > 30 seconds.
  • Remediator: run systemctl restart $SERVICE.
  • Verifier: service is running AND healthcheck endpoint returns 200 within 60 seconds.
  • Escalator: if remediator has been invoked 3 times in the past 10 minutes, stop auto-restarting and page.

This one recipe saves a lot of 3 a.m. pages. It also requires a circuit breaker — without one, a genuinely broken service gets restarted in a flap loop that hides the real problem.
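The restart recipe's detector, remediator, and verifier might look like this on a systemd host. The service name and healthcheck URL are hypothetical; a `False` return from the verifier is the escalator's cue:

```python
import subprocess
import time
import urllib.request

SERVICE = "myapp"                             # hypothetical service name
HEALTH_URL = "http://localhost:8080/health"   # hypothetical endpoint

def service_running(name):
    """Detector: `systemctl is-active` exits 0 only when the unit is active."""
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", name]
    ).returncode == 0

def restart_and_verify(name, health_url, timeout=60):
    """Remediate (restart) and verify (running + HTTP 200) within timeout."""
    subprocess.run(["systemctl", "restart", name], check=True)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if service_running(name) and \
               urllib.request.urlopen(health_url, timeout=5).status == 200:
                return True
        except OSError:
            pass  # service still coming up; keep polling
        time.sleep(2)
    return False  # verifier failed -> caller escalates
```

The circuit breaker lives outside these functions, in whatever loop invokes them, so a flapping service hits the attempt limit instead of restarting forever.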

Worked example: disk space

  • Detector: disk usage > 85%.
  • Remediator: clean /tmp, vacuum old logs, purge package cache.
  • Verifier: disk usage < 80%.
  • Escalator: if disk still > 80%, open a ticket with a du report attached.

The remediation is conservative — it only touches known-safe paths. You never want self-healing that deletes something irrecoverable.
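The disk recipe follows the same shape. The cleanup commands below are examples of known-safe operations (stale /tmp files, journal vacuuming, the Debian-family package cache), not an exhaustive or universal list — adjust for your distro:

```python
import shutil
import subprocess

SAFE_CLEANUP = [
    ["find", "/tmp", "-type", "f", "-mtime", "+7", "-delete"],  # stale temp files
    ["journalctl", "--vacuum-time=14d"],                        # old journal logs
    ["apt-get", "clean"],                                       # package cache
]

def disk_usage_pct(path="/"):
    total, used, _ = shutil.disk_usage(path)
    return 100.0 * used / total

def remediate_disk(threshold=85.0, target=80.0, path="/"):
    """Run known-safe cleanups, then re-check. True means back under target."""
    if disk_usage_pct(path) <= threshold:
        return True  # detector would not have fired
    for cmd in SAFE_CLEANUP:
        subprocess.run(cmd, check=False)  # best-effort; never touch unknown paths
    return disk_usage_pct(path) < target  # False -> open a ticket with a du report
```

Note the hysteresis: the detector fires at 85% but the verifier demands under 80%, so a disk hovering at the threshold doesn't trigger a fire/clear/fire cycle.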

What to automate

Good candidates:

  • Known, recurring, low-risk failures (service crashes, disk pressure, log rotation)
  • Operations that have a clean rollback (restart, cache flush, config reload)
  • Operations with a clear success signal (healthcheck, metric threshold)

Bad candidates:

  • Novel failures (automation makes it worse)
  • High-blast-radius operations (restart the whole DB, truncate tables)
  • Operations without a verifiable success signal

The meta-metric

Track “auto-remediation success rate.” You want this high (70%+) but not 100% — a 100% rate means you’ve under-automated: the remediations you haven’t written yet are pages that are still firing.
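The bookkeeping is trivial if each remediation attempt records an outcome. A hedged sketch, assuming outcomes are logged as strings like "remediated" and "escalated":

```python
def remediation_success_rate(outcomes):
    """Percentage of remediation attempts that fixed the problem.

    `outcomes` is any iterable of outcome strings, one per attempt,
    e.g. collected from the self-healing loop's logs.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return None  # no attempts yet; nothing to measure
    return 100.0 * outcomes.count("remediated") / len(outcomes)
```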

What humans still own

  • Novel failure classes
  • Judgment calls (is this a real outage or a noisy monitor?)
  • Writing and reviewing new remediations
  • Deciding when to stop automating a thing (if it keeps firing, maybe the underlying problem needs a real fix)

Try it yourself

LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →
