From alerts to auto-fix: building self-healing IT systems

Alerts that only notify you about a problem are half a solution. Self-healing systems close the loop: they detect, act, and verify without waking a human, except when judgment is actually required.

What “self-healing” actually means

It does NOT mean “the system magically fixes itself.” It means:

Known, bounded failure modes have pre-written remediation.
The platform detects the failure, applies the remediation, verifies success.
Humans are paged only if the remediation fails, or if the failure is novel.

The architecture

Four components:

Detector. Monitor that fires on a specific condition.
Remediator. Script or API call that attempts a fix.
Verifier. A second check that confirms the fix worked.
Escalator. Pager / ticket if the verifier fails, or if the remediation has already been tried N times in a window.
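The four components above can be sketched as plain callables wired together in one pass. This is a minimal illustration, not a reference implementation: the names Recipe and run_once are hypothetical, and the sketch covers only the "verifier fails" escalation path, leaving out the "tried N times in a window" limit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recipe:
    """One self-healing recipe (hypothetical structure for illustration)."""
    detect: Callable[[], bool]        # True when the failure condition holds
    remediate: Callable[[], None]     # attempts the fix
    verify: Callable[[], bool]        # True when the fix demonstrably worked
    escalate: Callable[[str], None]   # pages a human or opens a ticket


def run_once(recipe: Recipe) -> str:
    """Run a single detect -> remediate -> verify pass and report the outcome."""
    if not recipe.detect():
        return "healthy"
    recipe.remediate()
    if recipe.verify():
        return "remediated"
    # The fix did not pass verification, so a human needs to look.
    recipe.escalate("remediation did not pass verification")
    return "escalated"
```

Keeping the verifier separate from the detector matters: the verifier can be stricter than "the alert stopped firing", as the worked examples show.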

Worked example: service auto-restart

Detector: named service is not running for > 30 seconds.
Remediator: run systemctl restart $SERVICE.
Verifier: service is running AND healthcheck endpoint returns 200 within 60 seconds.
Escalator: if remediator has been invoked 3 times in the past 10 minutes, stop auto-restarting and page.

This one recipe saves a lot of 3 a.m. pages. It also requires a circuit breaker: without one, a genuinely broken service gets restarted in a flap loop that hides the real problem.
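One way to build that circuit breaker is a sliding window over recent restart attempts. The sketch below assumes the limits from the recipe (3 attempts in 10 minutes); the class name RestartBreaker and the injectable clock parameter are hypothetical and exist only to make the idea concrete and testable.

```python
import time
from collections import deque


class RestartBreaker:
    """Allows at most max_attempts remediations inside a sliding time window."""

    def __init__(self, max_attempts: int = 3, window_seconds: float = 600.0,
                 clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the window can be tested
        self.attempts = deque()     # timestamps of recent attempts, oldest first

    def allow(self) -> bool:
        """Record an attempt and return True, or return False if the breaker is open."""
        now = self.clock()
        # Forget attempts that have aged out of the window.
        while self.attempts and now - self.attempts[0] >= self.window_seconds:
            self.attempts.popleft()
        if len(self.attempts) >= self.max_attempts:
            return False            # stop auto-restarting; the caller should page
        self.attempts.append(now)
        return True
```

The remediator would call allow() before each systemctl restart and hand off to the escalator as soon as it returns False.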

Worked example: disk space

Detector: disk usage > 85%.
Remediator: clean /tmp, vacuum old logs, purge package cache.
Verifier: disk usage < 80%.
Escalator: if disk usage is still at or above 80%, open a ticket with a du report attached.

The remediation is conservative; it only touches known-safe paths. You never want self-healing that deletes something irrecoverable.
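A conservative cleanup step might look like the following sketch. It is an illustration under stated assumptions: the helper names and the age-based policy are hypothetical, and the caller is expected to pass only directories it has already decided are safe to prune. It deletes old regular files and nothing else.

```python
import shutil
import time
from pathlib import Path


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def purge_old_files(directory: Path, max_age_days: float) -> int:
    """Delete regular files under `directory` older than max_age_days.

    Symlinks and directories are left alone. Returns the number of files removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for entry in list(directory.rglob("*")):
        try:
            if entry.is_symlink() or not entry.is_file():
                continue
            if entry.stat().st_mtime < cutoff:
                entry.unlink()
                removed += 1
        except OSError:
            # File vanished or is not ours to delete: skip it, never force.
            continue
    return removed
```

The verifier would then compare disk_used_percent() against the 80% target, and the escalator takes over if the cleanup did not free enough space.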

What to automate

Good candidates:

Known, recurring, low-risk failures (service crashes, disk pressure, log rotation)
Operations that have a clean rollback (restart, cache flush, config reload)
Operations with a clear success signal (healthcheck, metric threshold)

Bad candidates:

Novel failures (automation tends to make them worse)
High-blast-radius operations (restart the whole DB, truncate tables)
Operations without a verifiable success signal

The meta-metric

Track “auto-remediation success rate.” You want this high (70%+) but not 100%. A 100% rate means you’ve under-automated: the remediations you haven’t written yet are pagers that are still firing.
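The article does not spell out a formula, so the sketch below assumes one reasonable definition: verified fixes divided by remediation attempts, ignoring checks where nothing was wrong. The outcome labels ("remediated", "escalated", "healthy") are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def success_rate(outcomes: Iterable[str]) -> Optional[float]:
    """Fraction of remediation attempts that passed verification.

    Returns None when no remediation was attempted, so an idle period
    is not mistaken for a perfect score.
    """
    counts = Counter(outcomes)
    attempted = counts["remediated"] + counts["escalated"]
    if attempted == 0:
        return None
    return counts["remediated"] / attempted
```

Under this definition, seven verified fixes and three escalations give a rate of 0.7, right at the suggested floor.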

What humans still own

Novel failure classes
Judgment calls (is this a real outage or a noisy monitor?)
Writing and reviewing new remediations
Deciding when to stop automating a thing (if it keeps firing, maybe the underlying problem needs a real fix)

Two servers, free forever. Sign up at app.lynxtrac.com if any of this resonates.
