From alerts to auto-fix: building self-healing IT systems
Teams use LynxTrac automations to turn alerts into auto-remediation without waking a human.
Alerts that only notify you about a problem are half a solution. Self-healing systems close the loop (detect, act, verify) without waking a human except when judgment is actually required.
What “self-healing” actually means
It does NOT mean “the system magically fixes itself.” It means:
- Known, bounded failure modes have pre-written remediation.
- The platform detects the failure, applies the remediation, and verifies success.
- Humans are paged only if the remediation fails, or if the failure is novel.
The architecture
Four components:
- Detector. Monitor that fires on a specific condition.
- Remediator. Script or API call that attempts a fix.
- Verifier. A second check that confirms the fix worked.
- Escalator. Pager / ticket if the verifier fails, or if the remediation has already been tried N times in a window.
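The four components can be wired together as a small control loop. The sketch below is a minimal illustration in Python, not LynxTrac's API: the Playbook class, its field names, and run_playbook are all hypothetical names chosen for this example, and the attempt counter is assumed to be tracked by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Playbook:
    """Hypothetical container for one self-healing recipe."""
    detect: Callable[[], bool]       # True when the failure condition holds
    remediate: Callable[[], None]    # attempts the fix
    verify: Callable[[], bool]       # True when the fix worked
    escalate: Callable[[str], None]  # pages a human or opens a ticket


def run_playbook(pb: Playbook, attempts_in_window: int, max_attempts: int) -> str:
    """Run one detect/remediate/verify pass and return the outcome."""
    if not pb.detect():
        return "healthy"
    # Escalate instead of retrying once the remediation has been tried N times.
    if attempts_in_window >= max_attempts:
        pb.escalate("remediation already tried too many times in the window")
        return "escalated"
    pb.remediate()
    if pb.verify():
        return "remediated"
    pb.escalate("verification failed after remediation")
    return "escalated"
```

Keeping the verifier separate from the detector matters: "the process is up" and "the service answers its healthcheck" are often different checks, and only the second one confirms the fix.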
Worked example: service auto-restart
- Detector: named service is not running for > 30 seconds.
- Remediator: run systemctl restart $SERVICE.
- Verifier: service is running AND healthcheck endpoint returns 200 within 60 seconds.
- Escalator: if remediator has been invoked 3 times in the past 10 minutes, stop auto-restarting and page.
This one recipe saves a lot of 3 a.m. pages. It also requires a circuit breaker: without one, a genuinely broken service gets restarted in a flap loop that hides the real problem.
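The circuit breaker in the escalator step can be sketched as a sliding window of attempt timestamps. This is one possible implementation, not a LynxTrac feature: RestartBreaker and allow are hypothetical names, and the defaults mirror the 3 attempts in 10 minutes from the recipe above.

```python
import time
from collections import deque
from typing import Optional


class RestartBreaker:
    """Allows at most max_attempts remediations per window_seconds."""

    def __init__(self, max_attempts: int = 3, window_seconds: float = 600.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = deque()  # timestamps of recent remediation attempts

    def allow(self, now: Optional[float] = None) -> bool:
        """Record an attempt and return True, or return False if the breaker is open."""
        now = time.monotonic() if now is None else now
        # Drop attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.window_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_attempts:
            return False  # stop auto-restarting; the caller should page
        self._attempts.append(now)
        return True
```

The remediator would call allow() before each restart and hand off to the escalator when it returns False, so a service that keeps crashing surfaces to a human instead of flapping quietly.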
Worked example: disk space
- Detector: disk usage > 85%.
- Remediator: clean /tmp, vacuum old logs, purge package cache.
- Verifier: disk usage < 80%.
- Escalator: if disk still > 80%, open a ticket with a du report attached.
The remediation is conservative; it only touches known-safe paths. You never want self-healing that deletes something irrecoverable.
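A conservative cleanup of this kind might look like the following Python sketch. It is an illustration under stated assumptions: SAFE_ROOTS, usage_percent, and purge_old_files are hypothetical names, and the allowlist here contains only /tmp. A real remediation would add whichever log and cache paths you have confirmed are safe to purge.

```python
import shutil
import time
from pathlib import Path

# Assumption: only these directories are treated as known-safe in this sketch.
SAFE_ROOTS = (Path("/tmp"),)


def usage_percent(path: str = "/") -> float:
    """Return used space on the filesystem holding path, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def purge_old_files(directory: Path, max_age_days: float, safe_roots=SAFE_ROOTS) -> int:
    """Delete regular files older than max_age_days; refuse paths outside safe_roots."""
    directory = directory.resolve()
    if not any(directory.is_relative_to(root.resolve()) for root in safe_roots):
        raise ValueError(f"{directory} is not under a known-safe root")
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for entry in directory.iterdir():
        # Skip symlinks and subdirectories to keep the blast radius small.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed += 1
    return removed
```

The detector and verifier then reduce to two comparisons on usage_percent(): above 85 triggers the purge, and anything still at or above 80 afterwards goes to the escalator.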
What to automate
Good candidates:
- Known, recurring, low-risk failures (service crashes, disk pressure, log rotation)
- Operations that have a clean rollback (restart, cache flush, config reload)
- Operations with a clear success signal (healthcheck, metric threshold)
Bad candidates:
- Novel failures (automation can make them worse)
- High-blast-radius operations (restart the whole DB, truncate tables)
- Operations without a verifiable success signal
The meta-metric
Track “auto-remediation success rate.” You want this high (70%+) but not 100%. A 100% rate means you’ve under-automated: only the easiest failures have remediations, and the ones you haven’t written yet are pagers that are still firing.
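One way to compute the rate, assuming each remediation run is logged with an outcome label, is sketched below. The labels "remediated" and "escalated" and the function name are assumptions made for this example; use whatever your platform actually records.

```python
from collections import Counter
from typing import Iterable, Optional


def auto_remediation_success_rate(outcomes: Iterable[str]) -> Optional[float]:
    """Fraction of remediation attempts that were verified as successful."""
    counts = Counter(outcomes)
    attempts = counts["remediated"] + counts["escalated"]
    if attempts == 0:
        return None  # nothing was attempted, so the rate is undefined
    return counts["remediated"] / attempts
```

Outcomes that are neither label (for example, checks that found the system healthy) are ignored, so the rate reflects only runs where a fix was actually attempted.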
What humans still own
- Novel failure classes
- Judgment calls (is this a real outage or a noisy monitor?)
- Writing and reviewing new remediations
- Deciding when to stop automating a thing (if it keeps firing, maybe the underlying problem needs a real fix)
Two servers, free forever. Sign up at app.lynxtrac.com if any of this resonates.