Automation · 3 min read

From alerts to auto-fix: building self-healing IT systems

Alerts that only notify you about a problem are half a solution. Here's how teams use LynxTrac automations to turn alerts into auto-remediation.

Self-healing systems close the loop — detect, act, verify — without waking a human except when judgment is actually required.

What “self-healing” actually means

It does NOT mean “the system magically fixes itself.” It means:

  1. Known, bounded failure modes have pre-written remediation.
  2. The platform detects the failure, applies the remediation, verifies success.
  3. Humans are paged only if the remediation fails, or if the failure is novel.

The architecture

Four components:

  • Detector. Monitor that fires on a specific condition.
  • Remediator. Script or API call that attempts a fix.
  • Verifier. A second check that confirms the fix worked.
  • Escalator. Pager / ticket if the verifier fails, or if the remediation has already been tried N times in a window.
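The wiring between the four components can be sketched as a single pass of the loop. Everything below (function names, the `attempts` list, the outcome strings) is illustrative, not a LynxTrac API — a minimal sketch of the pattern, with the circuit breaker built in:

```python
import time

def heal_once(detector, remediator, verifier, escalator,
              attempts, max_attempts=3, window_seconds=600, now=None):
    """One pass of detect -> remediate -> verify -> escalate.

    All four callables are plain functions; `attempts` is a list of
    timestamps of recent remediation runs, mutated in place so the
    circuit breaker has memory across calls.
    """
    now = time.time() if now is None else now

    if not detector():
        return "healthy"

    # Circuit breaker: stop auto-fixing if we've tried too often recently.
    attempts[:] = [t for t in attempts if now - t < window_seconds]
    if len(attempts) >= max_attempts:
        escalator(f"remediation tried {len(attempts)}x in {window_seconds}s; paging")
        return "escalated"

    attempts.append(now)
    remediator()

    if verifier():
        return "remediated"
    escalator("remediation ran but verification failed; paging")
    return "escalated"
```

A scheduler (cron, a monitor hook, a long-running daemon) would call this on every detector interval; the caller decides what "escalate" means — page, ticket, or both.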

Worked example: service auto-restart

  • Detector: named service is not running for > 30 seconds.
  • Remediator: run systemctl restart $SERVICE.
  • Verifier: service is running AND healthcheck endpoint returns 200 within 60 seconds.
  • Escalator: if remediator has been invoked 3 times in the past 10 minutes, stop auto-restarting and page.

This one recipe saves a lot of 3 a.m. pages. It also requires a circuit breaker — without one, a genuinely broken service gets restarted in a flap loop that hides the real problem.
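The restart recipe's detector, remediator, and verifier might look like this on a systemd host. The service name and healthcheck URL are hypothetical; a `False` return from the verifier is the escalator's cue:

```python
import subprocess
import time
import urllib.request

SERVICE = "myapp"                             # hypothetical service name
HEALTH_URL = "http://localhost:8080/health"   # hypothetical endpoint

def service_running(name):
    """Detector: `systemctl is-active` exits 0 only when the unit is active."""
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", name]
    ).returncode == 0

def restart_and_verify(name, health_url, timeout=60):
    """Remediate (restart) and verify (running + HTTP 200) within timeout."""
    subprocess.run(["systemctl", "restart", name], check=True)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if service_running(name) and \
               urllib.request.urlopen(health_url, timeout=5).status == 200:
                return True
        except OSError:
            pass  # service still coming up; keep polling
        time.sleep(2)
    return False  # verifier failed -> caller escalates
```

The circuit breaker lives outside these functions, in whatever loop invokes them, so a flapping service hits the attempt limit instead of restarting forever.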

Worked example: disk space

  • Detector: disk usage > 85%.
  • Remediator: clean /tmp, vacuum old logs, purge package cache.
  • Verifier: disk usage < 80%.
  • Escalator: if disk still > 80%, open a ticket with a du report attached.

The remediation is conservative — it only touches known-safe paths. You never want self-healing that deletes something irrecoverable.
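The disk recipe follows the same shape. The cleanup commands below are examples of known-safe operations (stale /tmp files, journal vacuuming, the Debian-family package cache), not an exhaustive or universal list — adjust for your distro:

```python
import shutil
import subprocess

SAFE_CLEANUP = [
    ["find", "/tmp", "-type", "f", "-mtime", "+7", "-delete"],  # stale temp files
    ["journalctl", "--vacuum-time=14d"],                        # old journal logs
    ["apt-get", "clean"],                                       # package cache
]

def disk_usage_pct(path="/"):
    total, used, _ = shutil.disk_usage(path)
    return 100.0 * used / total

def remediate_disk(threshold=85.0, target=80.0, path="/"):
    """Run known-safe cleanups, then re-check. True means back under target."""
    if disk_usage_pct(path) <= threshold:
        return True  # detector would not have fired
    for cmd in SAFE_CLEANUP:
        subprocess.run(cmd, check=False)  # best-effort; never touch unknown paths
    return disk_usage_pct(path) < target  # False -> open a ticket with a du report
```

Note the hysteresis: the detector fires at 85% but the verifier demands under 80%, so a disk hovering at the threshold doesn't trigger a fire/clear/fire cycle.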

What to automate

Good candidates:

  • Known, recurring, low-risk failures (service crashes, disk pressure, log rotation)
  • Operations that have a clean rollback (restart, cache flush, config reload)
  • Operations with a clear success signal (healthcheck, metric threshold)

Bad candidates:

  • Novel failures (automation makes it worse)
  • High-blast-radius operations (restart the whole DB, truncate tables)
  • Operations without a verifiable success signal

The meta-metric

Track “auto-remediation success rate.” You want this high (70%+) but not 100% — a 100% rate means you’ve under-automated: the remediations you haven’t written yet are pages that are still firing.
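The bookkeeping is trivial if each remediation attempt records an outcome. A hedged sketch, assuming outcomes are logged as strings like "remediated" and "escalated":

```python
def remediation_success_rate(outcomes):
    """Percentage of remediation attempts that fixed the problem.

    `outcomes` is any iterable of outcome strings, one per attempt,
    e.g. collected from the self-healing loop's logs.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return None  # no attempts yet; nothing to measure
    return 100.0 * outcomes.count("remediated") / len(outcomes)
```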

What humans still own

  • Novel failure classes
  • Judgment calls (is this a real outage or a noisy monitor?)
  • Writing and reviewing new remediations
  • Deciding when to stop automating a thing (if it keeps firing, maybe the underlying problem needs a real fix)

Try it yourself

LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →
