RMM automation recipes: workflows that save hours every week
Seven specific automations our customers run across their fleets, ranked by how often they fire and how much pager noise they prevent.
Every team I’ve worked with has some version of the same observation: the thing eating up ops capacity isn’t the hard problems, it’s the repetitive ones. The 2am page that’s always the same service, the weekly disk cleanup, the patch-Tuesday scramble. These are all scriptable. Most teams just never get around to scripting them, because the individual pain isn’t bad enough to justify the work.
Ranked by how much quiet they buy you, these are the automations our customers stand up first.
1. Service auto-restart with a circuit breaker
Trigger: a named service is down for more than 30 seconds.
Action: restart it. If it fails three times in ten minutes, stop restarting and page a human.
The circuit breaker matters. Without it, you get a flapping service that stays “technically up” forever while masking the real failure. With it, you handle the 95% of the time it’s a transient crash and still surface the 5% that needs attention.
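The breaker logic is small enough to sketch. This is a minimal illustration, not how any particular RMM implements it: the class name RestartBreaker and the method on_failure are made up for the example, and it assumes "fails three times in ten minutes" means the third failure inside a sliding 600-second window is the one that pages.

```python
from collections import deque


class RestartBreaker:
    """Decide between restarting a service and paging a human.

    Hypothetical helper: the names and the sliding-window reading of
    "three failures in ten minutes" are assumptions for illustration.
    """

    def __init__(self, max_failures=3, window_seconds=600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.failures = deque()  # timestamps of recent failures, oldest first

    def on_failure(self, now):
        """Record a failure seen at time `now` (seconds) and return
        "restart" or "page"."""
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            return "page"  # breaker open: stop restarting, escalate
        return "restart"
```

In a real check loop you would pass time.monotonic() as now, and run the restart or the page depending on the returned string.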
2. Disk space reclamation
Trigger: disk usage above 85% on any volume.
Action: clean a defined safelist of directories (package caches, old log archives, build artifacts), re-check, and if usage is still above 80%, open a ticket with a du report attached. The re-check threshold sits below the trigger on purpose: a cleanup that only just gets under 85% still produces a ticket.
Teams that put this in place report it as their single biggest drop in pager volume. Disk-full is an astonishingly common cause of 3am incidents, and it’s almost always a slow accumulation of cache files rather than real usage.
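Here is a rough sketch of the trigger, clean, re-check flow using only the Python standard library. The function names and the measure parameter are assumptions made for the example, and the du report and ticket creation are left out because they depend on your tooling. Note that clean_safelist deletes everything inside the directories you give it, so the safelist needs to be curated carefully.

```python
import shutil
from pathlib import Path

TRIGGER_PCT = 85.0  # start cleaning above this
TICKET_PCT = 80.0   # open a ticket if still above this after cleaning


def usage_percent(volume):
    """Percentage of the volume currently in use."""
    usage = shutil.disk_usage(volume)
    return 100.0 * usage.used / usage.total


def clean_safelist(directories):
    """Delete the contents of each safelisted directory (not the
    directory itself). Returns the number of top-level entries removed."""
    removed = 0
    for directory in map(Path, directories):
        if not directory.is_dir():
            continue  # skip safelist entries that do not exist on this host
        for entry in directory.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()  # regular files and symlinks
            removed += 1
    return removed


def reclaim(volume, safelist, measure=usage_percent):
    """Return "ok", "reclaimed", or "ticket" for one volume."""
    if measure(volume) <= TRIGGER_PCT:
        return "ok"
    clean_safelist(safelist)
    return "ticket" if measure(volume) > TICKET_PCT else "reclaimed"
```

The measure parameter exists so the decision logic can be exercised with fake readings before it is pointed at a real volume.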
3. Certificate renewal
Trigger: a cert on a managed host expires in less than 14 days.
Action: generate a CSR, submit to the internal CA, install the new cert, restart dependent services, verify.
Cert expiry outages are 100% preventable and somehow still happen everywhere. Automating renewal is among the highest-leverage improvements a team can make. Measure it: how many of your last ten surprise outages were cert-related? If the answer is more than zero, this recipe is overdue.
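The trigger half of this recipe is easy to sketch; the renewal half (CSR, CA submission, install) depends entirely on your CA and is not shown. The example below assumes you already have the certificate's notAfter string in the format Python's ssl module reports, and the function names are made up for illustration.

```python
import ssl
from datetime import datetime, timezone

RENEW_WINDOW_DAYS = 14  # the recipe's trigger threshold


def days_until_expiry(not_after, now):
    """Days between `now` (an aware UTC datetime) and the certificate's
    notAfter time, given as a string like "Jan  5 09:34:43 2018 GMT"."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after, now):
    """True when the certificate expires in less than the renewal window."""
    return days_until_expiry(not_after, now) < RENEW_WINDOW_DAYS
```

The notAfter string is the same value that ssl.SSLSocket.getpeercert() returns under the "notAfter" key, so a scheduled job can feed it straight in along with datetime.now(timezone.utc).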
4. Configuration drift detection
Trigger: nightly scheduled run.
Action: hash a defined set of config files against a gold state. For any drift, either remediate automatically (low-risk files) or open a ticket with a diff (high-risk files).
The interesting finding here is how often the drift is benign: a senior engineer SSH’d in to fix something three months ago and forgot to update the baseline. That’s the pattern to catch, not a malicious actor. Run it for a month and you’ll discover your gold state is wrong in about six places.
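A minimal sketch of the hash-and-compare step, assuming the gold state is stored as a mapping from file path to expected SHA-256 digest. The function names and the low_risk set are hypothetical; producing the diff and doing the actual remediation are left to whatever tooling you already use.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_drift(baseline, low_risk):
    """Compare files on disk against the gold state.

    baseline: dict mapping path -> expected SHA-256 hex digest.
    low_risk: set of paths that are safe to remediate automatically.
    Returns (to_remediate, to_ticket), two lists of drifted paths.
    """
    to_remediate, to_ticket = [], []
    for path, expected in baseline.items():
        p = Path(path)
        # A missing file counts as drift, same as a changed one.
        actual = sha256_of(p) if p.is_file() else None
        if actual == expected:
            continue
        if path in low_risk:
            to_remediate.append(path)
        else:
            to_ticket.append(path)
    return to_remediate, to_ticket
```

When the drift turns out to be the benign kind, the fix is to regenerate the baseline entry with sha256_of rather than to revert the file.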
5. Onboarding pipeline
Trigger: new-hire ticket with a username.
Action: provision the laptop from a gold image, enroll in monitoring and log shipping, assign to the right RBAC group, send a welcome email with access instructions.
This is usually a three-person workflow stretched over a week. As a pipeline, it takes about two hours of wall-clock time, and the new hire is productive on day one instead of day four.
6. Backup verification
Trigger: weekly, on Friday afternoon.
Action: pick the most recent backup of the most critical system, restore it to a sandbox host, run a health check, archive the result.
This is the automation that most teams defer and then regret. Untested backups are a latent incident. Running a restore drill weekly catches silent failures (broken credentials, corrupt tarballs, filesystem issues) when it’s cheap to fix them.
7. Capacity forecasting
Trigger: weekly, on Monday morning.
Action: pull 28 days of disk, memory, and CPU trends per host. Project forward. Ticket any host projected to exhaust a resource within 30 days.
Not a pager-saver directly, but it shifts capacity work from reactive (“disk is full and we need to add storage right now”) to proactive (“we’ll need storage in three weeks, let’s add it Tuesday”). The trend data is already there; the automation just makes someone look at it.
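The projection itself can be as simple as a straight-line fit. The sketch below assumes one usage reading per day, expressed as a percentage, and uses an ordinary least-squares line; the function names are made up for the example, and a real fleet may need something smarter than a linear trend for bursty or seasonal workloads.

```python
def days_until_full(samples, capacity=100.0):
    """Project when usage reaches `capacity` from a linear trend.

    samples: one usage reading per day, oldest first (e.g. 28 values).
    Returns the projected number of days from the latest reading, or
    None if there is too little data or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx  # growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_today = intercept + slope * (n - 1)
    return max(0.0, (capacity - fitted_today) / slope)


def should_ticket(samples, horizon_days=30):
    """True when the trend exhausts the resource within the horizon."""
    remaining = days_until_full(samples)
    return remaining is not None and remaining <= horizon_days
```

For example, a disk that starts at 50% and grows one point a day for 28 days is projected to fill in about 23 days, which lands inside the 30-day horizon and gets a ticket.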
How to sequence these
Start with #2 or #1, depending on which kind of page fires more often for you. Those two alone usually drop pager volume by 40% or more. Add #3 within the first month; the cost-to-install is low and the payoff is “we didn’t have an outage.”
Leave the capacity and drift ones for later. They’re valuable, but they’re process automations rather than pager-savers, and the marginal time they free up is smaller.
And a soft rule we’ve found useful: don’t automate something you haven’t done manually at least twice. You’ll get the automation wrong the first time.
More on how this works in practice: the features overview, or email [email protected] with questions.