Automation · 4 min read

RMM automation recipes: workflows that save hours every week

Seven specific automations our customers run across their fleets, ranked by how often they fire and how much pager noise they prevent.

Every team I’ve worked with has some version of the same observation: the thing eating up ops capacity isn’t the hard problems, it’s the repetitive ones. The 2am page that’s always the same service, the weekly disk cleanup, the patch-Tuesday scramble. These are all scriptable. Most teams just never get around to scripting them, because the individual pain isn’t bad enough to justify the work.

Ranked by how much quiet they buy you, these are the automations our customers stand up first.

1. Service auto-restart with a circuit breaker

Trigger: a named service is down for more than 30 seconds.

Action: restart it. If it fails three times in ten minutes, stop restarting and page a human.

The circuit breaker matters. Without it, you get a flapping service that stays “technically up” forever while masking the real failure. With it, you handle the 95% of the time it’s a transient crash and still surface the 5% that needs attention.
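A minimal Python sketch of the breaker logic, assuming the actual restart and paging calls live elsewhere. The class name `RestartBreaker` and the injected `clock` are illustrative choices, not any RMM product's API. It reads "fails three times in ten minutes" as: the third failure inside a sliding ten-minute window pages instead of restarting.

```python
import time
from collections import deque
from typing import Callable, Deque


class RestartBreaker:
    """Decide between restarting and paging, based on recent failures."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures: Deque[float] = deque()

    def on_failure(self) -> str:
        """Record one failure and return "restart" or "page"."""
        now = self.clock()
        self.failures.append(now)
        # Forget failures that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            return "page"  # Breaker open: stop restarting, get a human.
        return "restart"
```

Passing the clock in keeps the window logic testable without sleeping. The breaker closes again on its own once the old failures age out of the window.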

2. Disk space reclamation

Trigger: disk usage above 85% on any volume.

Action: clean a defined safelist of directories (package caches, old log archives, build artifacts), re-check, and if still above 80%, open a ticket with a du report attached.

Teams that deploy this report the single biggest drop in pager volume. Disk-full is an astonishingly common cause of 3am incidents, and it’s almost always a slow accumulation of cache files rather than real usage.
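One way to sketch the clean-and-re-check flow in Python, assuming the safelist is a plain list of directories whose contents are always safe to delete. The function names and the returned strings are hypothetical, and attaching the du report to the ticket is left out.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def percent_used(volume: str) -> float:
    """Return the percentage of the volume currently in use."""
    usage = shutil.disk_usage(volume)
    return 100.0 * usage.used / usage.total


def clean_safelist(directories: Iterable[Path]) -> int:
    """Empty each safelisted directory; return bytes of regular files removed."""
    freed = 0
    for directory in directories:
        if not directory.is_dir():
            continue  # A missing safelist entry is skipped, not an error.
        for entry in list(directory.iterdir()):
            if entry.is_symlink():
                entry.unlink()  # Remove the link itself, never its target.
            elif entry.is_dir():
                freed += sum(p.stat().st_size for p in entry.rglob("*")
                             if p.is_file() and not p.is_symlink())
                shutil.rmtree(entry)
            else:
                freed += entry.stat().st_size
                entry.unlink()
    return freed


def reclaim(volume: str, safelist: Iterable[Path],
            trigger: float = 85.0, target: float = 80.0,
            used: Callable[[str], float] = percent_used) -> str:
    """Return "ok", "reclaimed", or "ticket" for one volume."""
    if used(volume) <= trigger:
        return "ok"
    clean_safelist(safelist)
    # Re-check against the lower target, as in the recipe.
    return "ticket" if used(volume) > target else "reclaimed"
```

The cleaner only ever touches the contents of directories it was handed, which is the point of a safelist: nothing outside it can be deleted by a bad pattern.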

3. Certificate renewal

Trigger: a cert on a managed host expires in less than 14 days.

Action: generate a CSR, submit to the internal CA, install the new cert, restart dependent services, verify.

Cert expiry outages are entirely preventable and somehow still happen everywhere. Automating renewal is among the highest-leverage improvements a team can make. Measure it: how many of your last ten surprise outages were cert-related? If the answer is more than zero, this recipe is overdue.
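The trigger half of this recipe is easy to sketch in Python. The example below assumes you already have the certificate's expiry as a `notAfter` string in the format Python's `ssl` module reports (for example from `SSLSocket.getpeercert()`). The function names are hypothetical, and the CSR, CA submission, and install steps depend on your CA, so they are not shown.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between now and the certificate's notAfter timestamp."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def needs_renewal(not_after: str, threshold_days: float = 14.0,
                  now: Optional[float] = None) -> bool:
    """True when the cert expires in less than threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

An already expired cert yields a negative number of days, so it also triggers renewal.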

4. Configuration drift detection

Trigger: nightly scheduled run.

Action: hash a defined set of config files against a gold state. For any drift, either remediate automatically (low-risk files) or open a ticket with a diff (high-risk files).

The interesting finding here is how often the drift is benign: a senior engineer SSH’d in to fix something three months ago and forgot to update the baseline. That’s the pattern to catch, not a malicious actor. Run it for a month and you’ll discover your gold state is wrong in about six places.
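A rough Python sketch of the hash-and-compare step, assuming the gold state is stored as a mapping from file path to expected SHA-256 digest. The names `find_drift` and `triage`, and the idea of passing the low-risk files as a set, are illustrative assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Set


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(baseline: Dict[str, str]) -> List[str]:
    """Return the paths whose current hash differs from the gold state."""
    drifted = []
    for name, expected in sorted(baseline.items()):
        path = Path(name)
        # A missing file counts as drift too.
        actual = sha256_of(path) if path.is_file() else None
        if actual != expected:
            drifted.append(name)
    return drifted


def triage(drifted: List[str], low_risk: Set[str]) -> Dict[str, str]:
    """Map each drifted path to "remediate" (low risk) or "ticket"."""
    return {name: ("remediate" if name in low_risk else "ticket")
            for name in drifted}
```

When the drift turns out to be the benign kind described above, the fix is to regenerate the baseline entry rather than revert the file.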

5. Onboarding pipeline

Trigger: new-hire ticket with a username.

Action: provision the laptop from a gold image, enroll in monitoring and log shipping, assign to the right RBAC group, send a welcome email with access instructions.

This is usually a three-person workflow stretched over a week. As a pipeline, it takes about two hours of wall-clock time, and the new hire is productive on day one instead of day four.

6. Backup verification

Trigger: weekly, on Friday afternoon.

Action: pick the most recent backup of the most critical system, restore it to a sandbox host, run a health check, archive the result.

This is the automation that most teams defer and then regret. Untested backups are a latent incident. Running a restore drill weekly catches silent failures (broken credentials, corrupt tarballs, filesystem issues) when it’s cheap to fix them.
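As a small-scale illustration of the restore drill, here is a Python sketch that assumes the backup is a tar archive and the "sandbox host" is just a temporary directory. The `restore_and_check` name and the `health_check` callback are hypothetical, and a real drill would restore onto an isolated machine instead.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def restore_and_check(archive: Path,
                      health_check: Callable[[Path], bool]) -> bool:
    """Restore the archive into a throwaway directory and run a health check."""
    try:
        with tempfile.TemporaryDirectory() as sandbox:
            with tarfile.open(archive) as tar:
                # Use the safer "data" extraction filter where Python has it.
                kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
                tar.extractall(sandbox, **kwargs)
            return bool(health_check(Path(sandbox)))
    except (tarfile.TarError, OSError, EOFError):
        # Corrupt, truncated, or unreadable archives count as a failed drill.
        return False
```

The health check is whatever proves the data is usable, such as confirming a key file exists or a database dump loads. Extraction succeeding is not enough.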

7. Capacity forecasting

Trigger: weekly, on Monday morning.

Action: pull 28 days of disk, memory, and CPU trends per host. Project forward. Ticket any host projected to exhaust a resource within 30 days.

Not a pager-saver directly, but it shifts capacity work from reactive (“disk is full and we need to add storage right now”) to proactive (“we’ll need storage in three weeks, let’s add it Tuesday”). The trend data is already there; the automation just makes someone look at it.
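The projection can be as simple as a straight-line fit. The Python sketch below assumes one usage sample per day per resource and uses ordinary least squares. The function names are hypothetical, and a linear trend is an assumption that will not suit bursty workloads.

```python
from typing import Optional, Sequence


def days_until_exhaustion(daily_usage: Sequence[float],
                          capacity: float) -> Optional[float]:
    """Fit a line to daily samples; return days from the last sample to capacity.

    Returns None when usage is flat or falling, or when there is too little data.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2.0
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index where the fitted line hits capacity, minus the last sample's index.
    days = (capacity - intercept) / slope - (n - 1)
    return max(days, 0.0)


def should_ticket(daily_usage: Sequence[float], capacity: float,
                  horizon_days: float = 30.0) -> bool:
    """True when the projection exhausts the resource within the horizon."""
    days = days_until_exhaustion(daily_usage, capacity)
    return days is not None and days <= horizon_days
```

For example, 28 days of disk usage growing by 2 GB a day from 100 GB reaches a 200 GB volume roughly 23 days after the last sample, which falls inside the 30-day horizon and would open a ticket.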

How to sequence these

Start with #2 (disk space) or #1 (service restart), depending on which kind of page fires more often for you. Those two alone usually drop pager volume by 40%+. Add #3 within the first month; the cost-to-install is low and the payoff is “we didn’t have an outage.”

Leave capacity forecasting (#7) and drift detection (#4) for later. They’re valuable, but they’re process automations rather than pager-savers, and the marginal time they free up is smaller.

And a soft rule we’ve found useful: don’t automate something you haven’t done manually at least twice. You’ll get the automation wrong the first time.


