It’s 2:47 a.m. The pager goes off. You roll over, open your laptop, and the VPN won’t connect. Either the concentrator is having a moment, your ISP is doing something creative, or the on-call playbook from 2021 still assumes you’re at the office. This post is about how to keep responding anyway.
The core problem
The VPN is load-bearing for most incident response runbooks. When it’s down, you can’t reach the affected system, and frequently the affected system is the very thing taking the VPN down. Repairing a failed VPN concentrator while ops is paging you is the opposite of a fast recovery.
The substitute: outbound-agent access
LynxTrac and similar outbound-tunnel tools don’t depend on your VPN, because the agent on the target host is already connected outbound to a relay. You authenticate to the relay via SSO and get a shell or a desktop session regardless of your VPN state.
Practical consequence: if your VPN is down, you can still recover the services that matter.
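To make the outbound model concrete, here is a toy sketch in Python. It is not LynxTrac’s protocol (the post does not describe that); the names relay_once, agent_loop and run_via_relay are invented for illustration, everything runs on localhost, and the SSO step is not modelled. The one thing it is meant to show is the direction of the connections: the target never listens for inbound traffic, it dials out, and the responder only ever talks to the relay.

```python
import socket
import threading


def relay_once(server: socket.socket) -> None:
    """Toy relay: pair one agent connection with one operator connection."""
    agent, _ = server.accept()     # the target's agent dialed out to the relay earlier
    operator, _ = server.accept()  # the responder arrives later (SSO is not modelled)
    with agent, operator:
        command = operator.makefile("r").readline()
        agent.sendall(command.encode())       # forwarded over the agent's own connection
        reply = agent.makefile("r").readline()
        operator.sendall(reply.encode())


def agent_loop(conn: socket.socket, handlers: dict) -> None:
    """Toy agent: answer one command on the connection it opened itself."""
    with conn:
        command = conn.makefile("r").readline().strip()
        handler = handlers.get(command, lambda: "unknown command")
        conn.sendall((handler() + "\n").encode())


def run_via_relay(command: str, handlers: dict) -> str:
    """Send one command to the agent through the relay and return the reply."""
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        threading.Thread(target=relay_once, args=(server,), daemon=True).start()

        # Target side: an outbound connection only, no listening socket on the host.
        agent_conn = socket.create_connection(("127.0.0.1", port))
        threading.Thread(
            target=agent_loop, args=(agent_conn, handlers), daemon=True
        ).start()

        # Responder side: connects to the relay, never directly to the target.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as operator:
            operator.sendall((command + "\n").encode())
            return operator.makefile("r").readline().strip()
```

Calling run_via_relay("uptime", {"uptime": lambda: "up 3 days"}) round-trips the command through the relay. A real relay would add authentication, multiplexing and reconnect logic, none of which this sketch attempts.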
The runbook
- Open the dashboard. You need monitoring data first. Without context, you are flailing.
- Confirm the alert. Is it a real outage or a noisy monitor? Five seconds spent here cost almost nothing and can save you from fixing something that isn’t broken.
- Get a shell. Click the affected host and open a terminal. You now have the same access you would have had over the VPN.
- Collect before you fix. Grab logs, metrics, process tree. You will want this for the post-mortem.
- Act. Run your remediation. Document what you did in the session (LynxTrac auto-captures the keystrokes anyway).
- Verify. Monitor the host for 5 minutes after the fix; declaring recovery too early is a common cause of reopened incidents.
- Hand off or sleep. Update the ticket, tag the on-call follow-up, go back to bed.
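Two of the steps above, "collect before you fix" and "verify", are easy to script once you have a shell. The following is a minimal sketch, not part of LynxTrac: the function names are hypothetical, the diagnostic commands are whatever you choose to pass in, and the 300-second window with a 15-second poll interval is just one way to spend the 5 minutes mentioned in the list.

```python
import subprocess
import time
from pathlib import Path


def collect_evidence(commands: dict, out_dir: Path) -> list:
    """Run each diagnostic command and save its output before any fix is attempted."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            text = result.stdout + result.stderr
        except (OSError, subprocess.SubprocessError) as exc:
            # A missing or hung tool should not block the rest of the collection.
            text = f"could not run {argv!r}: {exc}\n"
        path = out_dir / f"{name}.txt"
        path.write_text(text)
        saved.append(path)
    return saved


def verify_recovery(check, duration_s=300, interval_s=15, sleep=time.sleep):
    """Return True only if check() passes on every poll across the whole window."""
    polls = max(1, int(duration_s // interval_s))
    for _ in range(polls):
        if not check():
            return False  # one failed poll means the fix did not hold
        sleep(interval_s)
    return True
```

On a Linux host you might pass something like {"processes": ["ps", "-ef"], "journal": ["journalctl", "-n", "500", "--no-pager"]} to collect_evidence; those commands are an assumption about your environment, not a recommendation from the post. The check argument to verify_recovery is any zero-argument callable that returns True while the service is healthy.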
What to watch
If your access depends on a single relay region, a relay-region outage breaks your response. LynxTrac relays run multi-region with automatic failover, but verify this on a non-incident day with a tabletop exercise.
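As a starting point for that non-incident-day check, the sketch below probes one relay endpoint per region and reports whether you would still have a healthy region if any single one disappeared. The region names and hostnames are placeholders (the post does not list LynxTrac’s relay endpoints), and a plain TCP connect on port 443 is an assumed stand-in for whatever health check your vendor actually exposes.

```python
import socket


def tcp_probe(host: str, port: int = 443, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers DNS failures, refusals and timeouts
        return False


def relay_report(regions: dict, probe=tcp_probe) -> dict:
    """Probe each region's relay and summarise single-region-loss tolerance."""
    status = {region: probe(host) for region, host in regions.items()}
    healthy = [region for region, ok in status.items() if ok]
    return {
        "status": status,
        "healthy": healthy,
        # With two or more healthy regions, losing any one still leaves a path.
        "survives_single_region_loss": len(healthy) >= 2,
    }


# Hypothetical endpoints; replace with the relay hostnames your vendor documents.
EXAMPLE_REGIONS = {
    "us": "relay-us.example.com",
    "eu": "relay-eu.example.com",
}
```

A reachability probe from your laptop only tells you the relays answer. It does not prove that agents fail over between regions, so it complements the tabletop exercise rather than replacing it.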
Also: the control plane is now part of your critical path. Treat it with the same uptime rigor you’d want for your status page.
The meta-lesson
Every piece of infrastructure in your incident response runbook is itself subject to incidents. The goal isn’t to remove dependencies (you can’t); it’s to make sure the dependencies are more reliable than what you’re responding to.
Outbound tunnels are not immune to outages. They are, empirically, much more reliable than self-hosted VPN concentrators, because the failure modes that plague concentrators (NAT traversal, IP rotation, client version drift) are simply not part of the model.
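If you want to put rough numbers on "more reliable than what you’re responding to", the usual back-of-the-envelope approach is to multiply the availability of every component on the access path. The sketch below does that; the function names are hypothetical, any figures you feed it are your own estimates, and it assumes the components fail independently, which is optimistic given that an incident can take the VPN down with it.

```python
HOURS_PER_YEAR = 24 * 365


def chain_availability(*components: float) -> float:
    """Availability of a path where every component must be up (independence assumed)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year during which the path is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

For example, a component that is up 99.9% of the time is down for about 8.76 hours a year, and each additional component in the chain can only lower the product. Comparing downtime_hours_per_year(chain_availability(...)) for your VPN path and your relay path gives a first-order view of which one you would rather depend on at 2:47 a.m.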
Two servers, free forever. Sign up at app.lynxtrac.com if any of this resonates.