The first 30 minutes of an IT incident decide the MTTR (mean time to resolution). Great teams spend those minutes in a specific shape. This post walks through that shape window by window, plus the anti-patterns that make incidents worse.
Minute 0-2: triage
The pager fires. Great teams:
- Acknowledge in under 60 seconds
- Open the affected dashboards before opening the chat
- Read the alert details twice before posting
Anti-pattern: opening Slack first. Every second you spend asking “is anyone else seeing this?” is a second not spent figuring out what’s happening.
Minute 2-5: confirm and assess
Great teams:
- Confirm the alert is real (not a monitor flap)
- Assess user impact (is this visible to customers yet?)
- Declare an incident with a severity tier
Anti-pattern: skipping the severity declaration. Without one, nobody knows whether this is a 3-person war room or an “I’ll handle it.”
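The confirm-and-assess step can be sketched in a few lines. This is a minimal illustration, not a standard: the function names, the SEV1/SEV2/SEV3 tier labels, the transition limit, and the 25% cut-off are all assumptions you would replace with your own policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assessment:
    is_real: bool
    severity: Optional[str]  # None when the alert looks like a flap


def is_flapping(failing: List[bool], max_transitions: int = 3) -> bool:
    """True when the monitor changed state more than max_transitions times."""
    transitions = sum(1 for prev, cur in zip(failing, failing[1:]) if prev != cur)
    return transitions > max_transitions


def assess(failing: List[bool], customer_visible: bool,
           fraction_affected: float) -> Assessment:
    """Confirm the alert, then map user impact to a severity tier."""
    # "Real" here means: currently failing and not bouncing between states.
    if not failing or not failing[-1] or is_flapping(failing):
        return Assessment(is_real=False, severity=None)
    # Tier names and the 25% cut-off are hypothetical, not an industry rule.
    if customer_visible and fraction_affected >= 0.25:
        return Assessment(True, "SEV1")
    if customer_visible:
        return Assessment(True, "SEV2")
    return Assessment(True, "SEV3")
```

Here `failing` is the recent history of check results, oldest first, with `True` meaning the check failed. The point of returning a tier every time the alert is real is that the declaration never gets skipped.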
Minute 5-10: context gathering
Great teams:
- Pull the last hour of relevant metrics
- Check recent deploys (the #1 cause of incidents is a change)
- Scan the affected service’s logs for the first minute of the spike
- Identify likely subsystems
Anti-pattern: going deep on one hypothesis without considering others. Confirmation bias eats hours.
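Two of the context-gathering steps, finding the first minute of the spike and checking recent deploys, lend themselves to a small sketch. It assumes you already have a metric as `(timestamp, value)` pairs and a deploy log as `(timestamp, name)` pairs; both shapes, the function names, and the one-hour lookback default are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def spike_start(series: List[Tuple[datetime, float]],
                threshold: float) -> Optional[datetime]:
    """Timestamp of the first sample above threshold, or None if there is none."""
    for ts, value in series:
        if value > threshold:
            return ts
    return None


def deploys_before(deploys: List[Tuple[datetime, str]], onset: datetime,
                   lookback: timedelta = timedelta(hours=1)) -> List[Tuple[datetime, str]]:
    """Deploys that landed in the lookback window before onset, newest first."""
    window_start = onset - lookback
    hits = [(ts, name) for ts, name in deploys if window_start <= ts <= onset]
    return sorted(hits, reverse=True)
```

Listing every deploy in the window, rather than stopping at the most recent one, is deliberate: it keeps more than one candidate on the table and works against the confirmation bias described above.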
Minute 10-20: hypothesis formation and first action
Great teams:
- State a hypothesis explicitly
- Identify the smallest safe action that would confirm or falsify it
- Take that action and observe
Anti-pattern: “let’s just restart it.” Restarting without a hypothesis is how you lose the state needed to understand the root cause.
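One way to force the hypothesis to be explicit is to make it a record that cannot exist without a claim and a check. The sketch below is a hypothetical structure, not a feature of any particular incident tool; `check` is assumed to be a read-only probe that returns `True` when the observation supports the claim.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    claim: str                       # what you think is wrong, in one sentence
    action: str                      # the smallest safe action, in words
    check: Callable[[], bool]        # read-only probe; True supports the claim
    supported: Optional[bool] = None  # None until the check has run


def run_check(h: Hypothesis) -> Hypothesis:
    """Take the action and record what was observed."""
    h.supported = h.check()
    return h


def timeline_entry(h: Hypothesis) -> str:
    """One line for the incident ticket: claim, action, result."""
    verdict = {True: "supported", False: "falsified", None: "untested"}[h.supported]
    return f"Hypothesis: {h.claim} | Action: {h.action} | Result: {verdict}"
```

A restart has no place in this structure unless someone can say which claim it tests, which is the point.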
Minute 20-30: first mitigation
Great teams:
- Apply a mitigation (not necessarily the root-cause fix)
- Monitor for 5 minutes before claiming recovery
- Keep the incident open even after mitigation; root cause is still pending
Anti-pattern: closing the incident at mitigation time. Premature closure means the post-mortem never happens.
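The “monitor for 5 minutes before claiming recovery” rule is easy to encode. A minimal sketch, assuming the health signal is a list of `(timestamp, value)` samples where lower is better; the function name and the threshold semantics are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def recovery_confirmed(samples: List[Tuple[datetime, float]], threshold: float,
                       mitigated_at: datetime, now: datetime,
                       hold: timedelta = timedelta(minutes=5)) -> bool:
    """True only when the whole hold window after mitigation is clean."""
    window_start = now - hold
    # The hold window must sit entirely after the mitigation was applied.
    if window_start < mitigated_at:
        return False
    window = [value for ts, value in samples if window_start <= ts <= now]
    # No data is not the same as healthy: require at least one sample.
    return bool(window) and all(value <= threshold for value in window)
```

A `True` result here only supports saying “mitigated.” It says nothing about root cause, so the incident stays open.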
What great teams don’t do
- They don’t have 12 people on the call. They have 3-5 people with tightly scoped roles.
- They don’t speculate in the ticket. They state observations.
- They don’t skip the post-mortem because “we fixed it.”
- They don’t lean on the same hero every time. The team only gets better if everyone gets practice.
The rotation practice
The best teams run incident rotations where juniors lead under a senior’s supervision. The senior is there to stop a catastrophe, not to take over. This is how you scale incident response capability across the team, instead of relying on one or two firefighters.
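A rotation like this can be generated rather than negotiated. The sketch below is one possible scheme, assuming weekly shifts and two separate name lists; the function name and the `lead`/`shadow` field names are made up for illustration.

```python
from itertools import cycle, islice
from typing import Dict, List, Union


def build_rotation(juniors: List[str], seniors: List[str],
                   weeks: int) -> List[Dict[str, Union[int, str]]]:
    """Pair each week's junior lead with a senior shadow, cycling both lists."""
    if not juniors or not seniors:
        raise ValueError("need at least one junior and one senior")
    pairs = zip(cycle(juniors), cycle(seniors))
    return [
        {"week": i + 1, "lead": junior, "shadow": senior}
        for i, (junior, senior) in enumerate(islice(pairs, weeks))
    ]
```

When the two lists have different lengths, the pairings shift from cycle to cycle, so juniors end up working with more than one senior over time.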
The platform piece
A tool that removes “get to the affected system” from minutes 5-10 compresses the timeline considerably. When access is ambient (click the alert, get a shell), context gathering can collapse from as much as 10 minutes to about 1.
LynxTrac is free forever for up to 2 servers, no card required. If you want to try it on real infrastructure instead of reading about it: app.lynxtrac.com.