First 30 minutes of an IT incident: what great teams do

The first 30 minutes of an IT incident largely determine its MTTR (mean time to resolution). Great teams spend those minutes in a consistent sequence. This post covers what we see them do at each stage, plus the anti-patterns that make incidents worse.

Minute 0-2: triage

The pager fires. Great teams:

Acknowledge in under 60 seconds
Open the affected dashboards before opening the chat
Read the alert details twice before posting

Anti-pattern: opening Slack first. Every second you spend asking “is anyone else seeing this?” is a second not spent figuring out what’s happening.

Minute 2-5: confirm and assess

Great teams:

Confirm the alert is real (not a monitor flap)
Assess user impact (is this visible to customers yet?)
Declare an incident with a severity tier

Anti-pattern: skipping the severity declaration. Without one, nobody knows whether this is a 3-person war room or an “I’ll handle it.”
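
As a rough illustration, the confirm, assess, and declare steps can be written as one small decision function. This is a minimal sketch, not a standard: the three-consecutive-checks rule, the `Assessment` fields, and the SEV1 to SEV3 tier definitions are all assumptions made up for the example, and your team's tiers will differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    # Hypothetical tiers; substitute your own definitions.
    SEV1 = "customer-facing outage"
    SEV2 = "customer-facing degradation"
    SEV3 = "internal impact only"


@dataclass
class Assessment:
    breaching_samples: List[bool]  # recent monitor checks, oldest first
    customers_affected: bool
    service_down: bool


def is_real(samples: List[bool], required: int = 3) -> bool:
    """Treat the alert as real only if the last `required` checks all breached."""
    return len(samples) >= required and all(samples[-required:])


def declare(assessment: Assessment) -> Optional[Severity]:
    """Return a severity tier, or None when the alert looks like a monitor flap."""
    if not is_real(assessment.breaching_samples):
        return None
    if assessment.customers_affected and assessment.service_down:
        return Severity.SEV1
    if assessment.customers_affected:
        return Severity.SEV2
    return Severity.SEV3
```

The point of the sketch is that the function always returns an explicit answer: either "this is a flap" or a named tier that tells everyone how many people the incident needs.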

Minute 5-10: context gathering

Great teams:

Pull the last hour of relevant metrics
Check recent deploys (changes are the most common cause of incidents)
Scan the affected service’s logs for the first minute of the spike
Identify likely subsystems

Anti-pattern: going deep on one hypothesis without considering others. Confirmation bias eats hours.
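
The "check recent deploys" step can be sketched as a simple time-window filter. The `Deploy` record and the one-hour lookback here are assumptions for illustration; in practice the list would come from whatever deploy log or CI system you use. Returning every candidate, rather than only the most recent one, is a small guard against locking onto a single hypothesis.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deploy(NamedTuple):
    # Hypothetical record shape; adapt to your deploy log.
    service: str
    finished_at: datetime


def suspect_deploys(
    deploys: List[Deploy],
    spike_start: datetime,
    lookback: timedelta = timedelta(hours=1),
) -> List[Deploy]:
    """Return deploys that finished within `lookback` before the spike, newest first."""
    window_start = spike_start - lookback
    candidates = [
        d for d in deploys if window_start <= d.finished_at <= spike_start
    ]
    return sorted(candidates, key=lambda d: d.finished_at, reverse=True)
```

Deploys that finished after the spike started are excluded, since they cannot have caused it.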

Minute 10-20: hypothesis formation and first action

Great teams:

State a hypothesis explicitly
Identify the smallest safe action that would confirm or falsify it
Take that action and observe

Anti-pattern: “let’s just restart it.” Restarting without a hypothesis is how you lose the state needed to understand the root cause.

Minute 20-30: first mitigation

Great teams:

Apply a mitigation (not necessarily the root-cause fix)
Monitor for 5 minutes before claiming recovery
Keep the incident open even after mitigation; root cause is still pending

Anti-pattern: closing the incident at mitigation time. Once the incident is closed, the root-cause work and the post-mortem tend to get dropped.
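
"Monitor for 5 minutes before claiming recovery" can be made mechanical. The sketch below assumes you have a list of `(timestamp_seconds, error_rate)` readings taken since the mitigation was applied; the function name, the error-rate metric, and the threshold are placeholders rather than part of any particular monitoring tool.

```python
from typing import List, Tuple


def recovered(
    samples: List[Tuple[float, float]],
    threshold: float,
    hold_seconds: float = 300,
) -> bool:
    """Decide whether a mitigation has held long enough to claim recovery.

    `samples` are (timestamp_seconds, error_rate) readings taken after the
    mitigation, oldest first. Returns True only if the readings span at least
    `hold_seconds` and every reading in the trailing window is at or below
    `threshold`.
    """
    if not samples:
        return False
    first_ts = samples[0][0]
    latest_ts = samples[-1][0]
    if latest_ts - first_ts < hold_seconds:
        return False  # not enough observation time yet
    window = [value for ts, value in samples if ts >= latest_ts - hold_seconds]
    return all(value <= threshold for value in window)
```

A single bad reading inside the window resets the claim, which is the behavior you want: a mitigation that holds for four minutes and then relapses has not recovered anything.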

What great teams don’t do

They don’t have 12 people on the call. They have 3-5 people with tightly scoped roles.
They don’t speculate in the ticket. They state observations.
They don’t skip the post-mortem because “we fixed it.”
They don’t rely on the same hero every time. Rotating who leads is how the whole team gets practice.

The rotation practice

The best teams run incident rotations where juniors lead under a senior’s supervision. The senior is there to stop a catastrophe, not to take over. This is how you scale incident response capability across the team, instead of relying on one or two firefighters.

The platform piece

A tool that removes “get to the affected system” from minutes 5-10 compresses the timeline considerably. When access is ambient (click the alert, get a shell), context gathering can collapse from 10 minutes to 1.

LynxTrac is free forever for up to 2 servers, no card required. If you want to try it on real infrastructure instead of reading about it: app.lynxtrac.com.
