Introducing Autoheal, the AI for Production Engineering

How to Write an Incident Report: A Step-by-Step Guide for SRE Teams (May 2026)

Learn how to write incident reports for SRE teams with this step-by-step guide. Cover what's broken, who's affected, and status updates. May 2026.

Every incident commander knows the moment: you've just declared a Sev-1, your team is triaging, and someone from leadership or customer success asks "what's the status?" If you don't have a clear answer ready in the next five minutes, you'll spend the rest of the incident fielding DMs instead of coordinating the fix. The difference between teams that handle incident reports well and teams that don't comes down to one thing: they start writing the moment the incident is declared, not after it's resolved. That live document becomes the source of truth everyone pulls from, and the post-resolution report writes itself from there.

TLDR:

  • Incident reports document what's broken, who's affected, and current status during active outages.

  • Post status updates on a fixed cadence during incidents (every 15 to 60 minutes, depending on severity), even if nothing changed, to prevent stakeholder chaos.

  • Write three versions: engineering gets full technical detail, leadership gets business impact, customers get zero internal service names.

What an Incident Report Is

An incident report is the written record produced during and after a production incident. It captures four things: what is happening, who is affected, what the team is doing about it, and when the next update will arrive.

This isn't a design doc, a changelog, or a Jira ticket. It's a living communication artifact that starts the moment an incident is declared and continues through resolution. In its earliest form, it's a status update for responders and stakeholders. In its final form, it becomes the authoritative account of what went wrong, what was done, and what should change.

If you're writing one for the first time, think of it as two documents stitched together: the real-time log you keep while the incident is active, and the structured report you write once the dust settles.

Incident Report vs Postmortem

People conflate these two documents constantly, and it causes problems. An incident report is written under pressure, while the incident is still active or freshly resolved. A postmortem is written a few days later, with the benefit of hindsight, log analysis, and a 5-Why RCA.

When teams treat them as one document, neither gets done well. The real-time report becomes bloated with root cause speculation. The postmortem loses urgency because "we already wrote it up." Both suffer.

This guide focuses on the incident report. If you're looking for postmortem guidance, that's a separate discipline with its own structure and timeline. Here, we're covering what you write during and immediately after the event.

What a Good Incident Report Contains

Every incident report should answer three questions in its first three lines: what's broken, who's affected, and what's the current status. If a VP opens the doc and can't answer those in five seconds, the report has failed.

Below the fold, include:

  • Incident ID and severity level

  • Incident commander name

  • Start time (UTC)

  • Affected services and customer segments

  • Impact summary in one to two sentences

  • Running timeline of actions taken

  • Next scheduled update time

Status fields matter more than most teams realize. "Investigating" means you don't know the cause yet. "Identified" means you know the cause but haven't acted. "Mitigating" means a fix is in progress. "Monitoring" means the fix is deployed and you're watching. "Resolved" means it's over. Pick one. Don't improvise.
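
One way to enforce "pick one, don't improvise" is to make the status a closed set in whatever tooling posts your updates. The following is a minimal sketch, assuming Python; the names `IncidentStatus` and `parse_status` are hypothetical and not part of any particular incident tool.

```python
from enum import Enum


class IncidentStatus(Enum):
    """The five status values defined above, in the order an incident usually moves through them."""

    INVESTIGATING = "Investigating"  # cause not yet known
    IDENTIFIED = "Identified"        # cause known, no action taken yet
    MITIGATING = "Mitigating"        # fix in progress
    MONITORING = "Monitoring"        # fix deployed, team is watching
    RESOLVED = "Resolved"            # incident is over


def parse_status(text: str) -> IncidentStatus:
    """Return the matching status, or raise ValueError for an improvised value."""
    cleaned = text.strip().lower()
    for status in IncidentStatus:
        if status.value.lower() == cleaned:
            return status
    allowed = ", ".join(s.value for s in IncidentStatus)
    raise ValueError(f"Unknown status {text!r}; pick one of: {allowed}")
```

With this in place, `parse_status("monitoring")` returns `IncidentStatus.MONITORING`, while a free-form value such as "Looking into it" is rejected before it reaches stakeholders.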

Tone discipline is the hardest part. Write in present tense, state facts, and skip speculation. "API latency exceeds 2s for 40% of requests" works. "We think the database might be overloaded" does not.

Step-by-Step: Writing the Live Incident Report

When the alarm fires, your reporting clock starts. The Google SRE incident management guide stresses the importance of structured response protocols. Here's the sequence:

  1. Declare the incident within 5 minutes. A late declaration means a late first update, which means stakeholders start asking questions in the wrong channels.

  2. Name the incident commander. One person owns communication. No exceptions. The incident commander role is responsible for coordinating all stakeholder communication while the technical team focuses on resolution.

  3. Post the first internal update within 10 minutes. Even if it says "investigating, no root cause identified yet," that update matters.

  4. Send the first external update within 15 to 30 minutes, depending on severity and your SLA commitments.

  5. Update on a fixed cadence. Sev-1: every 15 minutes. Sev-2: every 30 minutes. Sev-3: every hour. If there's nothing new to report, say so. Consistency matters more than novelty.

  6. When the fix deploys, distinguish between "fix in progress" and "fix deployed, monitoring." These are different states and stakeholders need to know which one they're in.

  7. Publish the resolution update. Confirm the incident is over, summarize impact, and note when the full report will follow.

The worst updates are the ones that never arrive. A brief "no change, still investigating" post at the scheduled time builds more trust than a detailed update that shows up 20 minutes late.

| Severity Level | Update Cadence | First Internal Update | First External Update | Required Audiences | Communication Scope |
| --- | --- | --- | --- | --- | --- |
| Sev-1 | Every 15 minutes | Within 5 minutes of declaration | Within 15 minutes | Engineering, leadership, customer success, public status page | Full timeline with decision traces, business impact quantified, customer-facing language with zero internal service names |
| Sev-2 | Every 30 minutes | Within 10 minutes of declaration | Within 30 minutes | Engineering, leadership, customer success, status page if customer-facing | Technical details for engineering, impact summary for leadership, selective customer communication |
| Sev-3 | Every 60 minutes | Within 15 minutes of declaration | Within 60 minutes or at resolution | Engineering, leadership notification | Internal engineering details, leadership notification without requiring action, minimal customer communication |
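
The cadence rule is simple enough to automate as a reminder. Below is a minimal sketch, assuming Python and the per-severity cadences listed above; the function names and the "Sev-N" key spelling are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Cadences as listed above: Sev-1 every 15 minutes, Sev-2 every 30, Sev-3 every 60.
UPDATE_CADENCE = {
    "Sev-1": timedelta(minutes=15),
    "Sev-2": timedelta(minutes=30),
    "Sev-3": timedelta(minutes=60),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time the next update is due, counted from the previous update."""
    return last_update + UPDATE_CADENCE[severity]


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True when the scheduled update time has already passed."""
    return now > next_update_due(severity, last_update)


# Example: a Sev-1 update posted at 14:00 UTC makes the next one due at 14:15 UTC.
last = datetime(2026, 5, 4, 14, 0, tzinfo=timezone.utc)
due = next_update_due("Sev-1", last)
```

Keeping every timestamp timezone-aware and in UTC matches the template's "HH:MM UTC" convention and avoids mixing local times in the timeline.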

Step-by-Step: Writing the Post-Resolution Report

Once the incident is resolved, you have 60 minutes to publish the post-resolution report. This is the canonical record that lives in your incident management system and feeds directly into the postmortem days later.

The report needs four things:

  1. An overview: what broke, how long it lasted, who was affected.

  2. A timeline with decision traces: what was done and why, not just when.

  3. A preliminary root cause. Label it "preliminary" explicitly. If you're wrong, nobody's surprised when the postmortem revises the story. If you hide uncertainty now, stakeholders lose trust later.

  4. Follow-up actions with owners and links to relevant tickets, dashboards, or runbooks.

Distribute to predefined audiences within four hours. Engineering, customer success, leadership, and support should all have versions waiting. The longer this document sits in a private channel, the more likely someone fills the vacuum with their own narrative.
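
The four required sections and the two deadlines can be checked mechanically before the report goes out. This is a minimal sketch, assuming Python; `PostResolutionReport`, `problems`, and `deadlines` are hypothetical names, and measuring the four-hour distribution window from the resolution time is an assumption, since the text does not say where that clock starts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

PUBLISH_WITHIN = timedelta(minutes=60)  # publish within 60 minutes of resolution
DISTRIBUTE_WITHIN = timedelta(hours=4)  # assumption: also measured from resolution


@dataclass
class PostResolutionReport:
    overview: str          # what broke, how long it lasted, who was affected
    timeline: list[str]    # entries with decision traces (what and why)
    root_cause: str        # must be labeled "preliminary"
    follow_ups: list[str]  # actions with owners and links

    def problems(self) -> list[str]:
        """Return a list of missing or mislabeled sections; empty means ready."""
        found = []
        if not self.overview.strip():
            found.append("overview is empty")
        if not self.timeline:
            found.append("timeline is empty")
        if "preliminary" not in self.root_cause.lower():
            found.append("root cause is not labeled preliminary")
        if not self.follow_ups:
            found.append("no follow-up actions listed")
        return found


def deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Compute publish and distribution deadlines from the resolution time."""
    return {
        "publish_by": resolved_at + PUBLISH_WITHIN,
        "distribute_by": resolved_at + DISTRIBUTE_WITHIN,
    }
```

The check only verifies that each section exists and that the root cause carries the "preliminary" label; whether the content is accurate remains the incident commander's call.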

The Three Audience Versions

Most teams write one incident report and copy-paste it to every audience. This destroys trust in two directions: leadership gets technical jargon they can't act on, and customers get internal service names they shouldn't see.

You need three versions:

  • Engineering: full technical detail with service names, error rates, deploy IDs, queries run, and decision traces. This is the unfiltered record.

  • Leadership: business impact, customer count affected, duration, estimated revenue impact if known, and follow-up actions in plain language. No stack traces.

  • Customers: what was affected, when it started and ended, what they should do now, and what you're doing to prevent recurrence. Zero internal service names, zero speculation.

The engineering version is your source of truth. The other two are adapted from it, not written from scratch. Get this wrong and you'll spend more time fielding confused Slack messages from your VP than you spent fixing the incident itself.
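
Adapting the customer version from the engineering version is partly a mechanical substitution: internal service names are swapped for their customer-facing equivalents. The sketch below assumes Python and a hand-maintained mapping; the function names and the service names in the example are hypothetical. It handles names only, so stack traces, deploy IDs, and speculation still need a human edit.

```python
import re


def to_customer_text(engineering_text: str, name_map: dict[str, str]) -> str:
    """Replace each internal service name with its customer-facing equivalent."""
    if not name_map:
        return engineering_text
    # Longest names first so a longer name is not partially matched by a shorter one.
    names = sorted(name_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # Single pass: replacement text is never scanned again for internal names.
    return pattern.sub(lambda match: name_map[match.group(0)], engineering_text)


def leaked_names(customer_text: str, name_map: dict[str, str]) -> list[str]:
    """Return internal names that still appear in a customer-facing draft."""
    lowered = customer_text.lower()
    return [name for name in name_map if name.lower() in lowered]


# Hypothetical mapping of internal service names to customer-facing names.
services = {"payments-gw": "checkout", "ledger-svc": "billing"}
draft = to_customer_text("payments-gw returning errors; ledger-svc healthy", services)
# draft == "checkout returning errors; billing healthy"
```

Running `leaked_names` on the final customer draft as a last gate catches names that were typed by hand after the substitution step.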

Common Mistakes That Ruin Incident Reports

These seven mistakes show up repeatedly, and any one of them can undermine an otherwise solid report:

  • Speculating about root cause before mitigation is underway. Stakeholders latch onto your guess, and correcting it later feels like backtracking.

  • Skipping updates when there's nothing new. Silence reads as chaos. Post on cadence, even if the content is "no change."

  • Mixing internal and external language. A customer-facing update that references internal service names erodes confidence instantly.

  • Burying impact below fix details. Your audience needs "what's broken" before "what we're doing."

  • Forgetting to include the next update time. Without it, every recipient invents their own follow-up cadence in your DMs.

  • Closing the report without naming uncertainty. If the preliminary root cause might change, say so. Hidden ambiguity becomes a trust problem.

  • Treating the incident report as the postmortem. The report captures what happened. The postmortem captures why. Collapsing them guarantees neither gets the depth it needs.
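
Two of these mistakes, speculation and a missing next update time, can be flagged with a simple text check before an update is posted. This is a rough sketch, assuming Python; `lint_update` is a hypothetical helper and the list of speculative phrases is illustrative, not exhaustive.

```python
import re

# Illustrative phrases that usually signal speculation; extend for your team.
SPECULATION_MARKERS = ("we think", "might", "maybe", "probably", "possibly")


def lint_update(text: str) -> list[str]:
    """Return warnings for speculative wording and a missing next-update line."""
    warnings = []
    for marker in SPECULATION_MARKERS:
        # Word boundaries keep "might" from matching inside "mighty".
        if re.search(rf"\b{re.escape(marker)}\b", text, re.IGNORECASE):
            warnings.append(f"speculative wording: {marker!r}")
    # Only checks that the phrase is present, not that a valid time follows it.
    if not re.search(r"\bnext update\b", text, re.IGNORECASE):
        warnings.append("no next update time stated")
    return warnings
```

Applied to the two examples from the tone guidance, "We think the database might be overloaded" produces three warnings, while "API latency exceeds 2s for 40% of requests. Next update: 14:15 UTC." produces none.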

How AI Agents Auto-Generate Incident Reports

Everything described in the previous sections is exactly what Autoheal's agents already do during an active incident.

Because Autoheal runs incident orchestration natively inside Slack and Teams, the agents have full visibility into every human decision trace: who triaged what, which hypotheses were rejected, and which fix was approved. The Production Context Graph maps affected services, recent deploys, and customer cohorts in real time, so the agent already knows the blast radius before anyone asks.

During the incident, agents draft internal status updates at your defined cadence and queue them for the incident commander's review. At resolution, they assemble the consolidated report with timeline, impact summary, and preliminary root cause pulled directly from the investigation. The commander still owns the report. The agent removes the writing burden so they can focus on running the response.

Incident Report Template

Copy this into your incident channel or doc and fill in the blanks as the incident progresses.

INCIDENT REPORT (LIVE)

Incident ID:        [INC-XXXX]
Title:              [Short description of the issue]
Severity:           [Sev-1 / Sev-2 / Sev-3]
Status:             [Investigating / Identified / Mitigating / Monitoring / Resolved]
Incident Commander: [Name]
Start Time:         [YYYY-MM-DD HH:MM UTC]

CURRENT IMPACT
[One to two sentences: what is broken, how badly, and for whom.]

AFFECTED SERVICES
- [Internal service name] -> [Customer-facing equivalent]
- [Internal service name] -> [Customer-facing equivalent]

AFFECTED CUSTOMERS
[Segment, region, or count. Be specific.]

TIMELINE
[HH:MM UTC] - [Event or action taken]
[HH:MM UTC] - [Event or action taken]
[HH:MM UTC] - [Event or action taken]

CURRENT ACTIONS IN PROGRESS
- [What is being done right now, and by whom]

NEXT UPDATE
[HH:MM UTC]

WORKAROUNDS
[Any temporary steps customers or internal teams can take, or "None identified."]
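
If you would rather fill the template from structured data than by hand, the same layout can be rendered from a small record. The following is a minimal sketch, assuming Python; the `LiveReport` class and its field names are hypothetical, and the "->" separator between internal and customer-facing names is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class LiveReport:
    incident_id: str
    title: str
    severity: str
    status: str
    commander: str
    start_time: str   # "YYYY-MM-DD HH:MM UTC"
    impact: str
    next_update: str  # "HH:MM UTC"
    services: dict[str, str] = field(default_factory=dict)  # internal -> customer-facing
    customers: str = ""
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    actions: list[str] = field(default_factory=list)
    workarounds: str = "None identified."

    def render(self) -> str:
        """Render the report in the same section order as the template above."""
        header = [
            ("Incident ID:", self.incident_id),
            ("Title:", self.title),
            ("Severity:", self.severity),
            ("Status:", self.status),
            ("Incident Commander:", self.commander),
            ("Start Time:", self.start_time),
        ]
        lines = ["INCIDENT REPORT (LIVE)", ""]
        # Pad labels to 20 characters so the values line up in one column.
        lines += [f"{label:<20}{value}" for label, value in header]
        lines += ["", "CURRENT IMPACT", self.impact]
        lines += ["", "AFFECTED SERVICES"]
        lines += [f"- {internal} -> {public}" for internal, public in self.services.items()]
        lines += ["", "AFFECTED CUSTOMERS", self.customers]
        lines += ["", "TIMELINE"]
        lines += [f"{when} - {event}" for when, event in self.timeline]
        lines += ["", "CURRENT ACTIONS IN PROGRESS"]
        lines += [f"- {action}" for action in self.actions]
        lines += ["", "NEXT UPDATE", self.next_update]
        lines += ["", "WORKAROUNDS", self.workarounds]
        return "\n".join(lines)
```

Storing the fields separately and rendering on demand also keeps each report available as structured data after the incident closes.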

The postmortem template is a separate document with its own structure and cadence.

Where Incident Reporting Is Heading

Incident reports are shifting from documents humans write under pressure to artifacts agents draft and humans approve. The drafting becomes supervised; the skill that survives is judgment about what to communicate, to whom, and when.

Teams capturing every incident report as structured data today are building the institutional memory their agents will run on tomorrow. Every well-written report makes the next one easier, the postmortem more accurate, and the next incident faster to resolve. That compounding effect is the whole point.

If you want to see how that looks in practice, book a demo and watch Autoheal draft one from a live investigation.

Final Thoughts on Structuring Production Incident Reports

Getting your incident report template right matters more than most teams realize. Clear structure, disciplined cadence, and audience-specific versions build stakeholder trust during chaos and create institutional memory that compounds across every future incident. The skill that survives automation is judgment about what to communicate, to whom, and when.

FAQ

How do you write an incident report at work?

Write the live report first, then the post-resolution report. During the incident, post your first internal update within 10 minutes: declare the incident, name the incident commander, and state the current status (investigating, identified, mitigating, monitoring, or resolved). Update on a fixed cadence based on severity: Sev-1 every 15 minutes, Sev-2 every 30 minutes, Sev-3 every hour. After resolution, publish the full report within 60 minutes, covering what broke, how long it lasted, a timeline with decision traces, a preliminary root cause labeled as such, and follow-up actions with owners.

Incident report vs postmortem: what's the difference?

An incident report is written during and immediately after the incident while under pressure, capturing real-time status and initial findings. A postmortem is written a few days later with full analysis, log data, and a complete 5-Why RCA. Treating them as one document means neither gets done well: the real-time report becomes bloated with speculation, and the postmortem loses urgency because teams think they already documented it.

Can AI agents write incident reports automatically?

Yes, but with human approval. Autoheal's agents draft internal status updates at your defined cadence and assemble the consolidated post-resolution report with timeline, impact summary, and preliminary root cause pulled directly from the investigation. The incident commander still owns the report and reviews every update before it ships. The agents remove the writing burden so commanders can focus on running the response instead of formatting updates under pressure.

How do you write an incident report for different audiences?

Create three versions from a single engineering source of truth. The engineering version includes full technical detail with service names, error rates, deploy IDs, and decision traces. The leadership version covers business impact, customer count affected, duration, and follow-up actions in plain language with zero stack traces. The customer version explains what was affected, when it started and ended, what they should do now, and prevention steps, with zero internal service names and zero speculation.

What's the biggest mistake when writing incident reports?

Skipping scheduled updates when there's nothing new to report. Silence reads as chaos to stakeholders, and they'll start asking questions in the wrong channels. Post on your defined cadence even if the update says "no change, still investigating." Consistency builds more trust than a detailed update that arrives late.