How long should an incident report be?

A live incident report should be as short as possible while answering three questions: what's broken, who's affected, and what's the current status. The post-resolution report typically runs one to two pages covering overview, timeline, preliminary root cause, and follow-up actions. If stakeholders can't extract the key facts in under 30 seconds, the report is too long or poorly structured.

What's the difference between a workplace incident report and a production incident report for SRE teams?

A workplace incident report documents physical safety events like injuries or equipment damage for HR and compliance purposes. A production incident report documents service outages, performance degradations, and system failures for engineering teams. Both follow structured formats, but production reports focus on technical root cause, system impact, and preventive fixes rather than injury documentation.

Simple incident report sample vs detailed: which format should I use?

Use a simple format during the active incident with status, impact, timeline, and next update time. Switch to a detailed format for the post-resolution report that adds preliminary root cause, evidence summary, and follow-up actions with owners. The simple version keeps stakeholders informed under pressure; the detailed version becomes the institutional record.

Can you write an incident report template that works for both engineering and leadership?

No, one template for both audiences fails. Create separate versions from a single engineering source of truth: the engineering template includes service names, error rates, and decision traces, while the leadership template translates impact into business terms with customer count affected and revenue implications. Trying to serve both with one document means neither audience gets what they need.

Incident report vs runbook: what's the relationship?

An incident report documents what happened during a specific outage. A runbook documents the repeatable procedure for investigating or fixing a known failure pattern. Well-written incident reports with decision traces become the source material for generating and updating runbooks, so each resolved incident improves the team's procedural knowledge for next time.

How do you write an effective incident report when the root cause is still unknown?

Explicitly label your root cause section "preliminary" and separate what you know from what you're still investigating. Post the report with the incomplete sections clearly marked rather than waiting for perfect information. Stakeholders trust transparency about uncertainty more than delayed updates that claim false confidence.

What should you never include in a customer-facing incident report?

Never include internal service names, infrastructure details, employee names, speculation about root cause, or blame language. Customers need to know what was affected, when it started and ended, what they should do now, and what you're doing to prevent recurrence. Everything else is internal context that erodes customer confidence when exposed.

Free workplace incident report sample PDF vs building your own: which is better?

Build your own template tailored to your specific services, stakeholder needs, and communication cadence rather than using a generic PDF. Generic templates force you into someone else's structure and miss critical fields your team needs like affected service mappings, customer segment impact, and integration with your incident management system.

How often should you update an incident report during an active Sev-1?

Update every 15 minutes during a Sev-1, even if nothing has changed. Post "no change, still investigating" at the scheduled time rather than skipping updates. Silence during an active incident reads as chaos to stakeholders, and they'll flood your DMs asking for status instead of letting you focus on resolution.

What's the best way to learn how to write incident reports if you've never done it before?

Start by reading your team's past incident reports and postmortems to understand the expected structure and level of detail. Shadow an experienced incident commander during a live incident to see how they draft updates under pressure. Practice writing status updates for non-critical incidents first before you're responsible for a Sev-1 communication stream.


How to Write an Incident Report: A Step-by-Step Guide for SRE Teams (May 2026)

Learn how to write incident reports for SRE teams with this step-by-step guide. Cover what's broken, who's affected, and status updates. May 2026.

May 11, 2026

Every incident commander knows the moment: you've just declared a Sev-1, your team is triaging, and someone from leadership or customer success asks "what's the status?" If you don't have a clear answer ready in the next five minutes, you'll spend the rest of the incident fielding DMs instead of coordinating the fix. The difference between teams that write incident reports well and teams that don't comes down to one thing: they start writing the moment the incident is declared, not after it's resolved. That live document becomes the source of truth everyone pulls from, and the post-resolution report largely writes itself from there.

TLDR:

Incident reports document what's broken, who's affected, and current status during active outages.
Post status updates every 15-30 minutes during incidents, even if nothing changed, to prevent stakeholder chaos.
Write three versions: engineering gets full technical detail, leadership gets business impact, customers get zero internal service names.

What an Incident Report Is

An incident report is the written record produced during and after a production incident. It captures four things: what is happening, who is affected, what the team is doing about it, and when the next update will arrive.

This isn't a design doc, a changelog, or a Jira ticket. It's a living communication artifact that starts the moment an incident is declared and continues through resolution. In its earliest form, it's a status update for responders and stakeholders. In its final form, it becomes the authoritative account of what went wrong, what was done, and what should change.

If you're writing one for the first time, think of it as two documents stitched together: the real-time log you keep while the incident is active, and the structured report you write once the dust settles.

Incident Report vs Postmortem

People conflate these two documents constantly, and it causes problems. An incident report is written under pressure, while the incident is still active or freshly resolved. A postmortem is written a few days later, with the benefit of hindsight, log analysis, and a 5 Whys root cause analysis (RCA).

When teams treat them as one document, neither gets done well. The real-time report becomes bloated with root cause speculation. The postmortem loses urgency because "we already wrote it up." Both suffer.

This guide focuses on the incident report. If you're looking for postmortem guidance, that's a separate discipline with its own structure and timeline. Here, we're covering what you write during and immediately after the event.

What a Good Incident Report Contains

Every incident report should answer three questions in its first three lines: what's broken, who's affected, and what's the current status. If a VP opens the doc and can't answer those in five seconds, the report has failed.

Below the fold, include:

Incident ID and severity level
Incident commander name
Start time (UTC)
Affected services and customer segments
Impact summary in one to two sentences
Running timeline of actions taken
Next scheduled update time

Status fields matter more than most teams realize. "Investigating" means you don't know the cause yet. "Identified" means you know the cause but haven't acted. "Mitigating" means a fix is in progress. "Monitoring" means the fix is deployed and you're watching. "Resolved" means it's over. Pick one. Don't improvise.
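
One way to enforce "pick one, don't improvise" is to make the status a closed set in whatever tooling posts your updates. The sketch below is illustrative only: the `Status` enum and `parse_status` helper are hypothetical names, not part of any particular incident tool, and the comments restate the definitions above.

```python
from enum import Enum


class Status(Enum):
    """The five status values described above, in their usual order."""
    INVESTIGATING = "Investigating"  # cause not yet known
    IDENTIFIED = "Identified"        # cause known, no action taken yet
    MITIGATING = "Mitigating"        # fix in progress
    MONITORING = "Monitoring"        # fix deployed, being watched
    RESOLVED = "Resolved"            # incident is over


def parse_status(text: str) -> Status:
    """Return the matching Status, or raise ValueError for improvised labels."""
    cleaned = text.strip().lower()
    for status in Status:
        if status.value.lower() == cleaned:
            return status
    allowed = ", ".join(s.value for s in Status)
    raise ValueError(f"Unknown status {text!r}; use one of: {allowed}")
```

Rejecting anything outside the five values means a label like "mostly fixed" fails loudly before it reaches stakeholders.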

Tone discipline is the hardest part. Write in present tense, state facts, and skip speculation. "API latency exceeds 2s for 40% of requests" works. "We think the database might be overloaded" does not.

Step-by-Step: Writing the Live Incident Report

When the alarm fires, your reporting clock starts. The Google SRE incident management guide stresses the importance of structured response protocols. Here's the sequence:

Declare the incident within 5 minutes of the alarm firing. A late declaration means a late first update, which means stakeholders start asking questions in the wrong channels.
Name the incident commander. One person owns communication. No exceptions. The incident commander role is responsible for coordinating all stakeholder communication while the technical team focuses on resolution.
Post the first internal update within 10 minutes. Even if it says "investigating, no root cause identified yet," that update matters.
Send the first external update within 15 to 30 minutes, depending on severity and your SLA commitments.
Update on a fixed cadence. Sev-1: every 15 minutes. Sev-2: every 30 minutes. Sev-3: every hour. If there's nothing new to report, say so. Consistency matters more than novelty.
When the fix deploys, distinguish between "fix in progress" and "fix deployed, monitoring." These are different states and stakeholders need to know which one they're in.
Publish the resolution update. Confirm the incident is over, summarize impact, and note when the full report will follow.

The worst updates are the ones that never arrive. A brief "no change, still investigating" post at the scheduled time builds more trust than a detailed update that shows up 20 minutes late.

Severity Level	Update Cadence	First Internal Update	First External Update	Required Audiences	Communication Scope
Sev-1	Every 15 minutes	Within 5 minutes of declaration	Within 15 minutes	Engineering, leadership, customer success, public status page	Full timeline with decision traces, business impact quantified, customer-facing language with zero internal service names
Sev-2	Every 30 minutes	Within 10 minutes of declaration	Within 30 minutes	Engineering, leadership, customer success, status page if customer-facing	Technical details for engineering, impact summary for leadership, selective customer communication
Sev-3	Every 60 minutes	Within 15 minutes of declaration	Within 60 minutes or at resolution	Engineering, leadership notification	Internal engineering details, leadership notification without requiring action, minimal customer communication
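
The cadence column of the table can be turned into a small helper that tells the incident commander when the next update is due. This is a minimal sketch: the intervals come from the table above, while the function names and the dictionary layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Minutes between updates, taken from the cadence table above.
UPDATE_CADENCE_MINUTES = {"Sev-1": 15, "Sev-2": 30, "Sev-3": 60}


def next_update_time(severity: str, last_update: datetime) -> datetime:
    """Return when the next scheduled update is due for this severity."""
    return last_update + timedelta(minutes=UPDATE_CADENCE_MINUTES[severity])


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True if the scheduled update time has already passed."""
    return now > next_update_time(severity, last_update)


# Example: a Sev-1 update posted at 14:00 UTC is due again at 14:15 UTC.
last = datetime(2026, 5, 11, 14, 0, tzinfo=timezone.utc)
due = next_update_time("Sev-1", last)
```

Computing the due time from the last post, not from the last change in the investigation, matches the rule that an update goes out on schedule even when there is nothing new to say.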

Step-by-Step: Writing the Post-Resolution Report

Once the incident is resolved, you have 60 minutes to publish the post-resolution report. This is the canonical record that lives in your incident management system and feeds directly into the postmortem days later.

The report needs four things:

An overview: what broke, how long it lasted, who was affected.
A timeline with decision traces showing what was done and why, not just when.
A preliminary root cause. Label it "preliminary" explicitly. If you're wrong, nobody's surprised when the postmortem revises the story. If you hide uncertainty now, stakeholders lose trust later.
Follow-up actions with owners and links to relevant tickets, dashboards, or runbooks.
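
If your reports are stored as structured data, the four required parts can be checked before publishing. The sketch below assumes a hypothetical `PostResolutionReport` shape; the field names and the rule that the root cause text starts with the word "preliminary" are illustrative choices, not a standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FollowUp:
    action: str
    owner: str
    link: str = ""  # ticket, dashboard, or runbook reference


@dataclass
class PostResolutionReport:
    overview: str
    timeline: list[str] = field(default_factory=list)
    preliminary_root_cause: str = ""
    follow_ups: list[FollowUp] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return a list of reasons the report is not ready to publish."""
        issues = []
        if not self.overview.strip():
            issues.append("overview is empty")
        if not self.timeline:
            issues.append("timeline is empty")
        if not self.preliminary_root_cause.lower().startswith("preliminary"):
            issues.append('root cause is not labeled "preliminary"')
        if not self.follow_ups:
            issues.append("no follow-up actions")
        for item in self.follow_ups:
            if not item.owner.strip():
                issues.append(f"follow-up has no owner: {item.action}")
        return issues
```

An empty list from `problems()` means the structural requirements are met; it says nothing about whether the content is accurate, which still needs the commander's review.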

Distribute to predefined audiences within four hours. Engineering, customer success, leadership, and support should all have versions waiting. The longer this document sits in a private channel, the more likely someone fills the vacuum with their own narrative.

The Three Audience Versions

Most teams write one incident report and copy-paste it to every audience. This destroys trust in two directions: leadership gets technical jargon they can't act on, and customers get internal service names they shouldn't see.

You need three versions:

Engineering: full technical detail with service names, error rates, deploy IDs, queries run, and decision traces. This is the unfiltered record.
Leadership: business impact, customer count affected, duration, estimated revenue impact if known, and follow-up actions in plain language. No stack traces.
Customers: what was affected, when it started and ended, what they should do now, and what you're doing to prevent recurrence. Zero internal service names, zero speculation.

The engineering version is your source of truth. The other two are adapted from it, not written from scratch. Get this wrong and you'll spend more time fielding confused Slack messages from your VP than you spent fixing the incident itself.
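
Adapting the customer version from the engineering one includes a mechanical step: swapping internal service names for their customer-facing equivalents. The sketch below shows only that step, under the assumption that you maintain a mapping like the one in the template's AFFECTED SERVICES section. The service names are made up, and the function does not remove speculation or jargon, which still needs a human edit.

```python
import re


def to_customer_text(engineering_text: str, service_names: dict[str, str]) -> str:
    """Replace internal service names with customer-facing equivalents."""
    if not service_names:
        return engineering_text
    # Longest names first so "auth-svc-v2" is matched before "auth-svc".
    ordered = sorted(service_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    # A single pass avoids re-replacing text produced by an earlier substitution.
    return pattern.sub(lambda m: service_names[m.group(0)], engineering_text)


def leaked_names(text: str, service_names: dict[str, str]) -> list[str]:
    """Return internal names that still appear in a customer-facing draft."""
    return [name for name in service_names if name in text]


# Hypothetical mapping: internal name -> customer-facing name.
names = {"auth-svc": "Login", "cart-api-v2": "Checkout"}
draft = to_customer_text("auth-svc and cart-api-v2 are returning errors", names)
```

Running `leaked_names` on the final draft is a cheap last check before anything goes to the status page.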

Common Mistakes That Ruin Incident Reports

These seven mistakes show up repeatedly, and any one of them can undermine an otherwise solid report:

Speculating about root cause before mitigation is underway. Stakeholders latch onto your guess, and correcting it later feels like backtracking.
Skipping updates when there's nothing new. Silence reads as chaos. Post on cadence, even if the content is "no change."
Mixing internal and external language. A customer-facing update that references internal service names erodes confidence instantly.
Burying impact below fix details. Your audience needs "what's broken" before "what we're doing."
Forgetting to include the next update time. Without it, every recipient invents their own follow-up cadence in your DMs.
Closing the report without naming uncertainty. If the preliminary root cause might change, say so. Hidden ambiguity becomes a trust problem.
Treating the incident report as the postmortem. The report captures what happened. The postmortem captures why. Collapsing them guarantees neither gets the depth it needs.
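
Two of these mistakes, speculative wording and a missing next update time, are easy to catch mechanically before an update is posted. The check below is a rough sketch: the phrase list and the "next update" convention are assumptions, and a simple word match will produce some false positives.

```python
import re

# Illustrative list of hedging phrases; tune it to your team's vocabulary.
SPECULATION = ("we think", "might", "maybe", "probably", "possibly")


def lint_update(text: str) -> list[str]:
    """Return warnings for a draft status update."""
    warnings = []
    lowered = text.lower()
    for phrase in SPECULATION:
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            warnings.append(f"speculative wording: {phrase!r}")
    if "next update" not in lowered:
        warnings.append("no next update time")
    return warnings
```

The two example sentences from the tone section behave as expected: "API latency exceeds 2s for 40% of requests. Next update: 14:30 UTC." passes cleanly, while "We think the database might be overloaded" is flagged for both hedging phrases and for the missing update time.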

How AI Agents Auto-Generate Incident Reports

Everything described in the previous sections is exactly what Autoheal's agents already do during an active incident.

Because Autoheal runs incident orchestration natively inside Slack and Teams, the agents have full visibility into every human decision trace: who triaged what, which hypotheses were rejected, and which fix was approved. The Production Context Graph maps affected services, recent deploys, and customer cohorts in real time, so the agent already knows the blast radius before anyone asks.

During the incident, agents draft internal status updates at your defined cadence and queue them for the incident commander's review. At resolution, they assemble the consolidated report with timeline, impact summary, and preliminary root cause pulled directly from the investigation. The commander still owns the report. The agent removes the writing burden so they can focus on running the response.

Incident Report Template

Copy this into your incident channel or doc and fill in the blanks as the incident progresses.

INCIDENT REPORT (LIVE)

Incident ID:       [INC-XXXX]
Title:              [Short description of the issue]
Severity:           [Sev-1 / Sev-2 / Sev-3]
Status:             [Investigating / Identified / Mitigating / Monitoring / Resolved]
Incident Commander: [Name]
Start Time:         [YYYY-MM-DD HH:MM UTC]

CURRENT IMPACT
[One to two sentences: what is broken, how badly, and for whom.]

AFFECTED SERVICES
- [Internal service name] → [Customer-facing equivalent]
- [Internal service name] → [Customer-facing equivalent]

AFFECTED CUSTOMERS
[Segment, region, or count. Be specific.]

TIMELINE
[HH:MM UTC] - [Event or action taken]
[HH:MM UTC] - [Event or action taken]
[HH:MM UTC] - [Event or action taken]

CURRENT ACTIONS IN PROGRESS
- [What is being done right now, and by whom]

NEXT UPDATE
[HH:MM UTC]

WORKAROUNDS
[Any temporary steps customers or internal teams can take, or "None identified."]
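
If you would rather generate the template than fill it in by hand, a small data class can render it. This is a sketch covering only a subset of the fields above; the class name, field names, and sample values (including INC-0000) are hypothetical, and times are assumed to be timezone-aware UTC.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LiveReport:
    incident_id: str
    title: str
    severity: str
    status: str
    commander: str
    start_time: datetime  # expected to be timezone-aware UTC
    impact: str
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, event: str) -> None:
        """Append one timeline entry."""
        self.timeline.append((when, event))

    def render(self, next_update: datetime) -> str:
        """Return the report as plain text in the template's field order."""
        lines = [
            "INCIDENT REPORT (LIVE)",
            "",
            f"Incident ID: {self.incident_id}",
            f"Title: {self.title}",
            f"Severity: {self.severity}",
            f"Status: {self.status}",
            f"Incident Commander: {self.commander}",
            f"Start Time: {self.start_time:%Y-%m-%d %H:%M} UTC",
            "",
            "CURRENT IMPACT",
            self.impact,
            "",
            "TIMELINE",
        ]
        # Entries are shown in chronological order regardless of logging order.
        lines += [f"{when:%H:%M} UTC - {event}" for when, event in sorted(self.timeline)]
        lines += ["", "NEXT UPDATE", f"{next_update:%H:%M} UTC"]
        return "\n".join(lines)


start = datetime(2026, 5, 11, 14, 0, tzinfo=timezone.utc)
report = LiveReport("INC-0000", "Checkout errors", "Sev-1", "Investigating",
                    "Example Commander", start, "Checkout fails for some customers.")
report.log(start.replace(minute=5), "Rolled back latest deploy")
text = report.render(next_update=start.replace(minute=15))
```

Keeping the timeline as data, not prose, also makes it straightforward to reuse the same entries in the post-resolution report.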

The postmortem template is a separate document with its own structure and cadence.

Where Incident Reporting Is Heading

Incident reports are shifting from documents humans write under pressure to artifacts agents draft and humans approve. The drafting becomes supervised; the skill that survives is judgment about what to communicate, to whom, and when.

Teams capturing every incident report as structured data today are building the institutional memory their agents will run on tomorrow. Every well-written report makes the next one easier, the postmortem more accurate, and the next incident faster to resolve. That compounding effect is the whole point.

If you want to see how that looks in practice, book a demo and watch Autoheal draft one from a live investigation.

Final Thoughts on Structuring Production Incident Reports

Getting your incident report template right matters more than most teams realize. Clear structure, disciplined cadence, and audience-specific versions build stakeholder trust during chaos and create institutional memory that compounds across every future incident. The skill that survives automation is judgment about what to communicate, to whom, and when.

FAQ

How to write an incident report at work?

Write the live report first, then the post-resolution report. During the incident, post your first internal update within 10 minutes declaring the incident, naming the incident commander, and stating the current status (investigating, identified, mitigating, monitoring, or resolved). Update on a fixed cadence based on severity: Sev-1 every 15 minutes, Sev-2 every 30 minutes, Sev-3 every hour. After resolution, publish the full report within 60 minutes covering what broke, how long it lasted, a timeline with decision traces, preliminary root cause labeled as such, and follow-up actions with owners.

Incident report vs postmortem: what's the difference?

An incident report is written during and immediately after the incident while under pressure, capturing real-time status and initial findings. A postmortem is written a few days later with full analysis, log data, and a complete 5-Why RCA. Treating them as one document means neither gets done well: the real-time report becomes bloated with speculation, and the postmortem loses urgency because teams think they already documented it.

Can AI agents write incident reports automatically?

Yes, but with human approval. Autoheal's agents draft internal status updates at your defined cadence and assemble the consolidated post-resolution report with timeline, impact summary, and preliminary root cause pulled directly from the investigation. The incident commander still owns the report and reviews every update before it ships. The agents remove the writing burden so commanders can focus on running the response instead of formatting updates under pressure.

How do you write an incident report for different audiences?

Create three versions from a single engineering source of truth. The engineering version includes full technical detail with service names, error rates, deploy IDs, and decision traces. The leadership version covers business impact, customer count affected, duration, and follow-up actions in plain language with zero stack traces. The customer version explains what was affected, when it started and ended, what they should do now, and prevention steps, with zero internal service names and zero speculation.

What's the biggest mistake when writing incident reports?

Skipping scheduled updates when there's nothing new to report. Silence reads as chaos to stakeholders, and they'll start asking questions in the wrong channels. Post on your defined cadence even if the update says "no change, still investigating." Consistency builds more trust than a detailed update that arrives late.