Introducing Autoheal, the AI for Production Engineering

Making Autohealing Production A Reality

The future of engineering is autohealing production. So what does such a future look like?

Every production incident costs you twice. There's the downtime itself, and then there's the time your engineers spend responding to it. The second cost is the one that doesn't show up on your cloud bill, but you can see it in attrition, in slower feature work, and in the late-night Slack threads your best engineers wish they weren't part of.

Most of the AI tools being built for ops today are focused on the wrong half of the problem. They help your on-call team respond faster, which is useful, but it's still just a response. We're building Autoheal around a different idea: production should heal itself over time, and each incident should make the next one cheaper to handle.

Our previous post, The path to self-driving production, laid out a roadmap for getting there. L0 is where most enterprises are today: AI usage is ad hoc when it comes to production alert and incident response. L1 is where a few engineers have repurposed local coding agents to root-cause production incidents, but alert response remains unchanged. The journey to self-driving, autohealing production truly begins at L2, where always-on background agents take on the first-responder role in an on-call rotation and investigate alerts and incidents even before the on-call engineer is ready. L3 is where institutional memory starts compounding from every alert and incident investigated. And L4 is where autohealing production becomes a reality. This post dives deeper into L2 through L4 to show exactly what changes at each level.

L2: The Healing Loop


At this level, when something breaks in production, an agent picks up the alert. It looks at the logs and metrics, checks the production context graph for what it already knows about your system, uses an AI model to reason about what's happening, and brings in a human expert when it needs to. Then it applies a mitigation.
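As a rough illustration of the steps above, here is a minimal Python sketch of one pass through the loop. Every name in it (Alert, Diagnosis, healing_loop, and the callables passed in) is hypothetical and stands in for whatever telemetry, context graph, model, and escalation integrations a real system would use. It is not Autoheal's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    service: str
    symptom: str


@dataclass
class Diagnosis:
    cause: str
    mitigation: str
    confident: bool  # False means the agent wants a human to weigh in


def healing_loop(
    alert: Alert,
    fetch_telemetry: Callable[[Alert], List[str]],
    lookup_context: Callable[[Alert], List[str]],
    reason: Callable[[Alert, List[str], List[str]], Diagnosis],
    ask_expert: Callable[[Alert, Diagnosis], Diagnosis],
    apply_mitigation: Callable[[str], None],
) -> Diagnosis:
    """Run one alert through the loop; all callables are hypothetical stand-ins."""
    telemetry = fetch_telemetry(alert)             # logs and metrics
    context = lookup_context(alert)                # what the context graph already knows
    diagnosis = reason(alert, telemetry, context)  # AI model call
    if not diagnosis.confident:
        diagnosis = ask_expert(alert, diagnosis)   # bring in a human only when needed
    apply_mitigation(diagnosis.mitigation)
    return diagnosis
```

In this sketch the expert is consulted only when the model's diagnosis is marked as not confident, which is the one branch where a person gets pulled into the loop.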

This is roughly where most teams adopting AI for ops are today, or where they're trying to get to. It works well enough. Your time to recovery drops, your engineers sleep better, and incidents become less stressful overall. But it's also expensive, because every incident burns model tokens, expert time, and a lot of queries against production systems. The cost of the hundredth incident ends up looking a lot like the cost of the first.

L3: The Continuous Learning Loop


The Healing Loop is just the starting point. The more interesting part of Autoheal is what happens after the incident is closed out.

Every incident is essentially a free training signal, and Autoheal turns each one into five kinds of actions.

  • There's a Fix PR that patches the underlying code so the bug can't happen again. 

  • There's an Observability PR that adds whatever log line or metric would have helped you detect this issue faster.

  • There's an alert-tuning change that adjusts thresholds or dedupes noisy alerts so you only get paged when something actually matters. 

  • There's a decision trace addition that captures net-new information learned from experts during the incident that wasn't already in the context. This is where institutional memory finally gets captured after years of living only in engineers' minds, Slack threads, and Jira comments.

  • And there's a Skill PR that captures the diagnostic and mitigation steps as a reusable skill, written into the production context graph.

This is the loop most teams skip, and skipping it is what keeps incident response expensive forever. With this loop running, every incident makes the system a little better at handling the next one, either by preventing the failure mode entirely or by making the next response faster and cheaper.
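One way to keep this loop from being skipped is to track the five follow-ups explicitly for every closed incident. The sketch below is an assumption about how that could be modeled, not a description of Autoheal's internals; the FollowUp enum and the outstanding helper are hypothetical names.

```python
from enum import Enum
from typing import Iterable, List


class FollowUp(Enum):
    """The five kinds of post-incident actions listed above."""
    FIX_PR = "fix_pr"
    OBSERVABILITY_PR = "observability_pr"
    ALERT_TUNING = "alert_tuning"
    DECISION_TRACE = "decision_trace"
    SKILL_PR = "skill_pr"


def outstanding(done: Iterable[FollowUp]) -> List[FollowUp]:
    """Return the follow-ups not yet produced, in declaration order."""
    finished = set(done)
    return [kind for kind in FollowUp if kind not in finished]
```

An incident with a merged Fix PR and Skill PR would still show three outstanding items here, which makes a skipped loop visible instead of silent.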

L4: The Autohealing Loop


After enough cycles through the previous two levels, the shape of the diagram starts to change. The production context graph has accumulated enough skills, decision traces, and catalog data that it can do most of the work itself. The agent reaches for the context graph first, calls the AI model with fewer tokens, and rarely needs to ask an expert at 3am.

From the outside, the behavior looks more or less the same. An alert comes in, a mitigation goes out. But the underlying economics have shifted. An incident that would have triggered a Sev-1 war room six months ago now tends to resolve itself before anyone notices. Humans only get involved in failure modes the system genuinely hasn't seen before, and those become the next round of training data.
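The shift in economics can be sketched as a lookup that tries accumulated knowledge first and escalates only on a miss. This is a deliberately simplified illustration under stated assumptions: a plain dict stands in for the production context graph, incidents are matched by an exact string signature, and resolve and escalate are hypothetical names.

```python
from typing import Callable, Dict, Tuple


def resolve(
    signature: str,
    skills: Dict[str, str],
    escalate: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (mitigation, source) for a failure signature.

    `skills` is a stand-in for the context graph; `escalate` stands in for
    the expensive path (model reasoning and, if needed, a human expert).
    """
    if signature in skills:
        # Seen before: answer from accumulated knowledge, no escalation.
        return skills[signature], "context_graph"
    mitigation = escalate(signature)
    # Record the outcome so the next occurrence takes the cheap path.
    skills[signature] = mitigation
    return mitigation, "escalated"
```

The first occurrence of a signature pays for escalation; every later occurrence is served from the stored skill, which is the sense in which the hundredth incident stops costing as much as the first.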

What this means for engineering leaders

Most AI ops budgets right now are getting spent on L2, on the assumption that faster response is the destination. We don't think it is. L2 buys you marginal improvements in time to recovery at a steady marginal cost. L3 is where the compounding actually starts, and L4 is where on-call stops being a tax on your engineering organization.

If you're evaluating tools in this space, the more useful question isn't whether the tool can respond. The better question is what the loop looks like 30 days in. Is the system getting cheaper to run as it sees more incidents? Is your page rate trending down? Is the context graph actually growing? If the answer to those is no, you're paying for an assistant. If the answer is yes, you're paying for autonomy.

That's what we're building Autoheal to be.