OOM Killed: How SREs Actually Debug Memory Failures (June 2026)
Learn how SREs actually debug OOM killed memory failures, from kernel logs to cgroup limits. Real debugging methods for exit code 137 issues. May 2026.
The OOM killer fires. Your pod shows exit code 137 and reason OOMKilled in kubectl describe. You assume the Java process inside leaked memory, so you start profiling the heap. Two hours later you find out a noisy neighbor on the same node exhausted physical RAM and the kernel picked your container because it had the highest oom_score. The OOM-killed pod was a victim, not the cause. The real problem was host-level memory pressure, and the only way to know that for sure is reading the actual OOM killer logs in dmesg before you start fixing anything.
TLDR:
Exit code 137 means SIGKILL, but doesn't prove OOM: check dmesg or /var/log/kern.log for kernel entries.
Host-level vs. cgroup OOM kills require different fixes: one needs node resizing, the other container limits.
The killed process isn't always the culprit: scan the full RSS table for memory hogs before debugging.
JVM, Go, Node, and Python all leak off-heap memory outside their runtime limits, breaching cgroup ceilings.
Autoheal links kernel logs, cgroup limits, and deploy history via the PCG to inherit reasoning from past failures.
Confirming It Was Actually an OOM Kill
A crashed process and a killed process look identical from the outside. Both disappear. The first job is confirming the Out of Memory (OOM) killer actually pulled the trigger.
On a Linux host, run dmesg | grep -i "oom" or check /var/log/kern.log. If the kernel invoked the OOM killer, you'll see a line containing Out of memory: Killed process followed by the process name and PID. No log entry, no OOM kill.
In Kubernetes, kubectl describe pod <name> surfaces a lastState block with reason OOMKilled and exit code 137. That number is 128 plus signal 9 (SIGKILL). But here's the caveat worth remembering: any SIGKILL produces a 137. A preemption, a manual kill -9, or a deployment timeout all leave the same exit code. Always cross-reference the kernel log before assuming OOM. The exit code tells you how the process died. Only dmesg tells you why.
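The 128-plus-signal arithmetic can be checked with a short Python sketch. The helper name describe_exit_code is hypothetical, and the output only says which signal ended the process, not who sent it or why.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} (128 + {signum})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(1))
```

A result of SIGKILL here is only the starting point: the kernel log still has to confirm that the OOM killer was the sender.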
Two Different OOM Kills: Kernel vs. Cgroup Limit
Once you've confirmed an OOM kill happened, the next question is which kind. There are two, and they point to completely different problems.
A host-level OOM kill fires when the entire node exhausts physical memory. The kernel's OOM killer picks a victim based on oom_score, which reflects how much memory the process is using, shifted up or down by its oom_score_adj value. Your container might be perfectly within its own limits and still get killed because a noisy neighbor consumed everything else on the node.
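To see how the kernel currently ranks processes on a node, the scores can be read straight from procfs. The sketch below assumes a Linux host where /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj are readable; the function names and the proc_root parameter are hypothetical, added so the logic can be pointed at a test directory.

```python
from pathlib import Path


def read_oom_values(pid, proc_root="/proc"):
    """Return the kernel's current oom_score and oom_score_adj for one PID."""
    base = Path(proc_root) / str(pid)
    return {
        "oom_score": int((base / "oom_score").read_text().strip()),
        "oom_score_adj": int((base / "oom_score_adj").read_text().strip()),
    }


def rank_by_oom_score(proc_root="/proc"):
    """List (oom_score, pid, oom_score_adj) tuples, most likely victim first."""
    ranked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            values = read_oom_values(entry.name, proc_root)
        except (OSError, ValueError):
            continue  # the process exited or its files are not readable
        ranked.append(
            (values["oom_score"], int(entry.name), values["oom_score_adj"])
        )
    return sorted(ranked, reverse=True)
```

Calling rank_by_oom_score()[:5] on a node gives the five processes the kernel would consider first if memory ran out at that moment.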
A cgroup OOM kill is narrower. The container hits the memory ceiling defined in its own cgroup, and the cgroup memory controller terminates it while the node still has gigabytes free. The kernel log will reference memory cgroup out of memory rather than a system-wide Out of memory event.
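The cgroup itself keeps a record of how often it hit its ceiling. The sketch below assumes cgroup v2, where memory.max holds the limit (or the literal string max), memory.current holds usage in bytes, and memory.events holds counters including oom_kill. The functions take file contents as text so the file location stays an assumption of the caller; the helper names are hypothetical.

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events content into a dict of counters."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def parse_memory_max(text):
    """Return the limit in bytes, or None when the cgroup is unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def headroom_bytes(current_text, max_text):
    """Bytes left before the cgroup ceiling, or None when there is no ceiling."""
    limit = parse_memory_max(max_text)
    if limit is None:
        return None
    return limit - int(current_text.strip())


# Illustrative file contents, not captured from a real system.
SAMPLE_EVENTS = "low 0\nhigh 0\nmax 42\noom 3\noom_kill 3\n"
print(parse_memory_events(SAMPLE_EVENTS)["oom_kill"])
print(headroom_bytes("500000000\n", "536870912\n"))
```

A non-zero oom_kill counter alongside plenty of free node memory is consistent with the cgroup-level kill described above.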
Both produce OOMKilled in kubectl describe pod. Both yield exit code 137. But the fixes diverge sharply: a host-level kill means you need to resize the node, evict workloads, or tune oom_score_adj to protect critical processes. A cgroup kill means the container's memory limit is too low, or the process inside has a leak. Confusing the two sends you debugging the wrong layer entirely.
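Since the kernel log wording is what separates the two cases, a small classifier can make the distinction explicit. This is a sketch under stated assumptions: the regular expression targets the two message prefixes quoted in this article, the sample lines are simplified illustrations of that shape instead of captured output, and the names OOM_LINE, NEXT_STEP, and classify_oom_line are hypothetical.

```python
import re

# The cgroup alternative comes first so it wins over the shorter
# "Out of memory" phrase that it also contains.
OOM_LINE = re.compile(
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]*)\)",
    re.IGNORECASE,
)

# Next steps paraphrased from the article text.
NEXT_STEP = {
    "host": "resize the node, evict workloads, or tune oom_score_adj",
    "cgroup": "raise the container memory limit or look for a leak",
}


def classify_oom_line(line):
    """Return scope, pid, and process name for an OOM kill line, else None."""
    match = OOM_LINE.search(line)
    if match is None:
        return None
    scope = "cgroup" if "cgroup" in match.group("scope").lower() else "host"
    return {
        "scope": scope,
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "next_step": NEXT_STEP[scope],
    }


# Illustrative lines only.
SAMPLE_LINES = [
    "[1234567.890123] Memory cgroup out of memory: Killed process 4242 (java)",
    "[1234999.000001] Out of memory: Killed process 777 (postgres)",
    "[1235000.000000] eth0: link up",
]
for sample in SAMPLE_LINES:
    print(classify_oom_line(sample))
```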
Reading the OOM Killer's Report
When the OOM killer fires, it dumps a detailed report into the kernel log. The first line tells you which type you're dealing with: Memory cgroup out of memory points to a container hitting its own ceiling, while a bare Out of memory means the node itself ran dry. Previous sections covered why that distinction matters, so here we'll focus on what comes next in the output.
Below the header, the kernel prints a table of candidate processes showing each one's PID, UID, RSS (resident set size), and oom_score_adj, among other columns. The process with the highest computed oom_score gets selected. RSS is the column that matters most: it tells you how much physical memory each process actually held at the moment the kill decision was made. The table reports RSS in pages, not bytes, so multiply by the page size (commonly 4 KiB) before comparing it to a limit.
Here's the trap most engineers fall into: they assume the killed process is the one that caused the problem. Often it isn't. The OOM killer selects the largest eligible victim to reclaim the most memory in one shot. A small, misbehaving process can slowly exhaust available memory, and the kernel responds by killing your database or web server because those hold the biggest RSS footprint. Always scan the full candidate table for processes with suspiciously high or rapidly growing RSS, not only the one that caught the SIGKILL.
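Scanning the candidate table by hand is error-prone, so here is a minimal parsing sketch that ranks every listed process by RSS. It reads column names from the table's own header row, because the exact column set varies between kernel versions. The sample lines are illustrative, the 4096-byte page size is an assumption, and parse_oom_table is a hypothetical helper name.

```python
import re

# Matches the dmesg timestamp prefix, e.g. "[ 9876.543210] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def parse_oom_table(log_lines, page_size=4096):
    """Return the OOM report's candidate processes sorted by RSS, largest first."""
    columns = None
    rows = []
    for raw in log_lines:
        line = TIMESTAMP.sub("", raw.strip())
        flat = line.replace("[", " ").replace("]", " ")
        if columns is None:
            fields = flat.split()
            if fields[:2] == ["pid", "uid"]:
                columns = fields  # header row found
            continue
        # The last column (name) keeps any embedded spaces.
        parts = flat.split(None, len(columns) - 1)
        if len(parts) != len(columns) or not parts[0].isdigit():
            break  # first non-table line ends the report
        row = dict(zip(columns, parts))
        rows.append({
            "pid": int(row["pid"]),
            "name": row["name"].strip(),
            "rss_bytes": int(row["rss"]) * page_size,  # rss is in pages
            "oom_score_adj": int(row["oom_score_adj"]),
        })
    return sorted(rows, key=lambda r: r["rss_bytes"], reverse=True)


# Illustrative report fragment, not captured from a real system.
SAMPLE = [
    "[ 9876.543210] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name",
    "[ 9876.543215] [   1201]   999  1201  2500000   780000        6500000        0             0 postgres",
    "[ 9876.543220] [   4321]  1000  4321   400000   150000        1400000        0           900 report-gen",
    "[ 9876.543225] [   5150]  1000  5150   120000    20000         300000        0          1000 sidecar",
    "[ 9876.543230] Out of memory: Killed process 1201 (postgres)",
]
for proc in parse_oom_table(SAMPLE):
    print(proc["name"], proc["rss_bytes"] // (1024 * 1024), "MiB")
```

Comparing this ranking across two or more reports shows which process is growing, which is often not the one that was killed.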
Kubernetes Specifics: Requests, Limits, and QoS
When you set a memory limit on a container spec, Kubernetes translates it directly into a cgroup memory ceiling. That's the boundary covered earlier. Requests, on the other hand, affect scheduling: the scheduler uses them to decide which node can fit the pod, but they don't cap consumption at runtime. A pod requesting 256Mi and limiting at 512Mi gets scheduled based on 256Mi, then allowed to consume up to 512Mi before the cgroup kills it.
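The 256Mi and 512Mi example works out as follows in bytes. This sketch handles only a subset of Kubernetes quantity suffixes (Ki, Mi, Gi, Ti and k, M, G, T), and both helper names are hypothetical.

```python
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '256Mi' to bytes (subset of suffixes only)."""
    # Binary suffixes are checked first; they never collide with decimal ones
    # because they end in "i".
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def memory_bounds(request: str, limit: str) -> dict:
    """Show what the scheduler reserves versus where the cgroup ceiling sits."""
    request_bytes = parse_memory_quantity(request)
    limit_bytes = parse_memory_quantity(limit)
    return {
        "scheduled_on_bytes": request_bytes,
        "killed_at_bytes": limit_bytes,
        "burst_room_bytes": limit_bytes - request_bytes,
    }


print(memory_bounds("256Mi", "512Mi"))
```

The burst room is memory the pod may use but the scheduler never reserved, which is why heavily overcommitted nodes drift toward host-level OOM kills.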
Kubernetes assigns each pod a Quality of Service (QoS) class based on how requests and limits are configured. Guaranteed pods have requests equal to limits for every container. Burstable pods have at least one request set but don't match limits across the board. BestEffort pods specify neither. The kubelet maps these classes to oom_score_adj values: Guaranteed gets -997, BestEffort gets 1000, and Burstable falls somewhere between. Under node memory pressure, the kernel kills BestEffort pods first.
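The classification rules can be expressed as a short function. This is a simplified sketch, not the kubelet's implementation: containers are plain dicts with optional requests and limits, quantities are compared as strings, and the Burstable oom_score_adj is left out because the article only says it falls between the other two. The names qos_class and OOM_SCORE_ADJ are hypothetical.

```python
RESOURCES = ("cpu", "memory")

# Values quoted in the article; Burstable is computed per container
# and is deliberately not modelled here.
OOM_SCORE_ADJ = {"Guaranteed": -997, "BestEffort": 1000}


def qos_class(containers):
    """Classify a pod from a list of {'requests': {...}, 'limits': {...}} dicts."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            # String comparison: "1" and "1000m" would count as different here.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


pod = [{"requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}}]
cls = qos_class(pod)
print(cls, OOM_SCORE_ADJ.get(cls, "between the two extremes"))
```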
A pod with no memory limit isn't safe. Nothing caps it at the cgroup level, so it can keep growing until the node itself runs short. When node pressure spikes, the kubelet's eviction manager targets it based on QoS class, and because BestEffort pods carry the highest oom_score_adj, they're first in line.
The tradeoff is real: limits set too low trigger constant OOM kills on healthy workloads, while limits set too high let a leaking process consume memory for hours before anything notices. Neither failure mode is obvious from kubectl describe pod alone.
Runtime-Specific Causes: JVM, Go, Node, Python
Each runtime has its own memory model, and each creates distinct OOM failure patterns.
Java Virtual Machine (JVM) processes often trigger the OOM killer despite having internal heap limits because off-heap allocations (thread stacks, direct byte buffers, native memory from JNI) grow outside the -Xmx boundary. A container with 4 GB allocated and -Xmx3g set can still breach its cgroup limit once metaspace and thread overhead are factored in.

Go's garbage collector doesn't enforce a hard memory cap by default. The GOMEMLIMIT soft target (introduced in Go 1.19) helps, but goroutine leaks and unbounded slice growth still push RSS well past expectations, especially under bursty load.

Node.js relies on V8's heap limit, configurable via --max-old-space-size. Buffer allocations and native addons sit outside that limit, creating the same off-heap blind spot JVM engineers face.

Python has no built-in memory ceiling, and its allocator rarely returns memory to the OS after objects are freed, inflating resident set size in long-running processes. Extensions written in C can leak without any visibility from Python-level profiling.
| Runtime | Configured Memory Limit | Common Off-Limit Cause |
|---|---|---|
| JVM | Heap size set via -Xmx | Thread stacks, direct byte buffers, native memory from JNI, and metaspace grow outside the heap boundary |
| Go | Soft target via GOMEMLIMIT | Goroutine leaks and unbounded slice growth push RSS past expectations under bursty load |
| Node.js | V8 heap limit via --max-old-space-size | Buffer allocations and native addons consume memory outside the V8 heap ceiling |
| Python | No built-in runtime memory ceiling | The allocator rarely returns memory to the OS after freeing objects, and C extensions leak without Python-level visibility |
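The JVM case is easy to check with arithmetic. The sketch below adds up a rough footprint for a container with a 4 GiB limit and a 3 GiB heap. Every non-heap figure (256 MiB of metaspace, 400 threads at 1 MiB of stack each, 512 MiB of direct buffers) is a hypothetical value chosen for illustration, and real JVMs carry further native overhead that is not modelled here.

```python
def jvm_footprint_mib(heap_max_mib, metaspace_mib, thread_count,
                      thread_stack_mib, direct_buffer_mib):
    """Rough upper bound on JVM memory: heap plus the main off-heap areas."""
    thread_stacks_mib = thread_count * thread_stack_mib
    return heap_max_mib + metaspace_mib + thread_stacks_mib + direct_buffer_mib


CONTAINER_LIMIT_MIB = 4096  # the 4 GB container from the text

# Hypothetical workload figures, for illustration only.
footprint = jvm_footprint_mib(
    heap_max_mib=3072,      # -Xmx3g
    metaspace_mib=256,
    thread_count=400,
    thread_stack_mib=1,
    direct_buffer_mib=512,
)
overshoot = footprint - CONTAINER_LIMIT_MIB
print(f"estimated footprint: {footprint} MiB, over limit by {overshoot} MiB")
```

With these numbers the heap never exceeds its own cap, yet the process as a whole lands 144 MiB over the cgroup ceiling.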
Distinguishing a Leak from Undersizing
After an OOM kill, the fix depends entirely on one question: does RSS grow unbounded over time, or does it spike only under expected load?
Plot container memory usage across several hours or days. If RSS climbs steadily between requests, independent of traffic, you're looking at a leak. The process accumulates memory it never releases, and raising the limit only postpones the next kill. Restart cadence confirms the pattern: a pod that gets OOM killed at roughly the same interval after each restart, regardless of load, is leaking.
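Restart cadence can be checked numerically by measuring how regular the gaps between kills are. This sketch uses the coefficient of variation of those gaps; the 0.2 threshold, the timestamps, and the function names are all assumptions for illustration, not a calibrated rule.

```python
from datetime import datetime
from statistics import mean, pstdev


def restart_intervals_hours(kill_times):
    """Hours between consecutive OOM kills, in chronological order."""
    ordered = sorted(kill_times)
    return [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]


def looks_periodic(kill_times, max_cv=0.2):
    """True when kill intervals are regular enough to suggest a steady leak."""
    intervals = restart_intervals_hours(kill_times)
    if len(intervals) < 2:
        return False  # not enough kills to judge a pattern
    average = mean(intervals)
    return average > 0 and pstdev(intervals) / average <= max_cv


# Hypothetical kill timestamps.
STEADY = [datetime(2026, 5, 1, h, m) for h, m in [(2, 0), (8, 10), (14, 5), (20, 15)]]
ERRATIC = [datetime(2026, 5, 1, h, m) for h, m in [(2, 0), (3, 0), (15, 0), (16, 30)]]
print(looks_periodic(STEADY), looks_periodic(ERRATIC))
```

Kills roughly every six hours regardless of load point toward a leak; erratic gaps suggest the kills follow traffic instead.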
If RSS tracks proportionally with request volume and stabilizes when traffic flattens, the container is undersized for its workload. Raising the limit or scaling horizontally is the correct response. The distinction matters because one answer sends you into heap profiling and allocation tracing, while the other sends you into capacity planning.
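One way to make the leak-versus-undersizing call less subjective is to compare how strongly RSS correlates with traffic versus with elapsed time. The sketch below is a rough heuristic under stated assumptions: samples are evenly spaced, the 0.8 threshold is arbitrary, the series are invented for illustration, and a leak that happens to coincide with rising traffic would be misread as undersizing.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def classify_growth(rss_mib, requests_per_s, threshold=0.8):
    """Label a memory series as 'undersized', 'leak', or 'inconclusive'."""
    elapsed = list(range(len(rss_mib)))  # assumes evenly spaced samples
    if pearson(requests_per_s, rss_mib) >= threshold:
        return "undersized"  # memory follows load
    if pearson(elapsed, rss_mib) >= threshold:
        return "leak"  # memory follows the clock, not the load
    return "inconclusive"


# Hypothetical series: RSS climbs steadily while traffic oscillates.
LEAK_RSS = [200, 230, 260, 290, 320, 350, 380, 410]
LEAK_TRAFFIC = [50, 80, 40, 90, 45, 85, 50, 80]
# Hypothetical series: RSS rises and falls with traffic.
LOAD_RSS = [300, 400, 500, 600, 605, 600, 440, 320]
LOAD_TRAFFIC = [50, 100, 150, 200, 200, 200, 120, 60]

print(classify_growth(LEAK_RSS, LEAK_TRAFFIC))
print(classify_growth(LOAD_RSS, LOAD_TRAFFIC))
```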
Where AI-Assisted Debugging Fits
Most of the steps covered in this article involve jumping between tools: dmesg output, kubectl describe, Grafana dashboards, deployment histories, and runtime configs. An AI agent can pull all of those sources in parallel, link the kernel log entry with the specific cgroup, match it against a recent deploy, and surface the RSS trend in seconds. That compression matters most at 3 AM, when the human on call is context-switching across six browser tabs.
The honest limit is straightforward. An agent reasons over what it can read. If it lacks access to the kernel log or the container's memory metrics, it's guessing with the same blind spots you'd have. Granting that read access is a governance decision for autonomous agents, not a convenience toggle, especially in environments with strict data boundaries.
How Autoheal Correlates OOM Kills Across the Stack
The method stays the same regardless of tooling: identify which memory boundary was crossed, confirm whether the cause is a leak or undersizing, then act. Autoheal, as AI for SRE, runs that correlation across kernel signals, cgroup limits, recent deploys, and runtime context with the governance controls production requires. The Production Context Graph (PCG) captures decision traces from every past OOM resolution, so agent investigation #400 inherits the reasoning from every prior memory failure. That's compounding institutional memory, not a one-shot diagnostic.
Final Thoughts on Resolving OOM Kills
The playbook doesn't change: confirm the kill type, read the kernel's victim selection logic, check whether RSS growth tracks with load or with time, then act. What does change is how fast you can move through those steps when the tooling stops making you context-switch between six different data sources. Your next OOM kill will happen, and when it does, you'll either be ready with the full picture or you'll be grep-ing logs at 3 AM. Book a demo to see how Autoheal correlates all of it without the manual assembly work.
Frequently Asked Questions About OOM Kills
What's the difference between a kernel OOM kill and a cgroup OOM kill?
A kernel OOM kill fires when the entire node exhausts physical memory, selecting victims based on oom_score across all running processes. A cgroup OOM kill fires when a single container hits its own memory limit while the node still has available memory. Both produce exit code 137 and show OOMKilled in Kubernetes, but kernel kills require node-level fixes (resize, evict workloads, tune oom_score_adj) while cgroup kills require container-level fixes (raise limits or fix memory leaks).
How do I check OOM killer logs on Linux?
Run dmesg | grep -i "oom" or check /var/log/kern.log for entries containing Out of memory: Killed process followed by the process name and PID. For Kubernetes pods, use kubectl describe pod <name> to check the lastState block for reason OOMKilled. Always cross-reference both sources because exit code 137 can result from any SIGKILL, not only OOM kills.
Can I tell if a memory leak caused the OOM kill or if the container was just undersized?
Plot container memory usage (RSS) across several hours or days. If RSS climbs steadily between requests, independent of traffic, you have a leak. If RSS tracks proportionally with request volume and stabilizes when traffic flattens, the container is undersized. A pod OOM killed at roughly the same interval after each restart, regardless of load, confirms a leak.
Why was my container killed when it was within its memory limit?
Your container likely hit a host-level OOM kill, not a cgroup kill. Even if your container is within its own limit, the kernel's OOM killer targets processes when the entire node exhausts physical memory. Check dmesg for Out of memory (host-level) versus memory cgroup out of memory (container-level) to confirm which boundary was crossed.
Does Kubernetes QoS class affect which pod gets killed first during memory pressure?
Yes. Kubernetes assigns each pod a QoS class (Guaranteed, Burstable, or BestEffort) based on its requests and limits, then maps these to oom_score_adj values. BestEffort pods receive oom_score_adj 1000 and are killed first under node pressure, Guaranteed pods receive -997 and are the last to be chosen, and Burstable pods fall in between. The kernel's OOM killer applies these scores when the node runs out of memory.
