OOMKilled at 60% Heap: When the JVM Can't See the Container's Limit
A Java service kept getting OOMKilled by Kubernetes while its own heap graphs sat calmly at 60% with hundreds of megabytes to spare. The JVM was healthy by every metric it reported. The kernel killed it anyway, because the memory the JVM does not count on the heap (off-heap buffers, thread stacks, and metaspace) pushed the whole process past a container limit the JVM was never looking at.
A Spring service ran fine in staging and then started dying in production with exit code 137 every few hours. Code 137 is 128 plus 9, which is the process getting SIGKILL, and in a container that almost always means the kernel OOM killer. Kubernetes dutifully recorded the reason as OOMKilled and restarted the pod. The confusing part was that our heap dashboards were calm. Used heap hovered around 1.2 GB against a 2 GB max, garbage collection looked healthy, and there was no OutOfMemoryError in the logs anywhere. By every number the JVM reported about itself, the process had memory to spare. The kernel disagreed and won.
Two different definitions of "out of memory"
The first thing to untangle is that a Java OutOfMemoryError and a Linux OOMKilled are completely different events. An OutOfMemoryError is the JVM noticing that a region it manages, most often the heap, is full and throwing inside your program. OOMKilled is the kernel noticing that the whole process, everything it has mapped and touched, has exceeded the container's memory cgroup limit, and killing it from outside with no warning and no stack trace. Our heap was fine, so there was no Java error. But the process as a whole had blown past the 2 GB container limit, so the kernel reached in and killed it.
The mistake baked into our config was treating "container memory limit" and "JVM heap max" as the same number. We had set the pod's memory limit to 2 GB and the heap to -Xmx2g, figuring that capping the heap at the limit was safe. It is not safe, because the heap is only one part of what a JVM process consumes.
The memory the JVM does not put on the heap graph
A Java process uses a lot of memory that never appears on a heap chart. Thread stacks are off-heap, and at roughly 1 MB each a service with a few hundred threads is already hundreds of megabytes. Metaspace, where class metadata lives, is off-heap and grows with how many classes you load, which for a Spring app with a pile of dependencies is not small. Then there are direct byte buffers used by the network and IO layers, the garbage collector's own bookkeeping structures, JIT-compiled code cache, and memory the native allocator holds onto. Add it up and the non-heap overhead can easily be 500 MB to 1 GB on top of the heap.
container limit (cgroup):        2.0  GB   <- the kernel kills past this
  |
  -Xmx (max heap):               2.0  GB   <- the only number we capped
+ thread stacks (~300 x 1 MB):   0.3  GB
+ metaspace:                     0.25 GB
+ direct buffers + GC + JIT:     0.4  GB
= actual process RSS:           ~2.95 GB   -> well over the limit
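Several of the components in that breakdown can be read from inside the JVM through the standard java.lang.management beans, which is a quick way to see how much sits outside the heap. The sketch below is only illustrative: the class name OffHeapReport is made up for this post, and the 1 MB-per-thread stack figure is an assumption carried over from the estimate above (the real value depends on -Xss and on how much of each stack is actually touched). GC bookkeeping and native allocator overhead are not visible through these beans at all.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the off-heap components the JVM can report about itself.
final class OffHeapReport {
    static final long MB = 1024L * 1024L;
    // Assumption: about 1 MB of stack per thread; the real figure depends on -Xss.
    static final long ASSUMED_STACK_BYTES = MB;

    /** Non-heap memory the JVM tracks: metaspace, class space, and the JIT code cache. */
    static long nonHeapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
    }

    /** Bytes held by direct ByteBuffers, taken from the buffer pool named "direct". */
    static long directBufferBytes() {
        long total = 0;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName()) && pool.getMemoryUsed() > 0) {
                total += pool.getMemoryUsed();
            }
        }
        return total;
    }

    /** Rough upper bound on thread stack memory: live threads times the assumed stack size. */
    static long estimatedStackBytes() {
        return (long) ManagementFactory.getThreadMXBean().getThreadCount() * ASSUMED_STACK_BYTES;
    }

    public static void main(String[] args) {
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        System.out.println("heap used (MB):          " + heapUsed / MB);
        System.out.println("non-heap used (MB):      " + nonHeapUsedBytes() / MB);
        System.out.println("direct buffers (MB):     " + directBufferBytes() / MB);
        System.out.println("thread stacks, est (MB): " + estimatedStackBytes() / MB);
    }
}
```

Exporting the last three numbers next to heap usage puts at least part of the off-heap footprint on the same dashboard as the heap graph.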
So with -Xmx2g inside a 2 GB container, the moment the heap actually grew toward its max under real load, total process memory sailed past 2 GB and the kernel killed us. In staging the heap never grew that far because traffic was lower, which is exactly why it only died in production.
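The same budget check can be written down as a few lines of arithmetic. This is a minimal sketch with the article's estimates plugged in as megabytes; the class MemoryBudget and its method names are hypothetical, and the component sizes are the rough figures from the breakdown above, not measurements.

```java
// Hypothetical helper: adds up a worst-case JVM footprint and compares it to the container limit.
final class MemoryBudget {
    /** Estimated peak footprint in MB: heap ceiling plus each off-heap component. */
    static long estimateMb(long maxHeapMb, int threads, long stackMbPerThread,
                           long metaspaceMb, long otherOffHeapMb) {
        return maxHeapMb + (long) threads * stackMbPerThread + metaspaceMb + otherOffHeapMb;
    }

    /** True if the estimated footprint stays within the container limit. */
    static boolean fits(long estimateMb, long limitMb) {
        return estimateMb <= limitMb;
    }

    public static void main(String[] args) {
        long limitMb = 2048;                                  // 2 GB container limit
        long estimate = estimateMb(2048, 300, 1, 256, 400);   // -Xmx2g plus the off-heap estimates
        System.out.println("estimated peak (MB): " + estimate);
        System.out.println("fits in limit:       " + fits(estimate, limitMb));
    }
}
```

With -Xmx equal to the limit, the check fails as soon as any off-heap term is nonzero, which is the whole problem in one line.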
The older trap: the JVM seeing the host, not the container
There is a related failure that bites people on older runtimes, and it is worth knowing because it produces the same symptom. Before the JVM became container-aware (JDK 10, backported to 8u191), the runtime read total memory from the host, not the cgroup limit. On a node with 64 GB of RAM, a JVM with no explicit -Xmx would default its max heap to a quarter of that, about 16 GB, completely ignoring the pod's 2 GB limit. It would then happily grow the heap toward a ceiling far larger than the container allowed and get OOMKilled almost immediately under load. Modern JVMs read the cgroup limit and size defaults against it, but only if you are on a recent enough version and you have not pinned an -Xmx that contradicts the limit.
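To make the difference concrete, here is a simplified sketch of the default sizing rule: the max heap is the default MaxRAMPercentage of 25 applied to whatever amount of RAM the JVM believes it has. The class name DefaultHeapSizing is made up, and the real ergonomics have extra rules (for very small memory sizes, for example) that this ignores.

```java
// Hypothetical helper: simplified model of the JVM's default max-heap choice.
final class DefaultHeapSizing {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;
    // HotSpot's default -XX:MaxRAMPercentage is 25, i.e. a quarter of visible RAM.
    static final double DEFAULT_MAX_RAM_PERCENTAGE = 25.0;

    /** Default max heap for a JVM that believes it has visibleRamBytes of memory. */
    static long defaultMaxHeapBytes(long visibleRamBytes) {
        return (long) (visibleRamBytes * DEFAULT_MAX_RAM_PERCENTAGE / 100.0);
    }

    public static void main(String[] args) {
        // A JVM that sees the 64 GB host versus one that sees the 2 GB cgroup limit.
        System.out.println("sees host (GiB):      " + defaultMaxHeapBytes(64 * GIB) / GIB);
        System.out.println("sees container (MiB): " + defaultMaxHeapBytes(2 * GIB) / MIB);
        // What this particular JVM actually chose, for comparison.
        System.out.println("this JVM (MiB):       " + Runtime.getRuntime().maxMemory() / MIB);
    }
}
```

If the last line printed inside your pod looks like a quarter of the node's RAM rather than a quarter of the pod's limit, the runtime is not container-aware.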
The knobs that looked like fixes
Raising the container limit to 3 GB while leaving -Xmx2g "fixed" it in the sense that the crashes stopped, but it is just buying headroom by guessing. Pick a number too tight and you crash again under a load spike. Pick it too loose and you are paying for memory you do not use and packing fewer pods per node. Lowering -Xmx blindly trades the OOM kill for heap pressure and more frequent garbage collection. Neither addresses the actual relationship, which is that the limit has to cover heap plus everything off-heap, and you have to leave room for the off-heap part on purpose.
The real fix
Stop sizing the heap as a fixed number and stop sizing it equal to the limit. Modern JVMs let you express the heap as a percentage of the container limit with -XX:MaxRAMPercentage, which automatically leaves the rest for off-heap use and tracks the limit if it changes. We set the heap to about 70% of the container memory and left the other 30% for stacks, metaspace, and buffers.
# container limit stays at 2Gi in the pod spec
# heap = 70% of the cgroup limit, the rest is off-heap headroom
export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0"
# verify what the JVM actually decided, from inside the container (value is in bytes):
java -XX:+PrintFlagsFinal -version | grep -i maxheapsize
# and compare it with the limit the kernel enforces (cgroup v2 path;
# on cgroup v1 it is /sys/fs/cgroup/memory/memory.limit_in_bytes):
cat /sys/fs/cgroup/memory.max
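The same comparison can be done from inside the application, for example at startup. The following is a sketch rather than anything we shipped: the class CgroupHeapCheck is hypothetical, it assumes JDK 11 or newer (for Files.readString and Path.of), and it only knows the two conventional cgroup mount paths, which may differ on your nodes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

// Hypothetical startup check: what share of the cgroup limit did the JVM give to the heap?
final class CgroupHeapCheck {
    static final Path[] LIMIT_FILES = {
        Path.of("/sys/fs/cgroup/memory.max"),                   // cgroup v2
        Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes")  // cgroup v1
    };

    /** Parses a cgroup limit file's contents; "max" (cgroup v2) means no limit is set. */
    static OptionalLong parseLimit(String raw) {
        String s = raw.trim();
        if (s.isEmpty() || s.equals("max")) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(s));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Reads the first readable limit file, or returns empty when none exists (e.g. not Linux). */
    static OptionalLong readLimit() {
        for (Path p : LIMIT_FILES) {
            try {
                if (Files.isReadable(p)) {
                    return parseLimit(Files.readString(p));
                }
            } catch (IOException ignored) {
                // fall through and try the next candidate path
            }
        }
        return OptionalLong.empty();
    }

    static double heapPercentOfLimit(long maxHeapBytes, long limitBytes) {
        return 100.0 * maxHeapBytes / limitBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        OptionalLong limit = readLimit();
        if (limit.isPresent()) {
            // Note: an unlimited cgroup v1 reports a huge number, so the share prints as ~0%.
            System.out.printf("max heap is %.1f%% of the cgroup limit%n",
                    heapPercentOfLimit(maxHeap, limit.getAsLong()));
        } else {
            System.out.println("no cgroup memory limit found; max heap bytes: " + maxHeap);
        }
    }
}
```

A result close to 100% is the misconfiguration described above; something near the configured MaxRAMPercentage means the flag took effect.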
The other half of the fix was measuring real usage instead of guessing the overhead. We watched the pod's actual container_memory_working_set_bytes, which is the number the OOM killer effectively cares about, under production load for a day, and set the request and limit from that observed number with a margin, rather than from the heap size alone. After that the working set settled around 1.7 GB against the 2 GB limit, the off-heap portion had real room, and the OOM kills stopped for good.
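Turning observed samples into a limit is simple enough to sketch. The sample values below are invented for illustration (they are not our production data), the 15% margin is an arbitrary choice, and the class name LimitFromWorkingSet is hypothetical. It takes the peak rather than an average, because the OOM killer reacts to the peak.

```java
import java.util.Arrays;

// Hypothetical helper: derives a memory limit from observed working-set samples.
final class LimitFromWorkingSet {
    /** Peak observed working set plus an integer percentage margin, rounded up, in MB. */
    static long suggestLimitMb(long[] samplesMb, int marginPercent) {
        if (samplesMb.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        long peak = Arrays.stream(samplesMb).max().getAsLong();
        // Ceiling division keeps the margin from being rounded away.
        return peak + (peak * marginPercent + 99) / 100;
    }

    public static void main(String[] args) {
        // Illustrative working-set samples in MB, e.g. scraped from container_memory_working_set_bytes.
        long[] observedMb = {1510, 1620, 1700, 1685, 1590};
        System.out.println("suggested limit (MB): " + suggestLimitMb(observedMb, 15));
    }
}
```

The margin is a judgment call: it should cover load spikes that the observation window did not happen to include.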
Why it hid
This is a quiet bug because every Java-native signal says the process is healthy. Heap usage, GC pauses, and the absence of any OutOfMemoryError all look fine, because the JVM is genuinely fine on the axis it measures. The failure lives in the gap between what the JVM counts as "memory used" and what the kernel counts, and the only place that gap is visible is the container's working set or RSS, which is a Kubernetes and OS metric, not a JVM one. If your dashboards only show heap, you are blind to the exact thing that is killing you. It hid in staging too, because lower traffic kept the heap from growing into the danger zone, so the overhead never tipped the total over the limit until real load arrived.
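One way to make that gap visible from inside the process is to put the JVM's view and the kernel's view side by side. The sketch below reads VmRSS from /proc/self/status, so it only works on Linux; the class name HeapVsRss is made up. RSS is not identical to the container working set that Kubernetes reports, but it is close enough to expose the difference.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical diagnostic: compares heap usage (JVM view) with resident set size (kernel view).
final class HeapVsRss {
    /** Extracts VmRSS in kB from the lines of /proc/<pid>/status, or -1 if the field is absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String[] parts = line.trim().split("\\s+");  // e.g. ["VmRSS:", "1789012", "kB"]
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Runtime rt = Runtime.getRuntime();
        long heapUsedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;

        Path status = Path.of("/proc/self/status");
        if (!Files.isReadable(status)) {
            System.out.println("no /proc/self/status here; this check is Linux-only");
            return;
        }
        long rssKb = parseVmRssKb(Files.readAllLines(status));
        System.out.println("heap used (kB):    " + heapUsedKb);
        System.out.println("process RSS (kB):  " + rssKb);
        System.out.println("outside heap (kB): " + (rssKb - heapUsedKb));
    }
}
```

The last number is roughly the memory that a heap-only dashboard never shows.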
Rules of thumb
- Exit code 137 with no OutOfMemoryError is the kernel OOM killer, not the JVM. It means total process memory crossed the cgroup limit, a condition the JVM never throws on.
- The container memory limit must cover heap plus off-heap: thread stacks, metaspace, direct buffers, code cache, and GC structures. That overhead is commonly 500 MB to 1 GB.
- Never set -Xmx equal to the container limit. Leave the off-heap portion real room, typically by capping the heap around 70% of the limit.
- Prefer -XX:MaxRAMPercentage over a fixed -Xmx so the heap tracks the cgroup limit instead of a hardcoded guess that drifts when limits change.
- On older runtimes, a JVM with no -Xmx may size the heap against the host's RAM, not the container limit. Confirm container awareness and check the cgroup limit from inside the pod.
- Alert on the container working set, not just heap. Heap graphs can look calm while the process is about to be killed for memory the JVM never reported.
- Size requests and limits from observed working set under real load, not from heap size alone, then leave a margin.