Diagnosing Spring Memory Leaks with Heap Dumps and GC Root Paths

Summary

A Spring-based web application suffered severe performance degradation, marked by high swap usage and steadily growing memory consumption. An initial analysis with jmap -histo showed a very large number of primitive byte arrays ([B) and String objects, but the output lacked the application-level context needed to identify the business logic responsible for the leak. Meanwhile the system was starting to thrash: the OS was spending more time moving memory pages between RAM and swap than executing code.

Root Cause

The primary issue in this scenario is not just a memory leak, but insufficient visibility into the object graph. While jmap -histo shows what is occupying memory, it fails to show why those objects are being held in memory.

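Primitive Bloat: The top entry, [B (byte arrays), consumes nearly 20MB in the first few lines of the histogram alone, which indicates heavy buffering or raw data processing. On Java 9 and later, String contents are themselves stored in byte[], so the [B and String entries may be two views of the same data.
Lack of Reference Chains: The histogram only shows counts and sizes. It does not show the reference chain from a GC root (a thread stack, a static field, a JNI reference) through the long-lived holder, such as a static Map or a singleton Spring bean, that is preventing these byte arrays and strings from being garbage collected.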
Implicit Leaks: In Spring applications, common culprits include:
- ThreadLocals not being cleared after request execution.
- Caching layers (like ConcurrentHashMap) growing unbounded.
- Session attributes accumulating large blobs of data.
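
The first two culprits above can be guarded against with plain JDK code. The following is a minimal sketch, not the application's actual fix: LeakGuards, boundedLruCache, REQUEST_BUFFER and handleRequest are hypothetical names, and the cache limit of 2 is chosen only to make eviction visible.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; none of these names come from the original application.
final class LeakGuards {

    // Access-ordered LinkedHashMap that evicts the least recently used entry
    // once maxEntries is exceeded, so the cache cannot grow without bound.
    static <K, V> Map<K, V> boundedLruCache(int maxEntries) {
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        });
    }

    // Per-thread request state. On pooled threads it outlives the request unless removed.
    static final ThreadLocal<byte[]> REQUEST_BUFFER = new ThreadLocal<>();

    static int handleRequest(byte[] payload) {
        REQUEST_BUFFER.set(payload);
        try {
            return REQUEST_BUFFER.get().length; // stand-in for real request processing
        } finally {
            REQUEST_BUFFER.remove(); // always clear, even if processing throws
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = boundedLruCache(2);
        cache.put("a", new byte[1024]);
        cache.put("b", new byte[1024]);
        cache.put("c", new byte[1024]); // evicts "a", the least recently used key
        System.out.println(cache.keySet());              // prints [b, c]
        System.out.println(handleRequest(new byte[8]));  // prints 8
        System.out.println(REQUEST_BUFFER.get());        // prints null
    }
}
```

In a real service a dedicated caching library with size and time-based eviction is usually a better choice than a hand-rolled map; the point here is only that every long-lived collection needs an upper bound and every ThreadLocal needs a matching remove().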

Why This Happens in Real Systems

In production environments, memory issues rarely manifest as a single “smoking gun” object. Instead, they appear as fragmented growth:

High Object Cardinality: Systems process millions of small objects (Strings, Integers, Map Nodes). When these are accidentally attached to a long-lived lifecycle, the overhead of the object headers and pointers becomes significant.
Resource Exhaustion via Swap: When the JVM's resident memory outgrows the available physical RAM, the OS moves “cold” memory pages to swap. Once swap is heavily used, the JVM’s Stop-the-World (STW) garbage collection cycles can take orders of magnitude longer, because the collector touches pages across the whole heap and every swapped-out page must first be read back from disk.
Abstraction Layers: Modern frameworks like Spring and Netty add layers of abstraction (e.g., ByteBuf, ResourceEntry). This makes raw heap dumps look like a sea of framework internal classes rather than business objects.
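
To put a number on the swap point above, the resident and swapped sizes of a process can be read from /proc on Linux. This is a small sketch under that assumption; ProcMemory and fieldKb are hypothetical names, and the parser only relies on the "Field:  value kB" layout of /proc/<pid>/status.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; Linux only, because it relies on the /proc filesystem.
final class ProcMemory {

    // Returns the value in kB for a field such as "VmRSS" or "VmSwap", or -1 if absent.
    static long fieldKb(List<String> statusLines, String field) {
        String prefix = field + ":";
        for (String line : statusLines) {
            if (line.startsWith(prefix)) {
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Pass the JVM's pid as the first argument, or omit it to inspect this process.
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(Paths.get("/proc", pid, "status"));
        System.out.println("VmRSS  kB: " + fieldKb(lines, "VmRSS"));
        System.out.println("VmSwap kB: " + fieldKb(lines, "VmSwap"));
    }
}
```

A VmSwap value that keeps climbing while the heap settings stay the same suggests the process as a whole no longer fits in RAM.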

Real-World Impact

Latency Spikes: Increased GC pause times lead to request timeouts and “flapping” health checks.
Cascading Failures: As one instance begins swapping, its response time increases, causing upstream services to retry, which further increases the load on the struggling instance.
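OOM Killer Intervention: Once physical memory and swap are exhausted, the Linux Out-Of-Memory (OOM) Killer picks a victim using a badness score that is driven mainly by memory usage, so the JVM with its large resident set size (RSS) is the likely target. The process is killed outright, with no chance of a graceful shutdown.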

Example or Code

To move beyond a histogram, you must capture a full Heap Dump and analyze the Path to GC Roots.

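# <pid> is the process ID of the JVM (list candidates with `jcmd -l` or `jps`).

# 1. Capture a full heap dump when memory is high.
#    The "live" option forces a full GC first, so expect a pause.
jmap -dump:live,format=b,file=heapdump.hprof <pid>

# 2. Use jcmd to check heap usage statistics
jcmd <pid> GC.heap_info

# 3. Check for the largest classes using jmap (histogram version)
jmap -histo:live <pid> | head -n 20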
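
When the JDK command-line tools are not available next to the process (a slim container image, for example), a dump can also be triggered from inside a HotSpot JVM. The sketch below assumes a HotSpot-based runtime and uses the JDK's HotSpotDiagnosticMXBean; HeapDumper and the output file name are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

// HeapDumper is a hypothetical helper name, not part of the original application.
final class HeapDumper {

    // Writes an .hprof file to target. liveOnly=true dumps only reachable objects,
    // mirroring the "live" option of jmap -dump. Fails if the file already exists.
    static Path dump(Path target, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        // Assumed location and name; keep the .hprof suffix so analysis tools recognise the file.
        Path out = Paths.get(System.getProperty("java.io.tmpdir"),
                "heapdump-" + System.currentTimeMillis() + ".hprof");
        System.out.println("Wrote " + dump(out, true));
    }
}
```

As with jmap, the dump pauses the application and the file can be as large as the heap, so the target directory needs enough free space.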

How Senior Engineers Fix It

Senior engineers move from “what is the size” to “who owns this object.”

Heap Dump Analysis: Use tools like Eclipse MAT (Memory Analyzer Tool) or VisualVM. The most critical step is using the “Path to GC Roots” feature to find the object holding the reference.
Leak Suspect Reports: In MAT, run the “Leak Suspects Report” to automatically identify large accumulation points.
Allocation Profiling: If the heap dump does not make the owner obvious, use async-profiler or JProfiler in a staging environment to see, while the application runs, which code paths are doing the allocating.
Observability: Implement Micrometer metrics to track heap usage (Eden, Survivor, Old Gen) and alert when the Old Generation fails to decrease after a Full GC.
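
The alert condition in the Observability point can be illustrated without any metrics library. This sketch reads the JDK's MemoryPoolMXBean directly rather than going through Micrometer; OldGenWatch, shouldAlert and the 0.8 threshold are assumptions made for illustration, and the pool-name match ("Old" or "Tenured") is a heuristic because pool names differ between collectors.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper; names and threshold are illustrative only.
final class OldGenWatch {

    // Fraction of the pool still occupied after the last collection; -1 if max is undefined.
    static double occupancy(long usedBytes, long maxBytes) {
        return maxBytes > 0 ? (double) usedBytes / maxBytes : -1.0;
    }

    // True when post-GC occupancy is known and at or above the threshold.
    static boolean shouldAlert(long usedAfterGc, long maxBytes, double threshold) {
        double ratio = occupancy(usedAfterGc, maxBytes);
        return ratio >= 0 && ratio >= threshold;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            boolean oldGen = pool.getType() == MemoryType.HEAP
                    && (name.contains("Old") || name.contains("Tenured"));
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (!oldGen || afterGc == null) {
                continue;
            }
            System.out.printf("%s: %d bytes after last GC, alert=%b%n",
                    name, afterGc.getUsed(),
                    shouldAlert(afterGc.getUsed(), afterGc.getMax(), 0.8));
        }
    }
}
```

Measuring usage after a collection, rather than instantaneous usage, is what separates a leak from ordinary garbage waiting to be collected: a healthy old generation drops back after a full GC, a leaking one does not.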

Why Juniors Miss It

Focusing on the Symptom, Not the Cause: Juniors often try to “fix” the issue by increasing -Xmx (Max Heap Size). This merely delays the inevitable and can make things worse: a larger heap gives the leak more room to grow, can lengthen full GC pauses, and pushes the process deeper into swap if the host has no spare RAM.
Misinterpreting Histograms: They see java.lang.String at the top and assume it is a “Java problem,” rather than looking for the holding collection (e.g., a HashMap inside a Service bean).
Ignoring the OS Layer: They overlook the significance of Swap usage. A high swap rate is a critical signal that the application has exceeded the physical memory limits of the container or VM, regardless of what the JVM heap settings say.