Diagnosing Spring Memory Leaks with Heap Dumps and GC Root Paths

Summary

A Spring-based web application experienced severe performance degradation, characterized by high swap usage and increasing memory consumption. An initial analysis using jmap -histo revealed a massive number of primitive byte arrays ([B) and String objects, but the output lacked the application-level context necessary to identify the specific business logic causing the leak. The system was entering a state of thrashing, where the OS spends more time moving memory to disk (swap) than executing code.

Root Cause

The primary issue in this scenario is not just a memory leak, but insufficient visibility into the object graph. While jmap -histo shows what is occupying memory, it fails to show why those objects are being held in memory.

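  • Primitive Bloat: The top histogram entry, [B (byte arrays), accounts for nearly 20MB by itself, which points to heavy buffering or raw data processing. On Java 9 and later, String contents are also stored in byte arrays, so the two entries often grow together.
  • Lack of Reference Chains: The histogram only shows counts and sizes. It does not show the path from the GC roots (static fields, live thread stacks, JNI references) through the long-lived holders, such as a static Map or a singleton Spring bean, that keep these byte arrays and strings reachable and therefore uncollectable.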
  • Implicit Leaks: In Spring applications, common culprits include:
    • ThreadLocals not being cleared after request execution.
    • Caching layers (like ConcurrentHashMap) growing unbounded.
    • Session attributes accumulating large blobs of data.
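
Two of the culprits above have well-known defensive patterns, sketched below under stated assumptions. RequestContext and BoundedCache are hypothetical names that do not come from the affected application; the first clears a ThreadLocal in a finally block so that pooled worker threads do not retain request data, and the second caps a cache with a simple LRU policy instead of letting a map grow without limit.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical per-request context; the name and payload type are illustrative.
final class RequestContext {
    private static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<>();

    // Runs the work with a thread-bound buffer and always clears it afterwards,
    // so a pooled worker thread does not keep the buffer alive between requests.
    static <T> T withBuffer(byte[] buffer, Supplier<T> work) {
        BUFFER.set(buffer);
        try {
            return work.get();
        } finally {
            BUFFER.remove();
        }
    }

    static byte[] current() {
        return BUFFER.get();
    }
}

// Size-bounded LRU cache: once maxEntries is exceeded, the least recently
// accessed entry is evicted instead of the map growing without limit.
final class BoundedCache<K, V> {
    private final Map<K, V> map;

    BoundedCache(int maxEntries) {
        this.map = Collections.synchronizedMap(
            new LinkedHashMap<K, V>(16, 0.75f, true) { // true = access order
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    V get(K key) { return map.get(key); }
    void put(K key, V value) { map.put(key, value); }
    int size() { return map.size(); }
}
```

A synchronized LinkedHashMap is chosen here only to keep the sketch dependency-free; a dedicated caching library with size and time-based eviction is usually a better fit for a busy service.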

Why This Happens in Real Systems

In production environments, memory issues rarely manifest as a single “smoking gun” object. Instead, they appear as fragmented growth:

  • High Object Cardinality: Systems process millions of small objects (Strings, Integers, Map Nodes). When these are accidentally attached to a long-lived lifecycle, the overhead of the object headers and pointers becomes significant.
  • Resource Exhaustion via Swap: When the JVM’s memory footprint exceeds available RAM, the OS moves “cold” memory pages to swap. Once swap is heavily used, the JVM’s Stop-the-World (STW) garbage collection cycles become dramatically longer, because a collection that traverses the heap touches swapped-out pages and each one has to be read back from disk.
  • Abstraction Layers: Modern frameworks like Spring and Netty add layers of abstraction (e.g., ByteBuf, ResourceEntry). This makes raw heap dumps look like a sea of framework internal classes rather than business objects.

Real-World Impact

  • Latency Spikes: Increased GC pause times lead to request timeouts and “flapping” health checks.
  • Cascading Failures: As one instance begins swapping, its response time increases, causing upstream services to retry, which further increases the load on the struggling instance.
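  • OOM Killer Intervention: Eventually, once RAM and swap are exhausted, the Linux Out-Of-Memory (OOM) Killer terminates a process, typically the one with the highest oom_score, which is driven largely by memory usage such as resident set size (RSS). The JVM gets no chance to shut down gracefully.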

Example or Code

To move beyond a histogram, you must capture a full Heap Dump and analyze the Path to GC Roots.

# 1. Capture a full heap dump when memory is high.
#    The "live" option forces a full GC first, so expect a pause.
jmap -dump:live,format=b,file=heapdump.hprof <pid>

# 2. Use jcmd to check heap usage statistics
jcmd <pid> GC.heap_info

# 3. List the top classes by instance count and size (live objects only)
jmap -histo:live <pid> | head -n 20
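
When shell access to the host is restricted, a dump can also be triggered from inside the JVM. The following is a minimal sketch, assuming a HotSpot-based JDK where com.sun.management.HotSpotDiagnosticMXBean is available; HeapDumper is a hypothetical helper name, and how it gets invoked (an admin endpoint, a scheduled check) is left open.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;

final class HeapDumper {
    // Writes an .hprof dump of the current JVM to the given path.
    // liveOnly=true dumps only reachable objects (same idea as jmap's "live" option).
    // The target file must not already exist; dumpHeap throws IOException if it does.
    static void dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
    }
}
```

The dump is roughly the size of the used heap and the JVM pauses while it is written, so the target path should sit on a volume with enough free space.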

How Senior Engineers Fix It

Senior engineers move from “what is the size” to “who owns this object.”

  • Heap Dump Analysis: Use tools like Eclipse MAT (Memory Analyzer Tool) or VisualVM. The most critical step is using the “Path to GC Roots” feature to find the object holding the reference.
  • Leak Suspect Reports: In MAT, run the “Leak Suspects Report” to automatically identify large accumulation points.
  • Allocation Profiling: If the leak is hard to find, use async-profiler or JProfiler in a staging environment to track, in real time, where in the code the allocations are happening.
  • Observability: Implement Micrometer metrics to track heap usage (Eden, Survivor, Old Gen) and alert when the Old Generation fails to decrease after a Full GC.
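
The “Old Generation fails to decrease” signal can be approximated with the JDK’s own management beans. The sketch below uses plain java.lang.management rather than Micrometer to stay self-contained; OldGenCheck, the threshold parameter, and the pool-name matching are assumptions, and pool names vary between collectors, so the filter is only a heuristic.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class OldGenCheck {
    // Pure decision logic: true when the bytes still used after the last GC
    // exceed the given fraction of the pool's maximum size.
    static boolean retainedTooMuch(long usedAfterGc, long max, double threshold) {
        if (max <= 0) return false; // max is -1 when the pool has no defined limit
        return (double) usedAfterGc / max > threshold;
    }

    // Scans heap pools whose name looks like an old generation
    // (a naming heuristic; pool names differ between collectors).
    static boolean oldGenLooksLeaky(double threshold) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            String name = pool.getName();
            if (!name.contains("Old") && !name.contains("Tenured")) continue;
            // Usage measured after the most recent collection of this pool;
            // null if the JVM does not support it for this pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc != null
                    && retainedTooMuch(afterGc.getUsed(), afterGc.getMax(), threshold)) {
                return true;
            }
        }
        return false;
    }
}
```

A single high reading is not proof of a leak; the useful alert is the post-GC value trending upward across many collections.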

Why Juniors Miss It

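  • Focusing on the Symptom, Not the Cause: Juniors often try to “fix” the issue by increasing -Xmx (Max Heap Size). This only delays the inevitable and can make things worse: a larger heap gives the leak more room to grow, tends to lengthen full GC pauses, and pushes the process further into swap if the heap no longer fits in physical memory.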
  • Misinterpreting Histograms: They see java.lang.String at the top and assume it is a “Java problem,” rather than looking for the holding collection (e.g., a HashMap inside a Service bean).
  • Ignoring the OS Layer: They overlook the significance of Swap usage. A high swap rate is a critical signal that the application has exceeded the physical memory limits of the container or VM, regardless of what the JVM heap settings say.
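
To make the OS-level signal visible next to the JVM’s own metrics, the process’s resident and swapped memory can be read from /proc on Linux. This is a rough sketch, assuming a Linux host where /proc/self/status exposes VmRSS and VmSwap lines; ProcMemory is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class ProcMemory {
    // Extracts the number from a "<key>:   <n> kB" line of /proc/<pid>/status.
    // Returns -1 when the key is absent.
    static long kilobytes(List<String> statusLines, String key) {
        String prefix = key + ":";
        for (String line : statusLines) {
            if (line.startsWith(prefix)) {
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    // Linux only: reads the current JVM's own status file.
    static List<String> readSelfStatus() throws IOException {
        return Files.readAllLines(Paths.get("/proc/self/status"));
    }
}
```

Calling ProcMemory.kilobytes(ProcMemory.readSelfStatus(), "VmSwap") returns the swapped-out size in kB; a value that keeps growing while the heap metrics look stable suggests the process as a whole has outgrown physical memory.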
