Understanding unexpected memory pressure in container clusters

Summary

During a high-traffic event, our cluster experienced an unexpected spike in memory pressure despite the number of running containers remaining constant. Investigation revealed a misunderstanding of how containerization interacts with OS-level memory management. While modern operating systems use page sharing to map multiple processes to a single physical copy of a binary image in RAM, the behavior in containerized environments is more nuanced due to layering and filesystem isolation.

Root Cause

The issue stems from the distinction between process-level sharing (memory pages in RAM) and storage-level sharing (image layers on disk).

  • The OS Level: When a standard OS runs two instances of /bin/bash, the kernel maps the same physical page-cache pages (the file's read-only text segment) into both processes. This is shared file-backed mapping. Demand paging is the related but separate mechanism that loads those pages lazily on first access.
  • The Container Level: Containers use Union File Systems (like OverlayFS). The underlying image layers are read-only and shared on disk, but whether their pages are also shared in RAM depends on how the container runtime (Docker/Podman) and its storage driver present those layers to the kernel.
  • The Divergence: The kernel does not deduplicate file pages by content. It shares them only when processes map the same inode. If a file that was supposed to be shared is copied into a container's writable layer (a Copy-on-Write, or copy-up, operation), or if the storage driver presents each container with its own copy of the file, the memory savings vanish.
  • The Execution: When an executable is loaded, the kernel looks at the inode. If multiple containers are running the exact same binary from the exact same underlying inode on the same host, the kernel will share the memory pages. However, if the container setup involves unique filesystem mounts or different storage drivers, the kernel may treat them as distinct files, leading to duplicate memory allocation.
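
The inode rule above can be checked by hand. The following is a minimal sketch, assuming a Linux host with GNU stat; file_identity and same_inode are hypothetical helper names, not part of any runtime. It compares the (device, inode) pair behind two paths. One caveat: an overlay mount may report its own device number, so a mismatch between paths inside two different containers is a hint to investigate, not proof that pages are duplicated. Comparing the backing files in the host's layer directories is more reliable.

```shell
# Hypothetical helpers, assuming GNU stat (coreutils) on Linux.
# Print "device:inode" for a path; -L follows symlinks such as /proc/<pid>/exe.
file_identity() {
    stat -L -c '%d:%i' -- "$1"
}

# Succeed (exit 0) when both paths resolve to the same device and inode,
# fail with 1 when they differ, and with 2 when either path cannot be read.
same_inode() {
    id_a=$(file_identity "$1") || return 2
    id_b=$(file_identity "$2") || return 2
    [ "$id_a" = "$id_b" ]
}

# Usage (the PIDs are placeholders):
#   same_inode /proc/1234/exe /proc/5678/exe && echo "same backing inode"
```

A hard link passes this check and a byte-identical copy does not, which is the distinction between sharing by inode and sharing by content.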

Why This Happens in Real Systems

In complex production environments, “identical” images often aren’t identical at the kernel level due to:

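  • OverlayFS Nuances: A file that a container modifies, even through a metadata change such as chmod or chown, is copied up into that container's writable layer and gets a new inode. From then on the kernel treats it as a different file from the one in the shared layer.
  • Storage Driver Overhead: Different drivers (e.g., devicemapper vs overlay2) expose files differently. Block-level drivers such as devicemapper give each container its own block device, so identical files have distinct inodes and separate page-cache entries. File-level drivers such as overlay2 expose the same lower-layer inode to every container that uses the layer.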
  • Dynamic Linking: If different containers use different versions of shared libraries (.so files) located in different layers, the kernel cannot share the memory for those libraries, even if the main executable is the same.
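
To see which binaries and libraries two processes have in common, their /proc/<pid>/maps files can be compared. This is a rough sketch under stated assumptions: exec_mappings and common_mappings are hypothetical helper names, paths containing spaces are truncated at the first space, and the device field may differ between overlay mounts even when the backing file is the same, so an empty result should be confirmed with PSS.

```shell
# Hypothetical helpers that read saved copies of /proc/<pid>/maps.
# Print one "dev:inode path" line per executable, file-backed mapping.
# Fields in maps: address perms offset dev inode pathname
exec_mappings() {
    awk '$2 ~ /x/ && $5 != 0 { print $4 ":" $5, $6 }' "$1" | sort -u
}

# Print the mappings present in both files (same device, inode, and path).
# Each list is already unique, so any line seen twice occurs in both.
common_mappings() {
    { exec_mappings "$1"; exec_mappings "$2"; } | sort | uniq -d
}

# Usage (the PIDs are placeholders):
#   common_mappings /proc/1234/maps /proc/5678/maps
```

A library that appears under the same path in both processes but is missing from the output has a different inode in each, which matches the dynamic linking case described above.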

Real-World Impact

  • Increased Memory Footprint: Pages we expected to be resident once per node are instead resident once per container, so the memory cost of binaries and libraries scales linearly with the number of containers.
  • Reduced Density: We cannot pack as many microservices onto a single node as theoretically predicted, leading to increased infrastructure costs.
  • OOM Kills: Sudden spikes in container instantiation can trigger the Out-Of-Memory (OOM) Killer, crashing critical services because the “shared” memory assumption failed.
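
The size of the gap is simple arithmetic. The sketch below uses hypothetical function names and illustrative numbers, not measurements from the incident: N containers, each with some shareable memory (binary and library text) and some private memory (heap, stacks), all in MiB.

```shell
# All sizes in MiB. Arguments: N SHAREABLE PRIVATE
# Shareable pages resident once per node, private pages once per container.
footprint_shared() {
    echo $(( $2 + $1 * $3 ))
}

# Sharing fails: every container pays for its own copy of the shareable pages.
footprint_unshared() {
    echo $(( $1 * ($2 + $3) ))
}

footprint_shared 50 80 40     # prints 2080
footprint_unshared 50 80 40   # prints 6000
```

With these made-up numbers, 50 containers need roughly 2 GiB when sharing works and roughly 6 GiB when it does not, which is the kind of difference that turns a comfortable node into an OOM candidate.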

Example or Code

# Check whether two processes map the same executable.
# /proc/<pid>/maps lists each mapping's permissions, device, inode, and file path.
# Set PID to the process you want to inspect, then repeat for the second process.
grep 'r-xp' "/proc/$PID/maps"

# To verify that the kernel is actually sharing pages,
# we use the 'smem' tool, which reports PSS (Proportional Set Size).
# PSS charges each shared page to a process as the page size divided by
# the number of processes mapping it.

smem -k -t

# If PSS is significantly lower than RSS (Resident Set Size), 
# sharing is occurring. If PSS is nearly equal to RSS, 
# the processes are not sharing memory pages.
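
Where smem is not installed, the kernel exposes the same figures per process in /proc/<pid>/smaps_rollup (Linux 4.14 and later). The following is a minimal sketch, with pss_report as a hypothetical helper name. Reading another user's process usually requires root.

```shell
# Print "rss_kb pss_kb pss/rss" for one smaps_rollup file.
# Exact field matches avoid picking up Pss_Anon, Pss_File, and similar lines.
pss_report() {
    awk '$1 == "Rss:" { rss = $2 }
         $1 == "Pss:" { pss = $2 }
         END { if (rss > 0) printf "%d %d %.2f\n", rss, pss, pss / rss }' "$1"
}

# Usage (the PID is a placeholder):
#   pss_report /proc/1234/smaps_rollup
```

A ratio close to 1.00 means the process shares almost nothing, while a low ratio means most of its resident pages are also mapped by other processes.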

How Senior Engineers Fix It

  • Optimize Base Images: Build services on the same minimalist base image (like Alpine or Distroless) so that shared library dependencies are few, standardized, and come from the same layer.
  • Standardize Storage Drivers: Ensure all nodes in a cluster use the same storage driver (e.g., overlay2) to maintain predictable kernel behavior.
  • Monitor PSS, Not Just RSS: We added PSS (Proportional Set Size) alongside RSS (Resident Set Size) in our Prometheus/Grafana dashboards. Summed across processes, RSS overestimates memory usage because it counts every shared page once per process.
  • Layer Alignment: Structure Dockerfiles so that common binaries and libraries reside in the lowest possible layers, maximizing the chance of inode reuse across different images.
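
To illustrate why summed RSS misleads, the totals for a group of processes can be compared side by side. This is a sketch only: fleet_totals is a hypothetical helper name, it reads /proc/<pid>/smaps_rollup files, and it is not how our dashboards are wired. USS (Unique Set Size) is taken here as Private_Clean plus Private_Dirty.

```shell
# Sum Rss, Pss, and USS (Private_Clean + Private_Dirty) in kB across
# one or more smaps_rollup files, one file per process.
# Pass at least one file, otherwise awk waits on standard input.
fleet_totals() {
    awk '$1 == "Rss:"           { rss += $2 }
         $1 == "Pss:"           { pss += $2 }
         $1 == "Private_Clean:" { uss += $2 }
         $1 == "Private_Dirty:" { uss += $2 }
         END { printf "rss=%d pss=%d uss=%d\n", rss, pss, uss }' "$@"
}

# Usage (the process name "app" is a placeholder):
#   fleet_totals $(pgrep -x app | sed 's|.*|/proc/&/smaps_rollup|')
```

The pss total approximates what the group really costs the node. The gap between rss and pss is the double counting, and a pss total that drifts toward the rss total is an early sign that sharing has broken down.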

Why Juniors Miss It

  • The “Container is a VM” Fallacy: Juniors often treat containers like Virtual Machines, assuming they are entirely isolated. They fail to realize that containers share the host kernel and are subject to the host’s memory management rules.
  • Confusing Disk vs. RAM: A junior might see that an image only takes up 100MB on disk and assume 10 containers will only take 100MB of RAM. They miss the fact that RAM consumption depends on which pages each process actually touches and on whether the kernel can share those pages, not on the image size.
  • Ignoring PSS/USS Metrics: Most tutorials focus on top or free, which provide RSS-based or system-wide views. Juniors often overlook Proportional Set Size (PSS), which splits the cost of each shared page across the processes mapping it, and Unique Set Size (USS), which counts only a process's private pages.
