Run expensive function (containing for loop) on multiple GPUs. pmap gives out of memory error

Summary

Running an expensive function containing a Python for loop on multiple GPUs via pmap triggers out-of-memory errors, even though each GPU processes the same per-device workload that runs fine in serial. The errors stem from how JAX traces, compiles, and replicates the function across devices.

Root Cause

  • JAX’s pmap replicates the function and any unmapped (in_axes=None) arguments across GPUs, so each replica allocates its own full copy of the data.
  • The Python for loop inside singledevice_func is unrolled at trace time, producing one large compiled program whose intermediate buffers all need memory at once.
  • Memory fragmentation from repeated allocation and deallocation during parallel execution can push a device over its limit even when total usage looks within budget.
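The unrolling effect is easy to see by inspecting the traced program. A minimal sketch (the `summed` helper below is illustrative, not from the original post) shows the jaxpr growing with the number of loop iterations:

```python
import jax
import jax.numpy as jnp

def summed(xs):
    # Python-level for loop: unrolled during tracing, so the
    # traced program grows with the number of iterations.
    acc = 0.0
    for i in range(xs.shape[0]):
        acc = acc + jnp.sum(xs[i] ** 2)
    return acc

# The jaxpr for 8 iterations contains far more equations than for 2:
short = jax.make_jaxpr(summed)(jnp.ones((2, 10)))
long = jax.make_jaxpr(summed)(jnp.ones((8, 10)))
print(len(short.jaxpr.eqns), len(long.jaxpr.eqns))
```

Every extra iteration adds its own index, square, sum, and add operations to the program, and pmap pays that cost once per device.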

Why This Happens in Real Systems

  • Parallelism overhead: pmap replicates computation and broadcast arguments, increasing per-device memory pressure.
  • Trace-time unrolling: JAX traces the whole function, including every iteration of the Python for loop, into a single XLA program before execution.
  • Lack of memory sharding: inputs and intermediates are not partitioned across GPUs; in_axes=None ships the full inputs array to every device.
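To make the replication concrete, here is a small sketch comparing the two placements. `jax.device_put_replicated` mimics what in_axes=None does to an argument, while reshaping the leading axis corresponds to sharding (the variable names are illustrative):

```python
import jax
import jax.numpy as jnp

N, m = 8, 10
inputs = jnp.arange(N * m, dtype=jnp.float32).reshape(N, m)
devices = jax.devices()
ndev = len(devices)

# in_axes=None behaves like replication: every device stores a full copy,
# so total input memory is ndev * N * m floats.
replicated = jax.device_put_replicated(inputs, devices)

# Sharding the leading axis stores N * m floats in total,
# N // ndev rows per device (assumes N % ndev == 0).
sharded = inputs.reshape(ndev, N // ndev, m)

print(replicated.shape)  # (ndev, N, m)
print(sharded.shape)     # (ndev, N // ndev, m)
```

The replicated layout holds ndev copies of the same rows; the sharded layout holds each row exactly once.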

Real-World Impact

  • Resource exhaustion: GPUs run out of memory, halting execution.
  • Inefficient scaling: Parallelization fails to improve performance due to memory constraints.
  • Increased latency: Errors force fallback to slower serial execution.

Example

import jax
import jax.numpy as jnp
from jax import pmap

N = 8   # total number of rows
m = 10  # row width
inputs = jnp.arange(N * m, dtype=jnp.float32).reshape(N, m)

def expensive_func(inp):
    return jnp.sum(inp ** 2)

def singledevice_func(inds, inputs):
    # Python for loop: every iteration is unrolled into the traced
    # program, so compiled size and peak memory grow with batch_size.
    batch_size = inds.shape[0]
    accum = 0.0
    for i in range(batch_size):
        accum += expensive_func(inputs[inds[i]])
    return accum

singledevice_pmapped = pmap(
    singledevice_func,
    in_axes=(0, None),  # inds is sharded; inputs is replicated on every device
    out_axes=0,
)

ngpu = jax.device_count()
inds_batched = jnp.arange(N).reshape(ngpu, N // ngpu)  # assumes N % ngpu == 0
accum_dev = singledevice_pmapped(inds_batched, inputs)
accum_dev.block_until_ready()
accum_final = jnp.sum(accum_dev)
print(accum_final)

How Senior Engineers Fix It

  • Shard inputs explicitly: pass each device only its slice of the data (in_axes=0) instead of broadcasting the full array with in_axes=None.
  • Avoid Python loops in parallel regions: replace the for loop with vectorized operations or jax.vmap.
  • Keep the compiled program small: roll the loop into jax.lax.fori_loop or jax.lax.scan so the loop body is compiled once instead of unrolled.
  • Manual sharding: for complex cases, distribute work across devices by hand (e.g. with jax.device_put) instead of relying on pmap.
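Putting these fixes together, a sketch of a rewritten version (assuming N is divisible by the device count; names mirror the original code for comparison) shards the data with in_axes=0 and replaces the Python loop with jax.vmap:

```python
import jax
import jax.numpy as jnp
from jax import pmap, vmap

N, m = 8, 10
inputs = jnp.arange(N * m, dtype=jnp.float32).reshape(N, m)

def expensive_func(inp):
    return jnp.sum(inp ** 2)

def singledevice_func(shard):
    # vmap maps one compiled body over the shard instead of unrolling
    # a Python loop, so program size is independent of batch size.
    return jnp.sum(vmap(expensive_func)(shard))

ndev = jax.device_count()
# Shard the data itself instead of broadcasting it with in_axes=None
# (assumes N % ndev == 0).
shards = inputs.reshape(ndev, N // ndev, m)
accum_dev = pmap(singledevice_func)(shards)
accum_final = jnp.sum(accum_dev)
print(accum_final)
```

When even the vmapped shard is too large to process at once, jax.lax.fori_loop over the shard keeps the compiled program size constant while processing one row at a time.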

Why Juniors Miss It

  • Assumption of linear scaling: believing that parallel execution automatically scales memory as well as compute.
  • Overlooking JAX's replication: not accounting for pmap's duplication of broadcast (in_axes=None) arguments on every device.
  • Ignoring loop unrolling: forgetting that JAX unrolls Python loops at trace time, once per replica.
