Summary
Running an expensive function in a for loop across multiple GPUs with `pmap` results in out-of-memory errors, even though each GPU handles the same workload it would in serial execution. The issue arises from JAX’s memory allocation behavior during parallel compilation and execution.
Root Cause
- JAX’s `pmap` replicates the function across GPUs, leading to redundant memory allocation for each replica.
- The Python for loop inside `singledevice_func` is unrolled at trace time, so JAX compiles one copy of the loop body per iteration for every GPU, multiplying memory usage.
- Memory fragmentation occurs due to inefficient allocation and deallocation during parallel execution.
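The unrolling behavior can be inspected directly. The sketch below (with an illustrative `looped` function and arbitrary shapes, not the original code) uses `jax.make_jaxpr` to show that the traced program grows with the loop length:

```python
import jax
import jax.numpy as jnp

def looped(xs):
    # Same pattern as singledevice_func: a Python loop over rows of xs
    accum = 0.0
    for i in range(xs.shape[0]):
        accum += jnp.sum(xs[i] ** 2)
    return accum

# Trace the function at two loop lengths and compare program sizes
short = jax.make_jaxpr(looped)(jnp.ones((2, 10)))
long_ = jax.make_jaxpr(looped)(jnp.ones((16, 10)))
print(len(short.jaxpr.eqns), len(long_.jaxpr.eqns))
```

Every extra iteration adds its own slice, square, sum, and add to the compiled program, and `pmap` then materializes that enlarged program once per device.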
Why This Happens in Real Systems
- Parallelism overhead: `pmap` replicates computations, increasing memory pressure.
- Eager compilation: JAX compiles the entire function for all GPUs upfront, including the for loop.
- Lack of memory sharding: Inputs and intermediates are not explicitly partitioned across GPUs.
Real-World Impact
- Resource exhaustion: GPUs run out of memory, halting execution.
- Inefficient scaling: Parallelization fails to improve performance due to memory constraints.
- Increased latency: Errors force fallback to slower serial execution.
Example
```python
import jax
import jax.numpy as jnp
from jax import pmap

N = 8
m = 10
inputs = jnp.arange(N * m, dtype=jnp.float32).reshape(N, m)

def expensive_func(inp):
    return jnp.sum(inp ** 2)

def singledevice_func(inds, inputs):
    batch_size = inds.shape[0]
    accum = 0.0
    # Python for loop: unrolled at trace time, one copy of the body per iteration
    for i in range(batch_size):
        val = expensive_func(inputs[inds[i]])
        accum += val
    return accum

singledevice_pmapped = pmap(
    singledevice_func,
    in_axes=(0, None),  # shard inds across devices; broadcast inputs to all
    out_axes=0,
)

ngpu = jax.device_count()  # e.g. 4
inds_batched = jnp.arange(N).reshape(ngpu, N // ngpu)
accum_dev = singledevice_pmapped(inds_batched, inputs)
accum_dev.block_until_ready()
accum_final = jnp.sum(accum_dev)
print(accum_final)
```
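When diagnosing the OOM itself, JAX's documented allocator flags help separate true exhaustion from preallocation: by default, JAX grabs most of each GPU's memory up front. A sketch (the values are illustrative, not recommendations):

```python
# Debugging aid (not a fix for the root cause): change how much GPU memory
# JAX's allocator reserves, so the real OOM point is easier to observe.
# These flags must be set before jax is imported.
import os

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate on demand
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.5"   # cap at ~50% per process

import jax  # noqa: E402  (import after setting the flags)
print(jax.device_count())
```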
How Senior Engineers Fix It
- Shard inputs explicitly: Use `jax.tree_map` to partition `inputs` across GPUs.
- Avoid loops in parallel regions: Replace the for loop with vectorized operations or `jax.vmap`.
- Reduce memory footprint: Use `jit` with `static_argnums` to compile the loop body only once.
- Manual sharding: Distribute work manually instead of relying on `pmap` for complex cases.
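A minimal sketch of the "replace the loop with `jax.vmap`" fix, reusing the shapes from the example above (the lambda wrapper is an illustrative choice, not part of the original code):

```python
import jax
import jax.numpy as jnp
from jax import pmap, vmap

N, m = 8, 10
inputs = jnp.arange(N * m, dtype=jnp.float32).reshape(N, m)

def expensive_func(inp):
    return jnp.sum(inp ** 2)

def singledevice_func(inds, inputs):
    # vmap vectorizes expensive_func over the per-device index batch,
    # so no Python loop is unrolled into the compiled program
    per_example = vmap(lambda i: expensive_func(inputs[i]))(inds)
    return jnp.sum(per_example)

ndev = jax.device_count()
inds_batched = jnp.arange(N).reshape(ndev, N // ndev)
accum_dev = pmap(singledevice_func, in_axes=(0, None))(inds_batched, inputs)
print(jnp.sum(accum_dev))  # matches the looped version's result
```

The traced program now contains one batched kernel instead of `batch_size` copies of the loop body, so per-device compile-time memory no longer grows with the batch.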
Why Juniors Miss It
- Assumption of linear scaling: Believing parallel execution directly translates to memory efficiency.
- Overlooking JAX’s replication: Not accounting for `pmap`’s memory duplication across devices.
- Ignoring loop compilation: Forgetting that JAX compiles loops for all GPUs simultaneously.
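The loop-compilation point is easy to verify: `jax.lax.fori_loop` traces its body exactly once, so the compiled program stays the same size no matter how many iterations run. A sketch under the earlier example's shapes (`singledevice_func_fori` is a hypothetical rewrite of `singledevice_func`):

```python
import jax
import jax.numpy as jnp

def singledevice_func_fori(inds, inputs):
    # fori_loop compiles the body once and iterates it on-device,
    # instead of unrolling one copy of the body per iteration
    def body(i, accum):
        return accum + jnp.sum(inputs[inds[i]] ** 2)
    return jax.lax.fori_loop(0, inds.shape[0], body, jnp.float32(0.0))

inputs = jnp.arange(80, dtype=jnp.float32).reshape(8, 10)
print(singledevice_func_fori(jnp.arange(8), inputs))  # → 167480.0 (sum of squares of 0..79)
```

The trade-off is that `fori_loop` runs sequentially on each device; `vmap` is preferable when the iterations are independent, as they are here.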