Converting unchunked HDF5 to OME-Zarr with Dask

Summary

This postmortem analyzes why Dask workers show steadily increasing memory usage during an HDF5 → OME‑Zarr conversion when reading large, unchunked HDF5 slabs and writing many small Zarr chunks. Although the early tasks appear lightweight, later tasks accumulate hidden memory pressure from Python object retention, task graph expansion, filesystem latency, and Dask’s own scheduling heuristics.

Root Cause

The core issue is the mismatch between large, contiguous HDF5 reads and highly fragmented Zarr writes, combined with Dask’s memory‑intensive task orchestration.

Key contributors:

  • Large intermediate arrays created by reading (64, 704, 2008) blocks (~360–400 MB per block)
  • Multiple in‑memory copies of each block:
    • HDF5 read buffer
    • NumPy array materialization
    • Zarr compressor input buffer
    • Zarr chunk assembly buffer
  • Python reference retention inside delayed tasks and futures
  • Scheduler bookkeeping growth as thousands of tasks complete
  • Backpressure and spill-to-disk behavior triggered late in the workflow
  • HDD latency, causing workers to hold more data in memory while waiting on I/O

Why This Happens in Real Systems

This pattern is extremely common in HPC and distributed data pipelines:

  • Unchunked HDF5 forces large, contiguous reads that materialize huge arrays.
  • Chunked Zarr forces many small writes, each requiring compression and metadata updates.
  • Dask workers accumulate state as the task graph progresses.
  • Memory fragmentation grows over time due to Python’s allocator.
  • I/O bottlenecks cause workers to retain data longer than expected.

In short: the system becomes increasingly memory‑bound as the pipeline progresses, even though each individual task appears identical.

Real-World Impact

Systems with this pattern typically experience:

  • Worker memory spikes late in the job
  • Worker pauses or restarts
  • Scheduler slowdown as the task graph grows
  • Throughput collapse due to HDD I/O stalls
  • Unpredictable memory overhead, making capacity planning difficult

Example

Below is a minimal example showing how large HDF5 reads and many Zarr writes create multiple in‑memory copies:

import h5py  # `path`, `zarr_group`, and the slice bounds are assumed defined elsewhere

with h5py.File(path, "r") as f:
    block = f["data"][z0:z1, y0:y1, x0:x1]  # HDF5 read buffer + NumPy materialization
zarr_group["0"][z0:z1, y0:y1, x0:x1] = block  # compressor input + chunk assembly buffers

Even this simple snippet produces 3–5 transient copies of the data.

How Senior Engineers Fix It

Experienced engineers use a combination of I/O‑aware chunking, task‑graph simplification, and memory‑bounded execution:

1. Align read blocks with write chunks

  • Read blocks that match or tile cleanly into Zarr chunks
  • Avoid reading huge slabs that explode into hundreds of Zarr writes
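
The alignment idea can be sketched with a hypothetical helper; the 2008-long axis and 128-element chunks are illustrative assumptions, not values from the job above:

```python
# Hypothetical helper: yield read slabs along one axis whose boundaries
# fall on the Zarr chunk grid, so no output chunk is touched by two
# different read tasks (which would force read-modify-write cycles).
def aligned_blocks(length, chunk, chunks_per_read=4):
    step = chunk * chunks_per_read  # each read covers whole chunks
    for start in range(0, length, step):
        yield start, min(start + step, length)

# Illustrative sizes: a 2008-long axis with 128-element Zarr chunks.
bounds = list(aligned_blocks(2008, 128, chunks_per_read=4))
# Every interior boundary lands on a multiple of the chunk size;
# only the final ragged slab is shorter.
```

The same tiling logic extends to all three axes; the key property is that each read block decomposes exactly into whole output chunks.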

2. Reduce the number of tasks

  • Use Dask array instead of thousands of delayed tasks
  • Let Dask handle chunk‑to‑chunk mapping automatically
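
A minimal sketch of the dask.array approach, using an in-memory array as a stand-in for the HDF5 dataset (with a real file you would pass the h5py dataset object to da.from_array); the (64, 176, 251) chunk shape is an illustrative assumption:

```python
import numpy as np
import dask.array as da

# Stand-in for the unchunked HDF5 dataset from the job above.
data = np.zeros((64, 704, 2008), dtype="uint16")

# Chunk the Dask array to match the intended Zarr chunks so each task
# maps one read block onto exactly one output chunk, and Dask tracks
# the chunk-to-chunk mapping instead of hand-built delayed tasks.
arr = da.from_array(data, chunks=(64, 176, 251))
# arr.to_zarr("out.zarr", component="0")  # would write one task per chunk
```

Because the chunk grid divides the shape evenly here, the graph contains exactly one task per output chunk and no rechunking step.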

3. Limit worker concurrency

  • Run each worker with --nthreads=1 and reduce the total number of workers
  • Ensure each worker has enough RAM for worst‑case block inflation
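
For a single-machine run, the same configuration can be expressed with LocalCluster; the worker count and memory limit below are illustrative assumptions, not measured requirements:

```python
from dask.distributed import Client, LocalCluster

# Sketch: fewer single-threaded workers, each with an explicit memory
# limit sized for worst-case block inflation.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,  # mirrors --nthreads=1 on the worker CLI
    memory_limit="4GiB",   # headroom for read buffer + transient copies
)
client = Client(cluster)
```

Sizing rule of thumb from the root-cause list: budget roughly 3-5x the read-block size per worker before setting memory_limit.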

4. Tune distributed.worker.memory.target and distributed.worker.memory.spill

  • Force Dask to spill early instead of late
  • Prevent runaway memory growth
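
A sketch of the configuration, using Dask's worker memory settings; the fractions below are illustrative choices (Dask's defaults are 0.6 target / 0.7 spill / 0.8 pause):

```python
import dask

# Lower the thresholds (fractions of the worker memory limit) so
# workers start spilling well before they approach the limit.
dask.config.set({
    "distributed.worker.memory.target": 0.4,  # start moving data to disk
    "distributed.worker.memory.spill": 0.5,   # spill aggressively
    "distributed.worker.memory.pause": 0.8,   # stop accepting new tasks
})
```

Set this before the cluster starts so every worker picks up the thresholds.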

5. Avoid HDD for intermediate writes

  • Use SSD or RAM‑disk for temporary Zarr stores
  • HDD latency amplifies memory retention

6. Pre‑chunk the HDF5 file

  • Convert HDF5 to chunked HDF5 first
  • Then convert chunked HDF5 → Zarr efficiently
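
The pre-chunking pass can be done slab by slab so only one slab is ever resident; the file names, tiny demo shape, and (8, 32, 32) chunk shape below are illustrative assumptions chosen to match downstream Zarr chunks:

```python
import numpy as np
import h5py

# Build a small stand-in for the unchunked source file.
with h5py.File("raw.h5", "w") as f:
    f.create_dataset("data", data=np.zeros((64, 32, 32), dtype="uint16"))

# Rewrite into a chunked, compressed copy, one slab at a time.
with h5py.File("raw.h5", "r") as src, h5py.File("chunked.h5", "w") as dst:
    data = src["data"]
    out = dst.create_dataset(
        "data", shape=data.shape, dtype=data.dtype,
        chunks=(8, 32, 32), compression="gzip",
    )
    for z0 in range(0, data.shape[0], 8):
        out[z0:z0 + 8] = data[z0:z0 + 8]  # one slab resident at a time
```

For existing files, the h5repack command-line tool performs the same rewrite without custom code.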

7. Use zarr.copy_all or dask.array.from_array

  • These paths avoid Python‑level overhead and reduce memory copies

Why Juniors Miss It

Less experienced engineers often assume:

  • Each task uses the same memory, so memory should stay constant
    (false: fragmentation and retention accumulate)
  • Dask frees memory immediately after each task
    (false: references persist until futures resolve)
  • I/O latency doesn’t affect memory
    (false: slow writes cause buffers to pile up)
  • HDF5 reads are cheap
    (false: unchunked reads create massive temporary arrays)
  • Zarr writes are constant‑cost
    (false: compression and chunk assembly vary with data patterns)

The result is a system that looks simple on paper but becomes progressively more memory‑intensive as the job runs.
