Summary
A wrong diffID on extraction during docker load almost always indicates bit‑level corruption of a layer tarstream during transfer or readback. In this case, the failure on a single large (~6 GB) CUDA/PyTorch layer strongly suggests silent corruption in transit or on disk, not a Docker bug and not an architecture mismatch between RTX 5070 and RTX 3090.
Root Cause
The diffID mismatch occurs when Docker computes the SHA256 of the extracted layer contents and the result does not match the digest recorded in the image configuration (the rootfs.diff_ids list). This happens when:
- The tar file contains corrupted bytes (most common)
- The filesystem returns corrupted reads during extraction
- The storage driver writes corrupted data into the layer directory
- The transfer medium (Google Drive, USB, network) altered the file
- Hardware issues (bad RAM, failing SSD, unstable NVMe controller)
The key point: Docker is not “guessing”; it recomputes the hash of the extracted layer and detects the mismatch against the recorded diffID.
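As a sketch of what Docker does under the hood: for an uncompressed layer, the diffID is simply the SHA-256 of the layer tarstream, which can be reproduced by hand. The paths below are a throwaway demo layout, not a real image:

```shell
#!/bin/sh
set -e
# Build a throwaway one-file layer tar (demo paths, not a real image layout).
mkdir -p /tmp/diffid-demo/root
echo "hello" > /tmp/diffid-demo/root/file.txt
tar -C /tmp/diffid-demo/root -cf /tmp/diffid-demo/layer.tar file.txt

# For uncompressed layers, the diffID is the SHA-256 of this exact tarstream.
# Docker recomputes this digest on load and compares it with the value stored
# in the image config; a single flipped bit anywhere in the stream changes it.
sha256sum /tmp/diffid-demo/layer.tar
```

Because the digest covers every byte of the stream, this check cannot distinguish where the corruption happened, only that it happened.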
Why This Happens in Real Systems
Large CUDA/PyTorch layers stress the system in ways small images do not:
- Large layers amplify bit‑flip probability during transfer or storage.
- Cloud sync tools (Google Drive, Dropbox) sometimes perform partial retries or chunk merges that silently corrupt multi‑GB files.
- NVMe drives under thermal throttling can return inconsistent reads.
- RAM instability appears only under heavy decompression workloads.
- Overlay2 extraction is CPU‑ and IO‑intensive, exposing borderline hardware issues.
In practice, 6–10 GB layers are the most common size where corruption becomes visible.
Real-World Impact
When this happens, engineers typically see:
- Deterministic failure on the same layer every time
- Successful loads on other machines, proving the tar is not universally broken
- Different final image sizes when building on different hosts, due to:
  - Different base image versions pulled at build time
  - Different apt repository states
  - Different CUDA/PyTorch wheel caching behavior
  - Divergent filesystem compression/dedup behavior
These size differences are symptoms of nondeterministic builds, not the cause of the diffID error.
Example
A minimal integrity check that senior engineers run before blaming Docker:
sha256sum my-saved-image.tar
Run this on both machines and compare. If the checksums differ, the tar was corrupted in transit.
To verify internal layer integrity:
tar -tvf my-saved-image.tar > /dev/null
If this fails or hangs, the tar is corrupted.
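When the whole-file checksum matches but the load still fails, it helps to hash layers individually. The loop below is a minimal sketch assuming the classic docker save layout (one layer.tar per layer directory); it builds a tiny stand-in layout so it runs as-is, and in practice IMG_DIR would point at a real extracted tarball:

```shell
#!/bin/sh
set -e
# Stand-in for an extracted "docker save" tarball (assumed classic layout:
# one <layer-id>/layer.tar per layer). Point IMG_DIR at a real extraction.
IMG_DIR=/tmp/img-demo
mkdir -p "$IMG_DIR/aaa" "$IMG_DIR/bbb"
echo "layer one" > "$IMG_DIR/aaa/layer.tar"
echo "layer two" > "$IMG_DIR/bbb/layer.tar"

# Hash every layer; comparing each digest against the rootfs.diff_ids list in
# the image config JSON pinpoints exactly which layer is corrupted.
for layer in "$IMG_DIR"/*/layer.tar; do
  sha256sum "$layer"
done
```

This localizes the damage to one layer instead of one 6 GB blob, which is usually enough to decide between re-transferring and rebuilding.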
How Senior Engineers Fix It
Senior engineers approach this systematically:
- Verify the tar integrity with sha256sum before and after transfer.
- Avoid cloud sync tools for multi‑GB Docker images.
- Use rsync with checksums (rsync -avP --checksum) for transfers.
- Test the filesystem using fsck or vendor NVMe diagnostics.
- Test RAM with memtest86 when corruption appears deterministic.
- Load from a different disk (e.g., external SSD) to isolate IO issues.
- Rebuild the image deterministically:
  - Pin apt repositories
  - Pin CUDA/PyTorch wheel versions
  - Avoid apt-get upgrade
  - Use multi‑stage builds to reduce layer size
If the image loads cleanly on other machines but not on the 3090 host, the most likely cause is hardware or filesystem instability on that host.
Why Juniors Miss It
Juniors often assume:
- “Docker is broken” instead of suspecting the transfer medium.
- “The GPU architecture difference matters” (it does not for image loading).
- “If the tar extracts with tar, it must be fine” (Docker recomputes diffIDs on extracted content, not on the tarstream).
- “Google Drive is reliable for large binary blobs” (it is not).
- “If the error is deterministic, it must be a software bug” (deterministic corruption is a hallmark of bad sectors or bad RAM).
They rarely consider silent data corruption, which is exactly what diffID mismatches are designed to detect.