Summary
A long‑running Dagster deployment using DockerRunLauncher began failing with ConnectionRefusedError when attempting to talk to the host Docker daemon through /var/run/docker.sock. The host daemon had been restarted, but the Dagster container continued using a stale bind‑mounted socket inode, causing all Docker API calls to fail until the container itself was restarted.
Root Cause
The root cause is a stale bind mount of /var/run/docker.sock inside the Dagster daemon container.
When the host Docker daemon restarts, it recreates the Unix socket file.
Containers do not automatically refresh bind mounts, so the container continues referencing the old inode, which no longer corresponds to a running daemon.
This leads to:
- A visible socket file inside the container, but with an old timestamp and stale ownership
- docker.from_env() failing with ConnectionRefusedError
- Dagster’s DockerRunLauncher being unable to create new containers
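The "socket file exists but nothing is listening" state is easy to reproduce with plain Unix sockets. A minimal simulation, using a throwaway temp path in place of /var/run/docker.sock (the behavior is the same):

```python
import os
import socket
import tempfile

# Simulate a socket file with no daemon behind it. The temp path stands
# in for /var/run/docker.sock.
sock_path = os.path.join(tempfile.mkdtemp(), "docker.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)   # bind() creates the socket file on disk
server.close()           # the "daemon" exits, but the file remains

print(os.path.exists(sock_path))  # True: the file still looks fine

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    client.connect(sock_path)
except ConnectionRefusedError:
    print("ConnectionRefusedError: file exists, nothing listening")
finally:
    client.close()
```

This is exactly what the container sees: a perfectly ordinary-looking socket file that refuses every connection.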
Why This Happens in Real Systems
This is a classic Unix filesystem behavior, not a Dagster bug.
Key reasons:
- Bind mounts map inodes, not paths. When the host replaces /var/run/docker.sock, the container still points to the old inode.
- Docker daemon restarts recreate the socket. The socket file is ephemeral and replaced on daemon startup.
- Containers do not re-mount volumes automatically. Docker Compose does not monitor host file changes.
- Long‑running containers accumulate stale mounts, especially when the host daemon restarts for upgrades or crashes.
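The inode point can be demonstrated with an ordinary file: an open file descriptor keeps referencing the old inode even after the path is replaced, which is the same mechanism that leaves a bind-mounted socket stale. A sketch using throwaway temp files:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "docker.sock")

with open(path, "w") as f:
    f.write("old daemon")

held = open(path)  # analogous to the container's bind mount: holds the inode
old_inode = os.fstat(held.fileno()).st_ino

os.unlink(path)                 # "daemon restart": the path is removed...
with open(path, "w") as f:      # ...and recreated at the same location
    f.write("new daemon")

new_inode = os.stat(path).st_ino
content = held.read()

print(old_inode != new_inode)   # True: the path now names a different inode
print(content)                  # "old daemon": the held fd sees the old inode
held.close()
```

Replace the held file descriptor with a bind mount and the replaced file with the Docker socket, and this is the outage in miniature.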
Real-World Impact
A stale socket breaks all container‑orchestrated workloads that rely on the Docker API.
Common symptoms:
- Dagster run queue stalls. No new runs can be launched.
- Health checks fail. Any healthcheck using docker.from_env() begins failing.
- Silent partial outages. The container appears healthy but cannot perform its core function.
- Operational confusion. Host tools (docker ps, docker version) work fine, misleading operators.
Example
A minimal Python example showing the failure mode:
import docker
client = docker.from_env()
client.ping() # Raises ConnectionRefusedError when socket inode is stale
How Senior Engineers Fix It
Experienced engineers treat /var/run/docker.sock as an ephemeral resource and design around its volatility.
Typical fixes:
- Restart dependent containers whenever the Docker daemon restarts. Use a systemd unit, cron job, or monitoring hook to restart Dagster automatically.
- Use Docker’s built‑in event stream to detect daemon restarts and trigger container restarts.
- Run Dagster on the host or inside the Docker daemon’s own namespace, avoiding bind‑mounting the socket entirely.
- Switch to Kubernetes or ECS launchers, which avoid the direct Docker socket dependency.
- Use a sidecar that proxies Docker API calls. The sidecar reconnects automatically; the main container talks to the proxy.
Most production setups choose the simplest option: restart the Dagster daemon container whenever the host Docker daemon restarts.
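That restart can be automated with a small systemd unit. The following is a sketch, assuming the container is named dagster_daemon, Docker runs as docker.service, and the unit name itself is hypothetical; adapt all three to your setup:

```ini
# /etc/systemd/system/dagster-restart.service (hypothetical unit name)
[Unit]
Description=Restart the Dagster daemon container after Docker restarts
# PartOf propagates docker.service restarts to this unit, so each daemon
# restart re-runs the ExecStart below.
PartOf=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/bin/docker restart dagster_daemon

[Install]
WantedBy=docker.service
```

Enable it once with `systemctl enable dagster-restart.service`; from then on, every Docker daemon restart re-mounts a fresh socket into the Dagster container.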
Why Juniors Miss It
This issue is subtle because:
- The socket file still exists inside the container, so it looks correct at first glance.
- Host‑side Docker commands work perfectly, hiding the real problem.
- Bind mounts are assumed to be “live,” but they are actually static inode mappings.
- Dagster’s error message points to Docker, not the underlying filesystem behavior.
- The failure appears only after days or weeks, making it hard to correlate with a daemon restart.
A junior engineer often checks permissions, groups, or Dagster configuration, while a senior engineer immediately suspects inode replacement and stale mounts.