Summary
This incident documents why a MySQL InnoDB Cluster deployed via Docker Compose failed to form a healthy, production‑grade HA cluster, despite all containers starting successfully. The root cause was not a single misconfiguration but a combination of orchestration, networking, and initialization‑order issues that commonly appear when stateful distributed systems are forced into a stateless container model.
Root Cause
The cluster failed because MySQL InnoDB Cluster requires deterministic startup sequencing, stable hostnames, and consistent state, none of which were guaranteed in the provided Compose setup.
Key root‑cause factors:
- MySQL Shell attempted to bootstrap the cluster before all nodes were fully initialized, even though health checks passed.
- Docker Compose hostnames (`mysql-server-1`, etc.) were not resolvable inside MySQL itself, causing `cluster.addInstance()` to fail intermittently.
- GTID‑based replication requires synchronized metadata, but each container started with a fresh data directory.
- MySQL 8.0.13 is too old for stable InnoDB Cluster behavior, and several bugs in that release affect Group Replication.
- No persistent volumes, meaning nodes lost state on restart and rejoined incorrectly.
- Router started before the cluster was fully formed, causing connection routing failures.
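Several of these factors come down to per-node settings the containers never received. A sketch of what pinning them in Compose could look like, for one node (the values shown are illustrative, not the original configuration):

```yaml
services:
  mysql1:
    image: mysql:8.0
    hostname: mysql1          # stable network identity on the Compose network
    command:
      - --server-id=1                    # must be unique per node
      - --report-host=mysql1             # address this member advertises to the group
      - --gtid-mode=ON                   # GTID prerequisites for Group Replication;
      - --enforce-gtid-consistency=ON    # without them, addInstance() fails
```

Without an explicit `report_host`, each member advertises whatever hostname it happens to see at startup, which is exactly the intermittent-resolution failure described above.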
Why This Happens in Real Systems
Distributed databases are extremely sensitive to startup order, state consistency, and network identity. In containerized environments, these guarantees are often violated.
Common systemic reasons:
- Containers start “fast,” but databases start “slow.”
- Health checks only verify process liveness, not replication readiness.
- DNS inside Docker networks is eventually consistent, not immediate.
- Group Replication requires strict timing and state coordination, which Compose cannot enforce.
- Stateless orchestration tools do not understand stateful consensus protocols.
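The liveness-versus-readiness gap is visible in the Compose healthcheck itself. A sketch of a readiness-oriented check that asks Group Replication whether this member is actually ONLINE, rather than merely pinging the process (the password variable and timings are assumptions; `$$` escapes Compose variable interpolation):

```yaml
services:
  mysql1:
    healthcheck:
      test: >
        mysql -uroot -p"$$MYSQL_ROOT_PASSWORD" -N -B -e
        "SELECT MEMBER_STATE FROM performance_schema.replication_group_members
        WHERE MEMBER_ID = @@server_uuid" | grep -q ONLINE
      interval: 10s
      timeout: 5s
      retries: 12
```

A node can pass `mysqladmin ping` for minutes while still in RECOVERING state; a check like this keeps dependent services waiting until the member is truly part of the group.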
Real-World Impact
When this pattern appears in production, the consequences are severe:
- Cluster never forms, leaving applications stuck in read‑only mode.
- Split‑brain scenarios if nodes start with divergent GTID sets.
- Router sends traffic to non‑primary nodes, causing write failures.
- Data loss when containers restart without persistent storage.
- Operational instability because every restart changes node identity.
Example
Below is a minimal example of how a stable hostname and volume definition should look in Compose:
```yaml
services:
  mysql1:
    image: mysql:8.0
    hostname: mysql1
    volumes:
      - mysql1_data:/var/lib/mysql

volumes:
  mysql1_data:
This is not a full solution—just an illustration of the type of configuration missing from the original setup.
How Senior Engineers Fix It
Experienced engineers avoid running InnoDB Cluster in raw Docker Compose entirely, but if forced to, they apply several critical fixes:
- Pin stable hostnames using `hostname:` and `networks:`.
- Add persistent volumes for all MySQL nodes.
- Use MySQL 8.0.30+, where Group Replication is significantly more stable.
- Replace health checks with readiness checks that verify:
  - `server_id`
  - GTID mode
  - replication state
- Introduce a bootstrap script that waits for:
  - all nodes to be reachable
  - all nodes to have initialized data directories
  - MySQL Shell to connect successfully
- Start Router only after the cluster is fully formed.
- Avoid Compose for production and use Kubernetes StatefulSets or native MySQL InnoDB Cluster deployment tooling.
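The bootstrap-script fix above can be sketched as a small gate script. This is not the original setup's script; the node names `mysql1`..`mysql3`, the root credentials, and the cluster name `prodCluster` are all assumptions for illustration:

```shell
#!/bin/sh
# Bootstrap gate: refuse to create the cluster until every node answers.

# wait_for TRIES DELAY CMD...: retry CMD until it succeeds, up to TRIES times,
# sleeping DELAY seconds between attempts. Returns nonzero if attempts run out.
wait_for() {
  tries="$1"; delay="$2"; shift 2
  until "$@"; do
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] || return 1
    sleep "$delay"
  done
}

# The gate itself only runs when explicitly requested, since it needs live nodes.
if [ "${RUN_BOOTSTRAP:-0}" = "1" ]; then
  for host in mysql1 mysql2 mysql3; do
    wait_for 30 2 mysqladmin ping -h "$host" --silent
  done
  # Only after every node answers does MySQL Shell form the cluster
  # (mysqlsh command-line integration for dba.createCluster()).
  mysqlsh root@mysql1 -- dba create-cluster prodCluster
fi
```

Starting Router would then be sequenced after this script exits successfully, which addresses the "Router started before the cluster was fully formed" failure from the root-cause list.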
Why Juniors Miss It
Less‑experienced engineers often assume that:
- If containers are healthy, the database is ready — which is false.
- Docker Compose can orchestrate stateful distributed systems — it cannot.
- MySQL hostnames inside containers behave like normal DNS — they do not.
- Group Replication “just works” — it requires strict sequencing.
- Restarting containers is harmless — it destroys cluster state without volumes.
The failure is not due to lack of effort but due to the hidden complexity of distributed consensus systems, which only becomes obvious with experience.