mongodb-atlas-local container becomes unhealthy after ~20 minutes

Summary

The mongodb-atlas-local Docker container becomes unresponsive after approximately 15-20 minutes of normal operation. The mongod process freezes: it stops writing logs, stops responding to connections, and stops running WiredTiger checkpoints. The issue occurs on both local machines and Linux-based self-hosted runners, whether or not any clients are connected to the container.

Root Cause

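The likely root cause is an accumulation of TCP connections in the CLOSE_WAIT state on port 27017, which can cause the mongod process to freeze. A socket stays in CLOSE_WAIT when the remote peer has closed its end but the local process has not yet closed its own, so a growing count points to connections that are never fully torn down. Possible causes include: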

Healthcheck connections not being properly closed
Connection timeouts not being properly handled
Resource leaks causing the container to become unresponsive
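
To check whether CLOSE_WAIT sockets are in fact piling up on port 27017, the count can be sampled over time. The following is a minimal sketch, not a definitive diagnostic: the function name count_close_wait is hypothetical, and it assumes the input is the output of `ss -tan` (from iproute2), which prints the state as CLOSE-WAIT with a hyphen and puts the local address:port in the fourth column.

```shell
# Count sockets in CLOSE-WAIT whose local port matches PORT (default 27017).
# Reads `ss -tan` output on stdin, so it can be fed from any network namespace.
# Hypothetical helper for illustration only.
count_close_wait() {
  port="${1:-27017}"
  awk -v port="$port" '
    # $1 is the socket state, $4 is "local-address:port" in `ss -tan` output
    $1 == "CLOSE-WAIT" && $4 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Example (assumes ss is installed inside the container, which may not be true):
#   docker exec xi-mongodb-atlas-1 ss -tan | count_close_wait 27017
```

Running this every minute or so and watching whether the number climbs steadily before the freeze would help confirm or rule out the hypothesis above.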

Why This Happens in Real Systems

This issue can occur in real systems due to:

Insufficient resource allocation, leading to resource constraints and container unresponsiveness
Inadequate connection management, resulting in accumulated connections and process freezing
Incompatible or outdated dependencies, causing hangs or container crashes

Real-World Impact

The real-world impact of this issue includes:

Downtime and unavailability of the MongoDB service
Possible loss of in-flight writes if the frozen container has to be killed or restarted
Increased latency and decreased performance caused by the accumulation of connections and process freezing

Example or Code

docker exec xi-mongodb-atlas-1 mongosh --eval "db.runCommand({ping: 1})"

This command pings mongod from inside the container. A healthy instance returns a document containing ok: 1; a frozen instance hangs or fails to connect.
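
Because a frozen mongod can make the ping hang indefinitely, it may help to run the probe under a deadline. The sketch below assumes GNU coreutils `timeout` is available on the host. The function name probe_with_timeout and the 5-second limit in the example are illustrative choices, not values from the original report.

```shell
# Run a probe command under a deadline and report the outcome.
# Usage: probe_with_timeout SECONDS COMMAND [ARGS...]
# Prints "responsive" (exit 0) if COMMAND succeeds in time,
# otherwise prints "unresponsive" (exit 1).
probe_with_timeout() {
  secs="$1"
  shift
  if timeout "$secs" "$@" >/dev/null 2>&1; then
    echo "responsive"
  else
    echo "unresponsive"
    return 1
  fi
}

# Example against the container from the command above:
#   probe_with_timeout 5 docker exec xi-mongodb-atlas-1 \
#     mongosh --quiet --eval "db.runCommand({ping: 1})"
```

Note that `timeout` only stops the `docker exec` client on the host. The mongosh process started inside the container may linger after the deadline, so frequent probing of a frozen container could itself leave processes or connections behind.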

How Senior Engineers Fix It

Senior engineers can fix this issue by:

Increasing resource allocation to the container to prevent resource constraints
Implementing proper connection management, including connection timeouts and closure of healthcheck connections
Monitoring container performance and adjusting settings as needed to prevent downtime and unavailability
Updating dependencies to ensure compatibility and prevent crashes
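
Until the underlying cause is confirmed, the monitoring item above could be approximated with a small watchdog that restarts the container after several consecutive failed probes. This is a stopgap sketch under stated assumptions, not a fix: the function name watchdog, the threshold, and the interval are all hypothetical, and a restart hides the problem rather than resolving it.

```shell
# Run PROBE every INTERVAL seconds. After MAX consecutive failures,
# run ACTION once and stop.
# Usage: watchdog MAX INTERVAL PROBE ACTION
# PROBE and ACTION are command strings executed with `sh -c`.
watchdog() {
  max="$1"
  interval="$2"
  probe="$3"
  action="$4"
  failures=0
  while :; do
    if sh -c "$probe" >/dev/null 2>&1; then
      failures=0                      # any success resets the streak
    else
      failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$max" ]; then
      sh -c "$action"
      return
    fi
    sleep "$interval"
  done
}

# Example (illustrative values: 3 failures, 10 seconds apart):
#   watchdog 3 10 \
#     'timeout 5 docker exec xi-mongodb-atlas-1 mongosh --quiet --eval "db.runCommand({ping: 1})"' \
#     'docker restart xi-mongodb-atlas-1'
```

Requiring several consecutive failures avoids restarting on a single slow response, at the cost of a longer detection delay.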

Why Juniors Miss It

Junior engineers may miss this issue due to:

Lack of experience with containerization and Docker
Insufficient understanding of connection management and resource allocation
Inadequate testing and verification of container responsiveness and performance
Failure to monitor container logs and performance metrics, leading to delayed detection of issues