# Production Incident: Worker Timeouts After Migrating Django App From runserver to Gunicorn
## Summary
A Django application migrated from `runserver` to Gunicorn in a Dockerized production environment experienced intermittent worker timeouts (`WORKER TIMEOUT` in the Gunicorn log), which left TMDB API imports incomplete. The development server had shown no such problem, and the quick fix was reverting to `runserver`.
## Root Cause
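The timeouts resulted from a combination of factors:
- Long-running synchronous requests exceeding Gunicorn's **default 30-second timeout**
- TMDB API operations blocking Gunicorn's synchronous worker processes (the default `sync` worker handles one request at a time)
- No explicit timeout configuration in Gunicorn
- The development server (`runserver`) enforcing no request timeout, which hid the problem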
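One way to confirm which operations approach the 30-second limit is to time them against that budget. The decorator below is a minimal sketch under stated assumptions: the name `warn_if_over_budget`, the logger name, and the 80% warning threshold are all hypothetical and not part of the application. It warns at a fraction of the budget because a request that Gunicorn kills may never reach its own log line.

```python
import functools
import logging
import time

# Hypothetical logger name, chosen for this sketch.
logger = logging.getLogger("request_budget")

# Gunicorn's default worker timeout, in seconds.
GUNICORN_DEFAULT_TIMEOUT = 30.0


def warn_if_over_budget(budget_seconds=GUNICORN_DEFAULT_TIMEOUT, fraction=0.8):
    """Log a warning when the wrapped call uses at least `fraction` of the budget."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on normal return and on exceptions alike.
                elapsed = time.monotonic() - start
                if elapsed >= budget_seconds * fraction:
                    logger.warning(
                        "%s took %.2fs (budget %.0fs)",
                        func.__name__, elapsed, budget_seconds,
                    )

        return wrapper

    return decorator
```

Applied to a function such as `update_movies`, this would surface slow runs in the application log before they turn into `WORKER TIMEOUT` entries.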
## Why This Happens in Real Systems
- Production servers enforce worker timeouts to prevent resource starvation
- Externally-dependent operations are vulnerable to network latency
- Synchronous API calls and unoptimized database queries prolong request cycles
- Local/test environments rarely simulate production traffic volumes
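To make the first point concrete: when Gunicorn is started with only `--bind` and `--workers`, the remaining limits fall back to defaults. The sketch below writes those defaults out as a `gunicorn.conf.py` file (Gunicorn config files are plain Python). The bind address and worker count come from the `CMD` in this write-up; the file itself is hypothetical and not the project's actual configuration.

```python
# gunicorn.conf.py (hypothetical file; spells out the defaults Gunicorn applies)

bind = "0.0.0.0:8000"   # taken from the CMD used in this incident
workers = 3             # taken from the CMD used in this incident
worker_class = "sync"   # Gunicorn default: each worker handles one request at a time
timeout = 30            # Gunicorn default: workers silent longer than this are killed and restarted
graceful_timeout = 30   # Gunicorn default: time a worker gets to finish requests after a restart signal
```

A file like this can be passed with `gunicorn -c gunicorn.conf.py localmovies.wsgi:application`. Raising `timeout` in the file has the same effect as the `--timeout` flag.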
## Real-World Impact
- **Data corruption**: Incomplete API processing → partial database records
- **Reduced availability**: Workers killed → degraded capacity → client errors
- **Operational overhead**: Manual recovery of failed imports required
## Example or Code
Gunicorn configuration without timeout safeguards:
```dockerfile
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "3", "localmovies.wsgi:application"]
```
Blocking API operation pattern:
```python
def update_movies():
    movies = tmdb_api.fetch_all()  # Synchronous call taking >30s
    for movie in movies:
        Movie.objects.update_or_create(...)  # Expensive DB ops
```
## How Senior Engineers Fix It
- Configure appropriate timeouts (set a value exceeding the worst-case request):
  ```bash
  gunicorn --timeout 120 ...
  ```
- Use asynchronous processing:
  ```python
  from celery import shared_task

  @shared_task
  def update_movies_async():
      movies = tmdb_api.fetch_all()
      ...
  ```
- Optimize imports:
  - Paginated API fetching
  - Batch database operations with `bulk_create`
- Tune worker settings to the workload:
  ```bash
  gunicorn --workers=4 \
    --timeout=300 \
    --keep-alive=15 \
    --graceful-timeout=90 \
    ...
  ```
- Staging validation: smoke test with production-scale data before deployment
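The "Optimize imports" item can be sketched in plain Python. This is an illustration under stated assumptions, not the application's code: `fetch_page` and `save_batch` are hypothetical callables standing in for the TMDB client and the database write, and the sketch assumes the API signals the end of the data with an empty page.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[dict]]) -> Iterator[dict]:
    """Yield records one page at a time until an empty page is returned."""
    page = 1
    while True:
        records = fetch_page(page)
        if not records:  # assumption: an empty page marks the end
            return
        yield from records
        page += 1


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_movies(fetch_page, save_batch, batch_size: int = 500) -> int:
    """Stream pages from the API and hand fixed-size batches to `save_batch`."""
    total = 0
    for batch in batched(iter_pages(fetch_page), batch_size):
        save_batch(batch)  # one database round trip per batch, not per record
        total += len(batch)
    return total
```

In a Django project, `save_batch` might wrap `Movie.objects.bulk_create([Movie(**row) for row in rows])`. Note that `bulk_create` inserts rows rather than upserting them, so it is not a drop-in replacement for `update_or_create`.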
## Why Juniors Miss It
- Assuming dev-server behavior carries over to production
- Unfamiliarity with production server default configurations
- Underestimation of network-bound operations
- Debugging focused on application errors rather than infrastructure limits
- Lack of performance benchmarking for data-heavy operations