
# Production Incident: Worker Timeouts After Migrating Django App From runserver to Gunicorn

## Summary
A Django application migrated from `runserver` to Gunicorn in a Docker production deployment hit intermittent worker timeouts (logged by Gunicorn as `WORKER TIMEOUT`), which left TMDB API imports only partially processed. The development server had shown no such problem. The quick fix was reverting to `runserver`.

## Root Cause
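The timeouts came from a combination of factors:
- Long-running synchronous requests exceeded Gunicorn's **default 30-second timeout**
- TMDB API operations blocked the worker for the full duration of each call (Gunicorn's default `sync` worker is a process that handles one request at a time)
- Gunicorn was started with no explicit timeout configuration
- The development server (`runserver`) enforces no request timeout, so the problem never surfaced locally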
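
To make the missing configuration concrete, the settings Gunicorn falls back to can be written out as a Python config file. This is a sketch, not the project's actual configuration: the file name `gunicorn.conf.py` is an assumption (the deployment passed flags on the command line), `bind` and `workers` mirror the container's start command, and the remaining values are Gunicorn's documented defaults.

```python
# gunicorn.conf.py (hypothetical file name; the incident used CLI flags only)

bind = "0.0.0.0:8000"    # taken from the container's start command
workers = 3              # taken from the container's start command

# The values below are what Gunicorn uses when nothing is specified.
worker_class = "sync"    # one request per worker process at a time
timeout = 30             # a worker silent for this many seconds is killed and restarted
graceful_timeout = 30    # seconds a worker gets to finish after a restart signal
keepalive = 2            # seconds to wait for the next request on a keep-alive connection
```

Gunicorn would load such a file with `gunicorn -c gunicorn.conf.py localmovies.wsgi:application`. Changing `timeout` here has the same effect as passing `--timeout` on the command line.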

## Why This Happens in Real Systems
- Production servers enforce worker timeouts to prevent resource starvation
- Operations that depend on external services are exposed to network latency
- Synchronous API calls and unoptimized database queries prolong request cycles
- Local and test environments rarely simulate production traffic and data volumes
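
One common way to limit the exposure described in the second point is to put an explicit timeout on every outbound call, so a slow upstream fails fast instead of holding a worker until Gunicorn kills it. The following is a minimal sketch using only the standard library. The function name `fetch_with_timeout`, its 10-second default, and the injectable `opener` parameter are illustrative assumptions, not part of the application or of any TMDB client.

```python
import urllib.request


def fetch_with_timeout(url, timeout_s=10.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the call timed out or failed.

    `timeout_s` bounds each blocking socket operation (connect, read),
    not the total transfer time.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning `None` keeps the sketch short. A real import job would more likely log the failure and retry or skip the record, so that partial results are visible rather than silent.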

## Real-World Impact
- **Data corruption**: Incomplete API processing → partial database records  
- **Reduced availability**: Workers killed → degraded capacity → client errors  
- **Operational overhead**: Manual recovery of failed imports required  

## Example or Code
Gunicorn configuration without timeout safeguards:
```dockerfile
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "3", "localmovies.wsgi:application"]
```

Blocking API operation pattern:
```python
def update_movies():
    movies = tmdb_api.fetch_all()  # Synchronous call taking >30s
    for movie in movies:
        Movie.objects.update_or_create(...)  # Expensive DB ops
```

## How Senior Engineers Fix It

1. Configure appropriate timeouts: `gunicorn --timeout 120 ...`, with a value that exceeds the worst-case request duration.
2. Use asynchronous processing, so the long-running import runs in a background task instead of inside a request:
   ```python
   from celery import shared_task

   @shared_task
   def update_movies_async():
       movies = tmdb_api.fetch_all()
       ...
   ```
3. Optimize imports:
   - Paginated API fetching
   - Batch database operations with `bulk_create`
4. Tune the related worker settings together rather than the timeout alone:
   ```bash
   gunicorn --workers=4 \
            --timeout=300 \
            --keep-alive=15 \
            --graceful-timeout=90 \
            ...
   ```
5. Staging validation: smoke test with production-scale data before deployment.

## Why Juniors Miss It

- Assuming production will behave the way the dev server does
- Unfamiliarity with the default configuration of production servers
- Underestimating how long network-bound operations can take
- Reading logs for application errors while overlooking infrastructure limits
- Lack of performance benchmarking for data-heavy operations