Why does CPU usage suddenly reach 100% on an AWS EC2 instance running a Next.js application?

Summary

This incident describes a sudden and sustained 100% CPU spike on an AWS t3.micro instance running a Next.js application. After the spike, the Node.js process is killed, often without meaningful logs. This pattern strongly suggests resource exhaustion, runaway background tasks, or event‑loop blocking triggered by recent code changes.

Root Cause

The most common root causes for this exact behavior on small EC2 instances include:

  • CPU credit depletion on burstable T‑series instances (t3.micro has extremely limited credits)
  • Unbounded background loops (cron-like logic, polling, while-loops, retry storms)
  • Event-loop blocking caused by:
    • heavy JSON parsing
    • large in-memory operations
    • synchronous filesystem or crypto calls
  • Socket.io or WebRTC (Agora) tasks running even without active users
  • Memory pressure causing the kernel OOM killer to terminate Node.js
  • Leaked intervals/timeouts created after recent code changes
  • Unbounded Sentry instrumentation or excessive logging

Why This Happens in Real Systems

These failures are extremely common in production systems because:

  • Small instances hide problems until traffic or background tasks accumulate.
  • Node.js executes JavaScript on a single thread, so one blocking function can freeze the entire server.
  • Developers underestimate background workloads, especially:
    • WebSocket heartbeats
    • real-time presence tracking
    • retry loops for external APIs
  • t3.micro instances have 2 vCPUs but only a small CPU baseline (roughly 10% per vCPU), making them fragile under sustained load.
  • Memory leaks escalate slowly, then suddenly trigger OOM kills.
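The event-loop point above is easy to demonstrate: any synchronous, CPU-bound work delays every queued timer and request. A minimal sketch (the blockFor helper is hypothetical, standing in for heavy JSON parsing or sync crypto):

```javascript
// Demonstrates event-loop blocking: a timer scheduled for 10 ms fires
// late because synchronous CPU-bound work runs to completion first.

function blockFor(ms) {
  // Busy-wait: stand-in for heavy synchronous work
  const end = Date.now() + ms;
  while (Date.now() < end) {}
}

const scheduled = Date.now();
setTimeout(() => {
  const lag = Date.now() - scheduled - 10;
  console.log(`timer fired ~${lag} ms late`); // typically ~90 ms late here
}, 10);

blockFor(100); // blocks the loop; the 10 ms timer cannot fire until this returns
```

While blockFor runs, the same process cannot accept connections, answer WebSocket pings, or fire timers, which is exactly how "one blocking function" freezes a whole server.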

Real-World Impact

When CPU spikes to 100% on a t3.micro:

  • Requests slow down or time out
  • WebSocket connections drop
  • The kernel kills Node.js due to memory or CPU starvation
  • Logs show nothing because the process dies before flushing output
  • Auto-restarts create a crash loop, making debugging harder

Example Code

A common hidden cause is an unbounded interval created during a recent code change:

setInterval(async () => {
  await expensiveOperation(); // CPU-heavy or blocking
}, 1000); // if a run takes longer than 1 s, invocations overlap and pile up
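When overlapping runs are the problem, a common fix is a self-rescheduling setTimeout: the next run is scheduled only after the current one finishes, so slow iterations cannot pile up. A sketch (expensiveOperation here is a placeholder for the real work):

```javascript
// Self-rescheduling timeout: unlike setInterval, the next tick is only
// scheduled after the current one completes, so runs never overlap.
async function expensiveOperation() {
  // Placeholder for the real CPU-heavy or I/O work
  return new Promise((resolve) => setTimeout(resolve, 10));
}

let stopped = false;

async function tick() {
  try {
    await expensiveOperation();
  } catch (err) {
    console.error("tick failed:", err); // never let one failure kill the loop
  }
  if (!stopped) setTimeout(tick, 1000); // reschedule only after completion
}

tick();
```

Setting stopped = true (e.g. on SIGTERM) lets the loop drain cleanly instead of leaking a timer across restarts.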

Or a blocking synchronous call inside a request handler:

const express = require("express");
const fs = require("fs");

const app = express();

app.get("/data", (req, res) => {
  const result = fs.readFileSync("/large-file.json"); // blocks the event loop
  res.send(result);
});

How Senior Engineers Fix It

Experienced engineers approach this systematically:

  • Switch instance type
    • Move from t3.micro → t3.small / t3.medium, or to a fixed-performance family (m- or c-series) if the load is sustained
    • Or enable T3 Unlimited mode (the "unlimited" credit specification) so the instance is not throttled when CPU credits run out, at extra cost
  • Instrument the event loop
    • Use tools like clinic.js, 0x, or Node’s built-in --inspect
  • Add CPU profiling
    • Identify blocking functions, large loops, or heavy synchronous calls
  • Audit all intervals, timeouts, and background tasks
  • Throttle or debounce real-time features (Socket.io, Agora)
  • Add memory and CPU dashboards using CloudWatch or Prometheus
  • Enable PM2 or systemd restart policies with backoff
  • Review recent code changes for:
    • new loops
    • new listeners
    • new async tasks that never resolve

Why Juniors Miss It

Less experienced engineers often overlook this because:

  • They assume “no traffic” means “no load”, forgetting background tasks run continuously.
  • They don’t know t3.micro uses CPU credits, not guaranteed CPU.
  • They rely only on application logs, unaware that OOM kills happen at the OS level.
  • They underestimate the cost of synchronous operations in Node.js.
  • They don’t monitor event-loop lag, only request logs.
  • They assume WebSockets are idle, even though heartbeats and pings consume CPU.

This combination makes the issue appear mysterious, even though the underlying cause is predictable once you understand how small EC2 instances and Node.js behave under load.
