Prevent ESP32 Watchdog Resets by Avoiding Blocking Async Handlers

Summary

A production incident occurred in which the ESP32 hardware watchdog timer (WDT) reset the system while it was handling specific HTTP requests. The cause was computationally expensive or blocking synchronous work performed inside an asynchronous event handler registered with the ESPAsyncWebServer library. Because the library is built on a non-blocking architecture, a long-running handler stalls the task that services TCP events, and the watchdog concludes that the system has hung.

Root Cause

The failure is caused by the violation of the asynchronous execution model.

  • Blocking the Event Loop: ESPAsyncWebServer is built on AsyncTCP, which relies on event-driven callbacks. When a handler is invoked, it runs in the context of the AsyncTCP service task rather than in loop(), so nothing else in that task can run until the handler returns.
  • Watchdog Starvation: If a handler executes a “heavy” synchronous task (e.g., complex math, large file I/O, or long loops), the task that called it cannot get back to the point where it “feeds” the watchdog, and the watchdog times out.
  • Lack of Yielding: Unlike the standard WebServer library (WebServer.h), whose handlers run from the main loop(), an async handler does not yield control to the scheduler or the watchdog management code until the function returns.

Why This Happens in Real Systems

In embedded systems and high-performance backend services, this is a classic Concurrency vs. Throughput trade-off.

  • Event-Driven Architectures: Libraries like AsyncTCP are designed to maximize throughput by handling many connections simultaneously without per-connection thread overhead. The cost is that they tolerate synchronous blocking very poorly.
  • Resource Monopolization: A single “greedy” request can monopolize the CPU, causing a cascading failure where heartbeats, sensor readings, and network keep-alives all fail at once.
  • Shared Context: Many embedded frameworks run the network stack and the application logic on the same hardware core or within the same task priority level, making it impossible to separate “work” from “system maintenance” without explicit design.

Real-World Impact

  • System Instability: Frequent, unpredictable reboots that clear volatile memory and disrupt ongoing client connections.
  • Service Unavailability: During the “expensive” task execution, the device becomes unresponsive to all other network requests, effectively causing a Denial of Service (DoS) for all other users.
  • Data Corruption: If a watchdog reset occurs while the device is writing to NVM (Non-Volatile Memory) or SD Card, it can lead to filesystem corruption.

Example or Code

The following pattern illustrates the dangerous “Task Queue” approach attempted by the user. It moves the heavy work out of the handler and into loop(), but it queues the raw request pointer, which risks a use-after-free bug unless object lifetimes are handled with extreme care.

#include <deque>
#include <ESPAsyncWebServer.h>

typedef void (*TaskCallback)(AsyncWebServerRequest *);

class Task {
public:
    AsyncWebServerRequest *req;
    TaskCallback callback;
    Task(AsyncWebServerRequest *req, TaskCallback callback) {
        this->req = req;
        this->callback = callback;
    }
};

// NOTE: shared between the handler (AsyncTCP task) and loop() with no locking
std::deque<Task> toBeExecuted;
AsyncWebServer server(80);

void addTask(AsyncWebServerRequest *req) {
    // DANGER: 'req' pointer may become invalid once this handler returns
    toBeExecuted.push_back(Task(req, [](AsyncWebServerRequest *r) {
        // Heavy synchronous logic here
        r->send(200, "text/plain", "Done");
    }));
}

void setup() {
    server.on("/heavy-task", HTTP_GET, addTask);
    server.begin();
}

void loop() {
    if (!toBeExecuted.empty()) {
        Task t = toBeExecuted.front();
        // DANGER: 'req' may already have been freed (e.g. the client disconnected)
        t.callback(t.req);
        toBeExecuted.pop_front();
    }
    delay(10); 
}
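
One way to remove the lifetime hazard, sketched under assumptions: copy the values the work needs out of the request while the handler is still running, and queue that copy instead of the pointer. WorkItem, WorkQueue, and their fields are hypothetical names, not part of ESPAsyncWebServer. std::mutex is used only to keep the sketch portable; on the ESP32 a FreeRTOS queue could play the same role. The lock matters because the handler and loop() run in different tasks.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical job description: plain values copied out of the request.
// It holds no AsyncWebServerRequest*, so it stays valid after the request is gone.
struct WorkItem {
    uint32_t id;          // illustrative job identifier
    std::string payload;  // e.g. a query parameter copied by value
};

// Minimal thread-safe FIFO shared between the handler (producer)
// and loop() or a worker task (consumer).
class WorkQueue {
public:
    void push(WorkItem item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(item));
    }

    // Moves the oldest item into 'out'; returns false if the queue is empty.
    bool tryPop(WorkItem &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<WorkItem> items_;
};
```

In this arrangement the handler would call push() with values read from the request and respond right away, while loop() calls tryPop() and runs the heavy logic after the lock has been released. The consumer no longer has a request to answer, so the result has to reach the client some other way, such as a later poll.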

How Senior Engineers Fix It

A senior engineer would move away from a simple queue in the main loop and instead use FreeRTOS primitives to decouple the networking layer from the application logic.

  • Decoupling via Queues: Use a FreeRTOS queue (xQueueCreate, xQueueSend, xQueueReceive) to pass only the necessary data, not the request object itself, from the handler to a dedicated worker task.
  • Worker Tasks: Spawn a separate FreeRTOS task with a lower priority than the network task to process the heavy computation. The higher-priority network task can then preempt the worker and keep feeding the watchdog.
  • State Management: Instead of holding a raw AsyncWebServerRequest* pointer (which is dangerous), have the worker publish its result to shared state and use an Event Group or a Semaphore to signal that the result is ready.
  • Response Strategy: If the work takes too long to finish within a single request, implement a Polling or Webhook pattern: return 202 Accepted immediately, perform the work in the background, and let the client poll a different endpoint for the result.
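
The points above can be combined into a rough sketch of the 202-then-poll approach. Everything named here is an assumption made for illustration: JobTable, JobState, the /result endpoint, the "id" query parameter, the queue depth, stack size, task priority, and core number. The registry is plain C++; the ESP32 wiring is guarded so it only compiles for that target, and Wi-Fi setup is omitted.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

enum class JobState { Unknown, Pending, Done };

// Hypothetical registry shared by the HTTP handlers and the worker task.
class JobTable {
public:
    // Registers a new pending job and returns its id.
    uint32_t submit() {
        std::lock_guard<std::mutex> lock(mutex_);
        uint32_t id = nextId_++;
        jobs_[id] = Entry{JobState::Pending, ""};
        return id;
    }
    // Stores the result once the worker has finished the heavy work.
    void complete(uint32_t id, const std::string &result) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = jobs_.find(id);
        if (it != jobs_.end()) it->second = Entry{JobState::Done, result};
    }
    // Forgets a job, e.g. when it could not be queued.
    void discard(uint32_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        jobs_.erase(id);
    }
    // Reports the job state; copies the result into 'result' only when Done.
    JobState poll(uint32_t id, std::string &result) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = jobs_.find(id);
        if (it == jobs_.end()) return JobState::Unknown;
        if (it->second.state == JobState::Done) result = it->second.result;
        return it->second.state;
    }
private:
    struct Entry { JobState state; std::string result; };
    std::mutex mutex_;
    std::map<uint32_t, Entry> jobs_;
    uint32_t nextId_ = 1;
};

#if defined(ARDUINO_ARCH_ESP32)
// ESP32-only wiring sketch; sizes, priority, and core are assumptions.
#include <Arduino.h>
#include <ESPAsyncWebServer.h>

static JobTable jobs;
static QueueHandle_t jobQueue;  // carries job ids, never request pointers
static AsyncWebServer server(80);

static void workerTask(void *) {
    uint32_t id;
    for (;;) {
        if (xQueueReceive(jobQueue, &id, portMAX_DELAY) == pdTRUE) {
            // Heavy synchronous logic goes here, outside the AsyncTCP task.
            // Very long computations should still call vTaskDelay(1) now and then.
            jobs.complete(id, "Done");
        }
    }
}

void setup() {
    jobQueue = xQueueCreate(8, sizeof(uint32_t));
    xTaskCreatePinnedToCore(workerTask, "worker", 8192, nullptr, 1, nullptr, 1);

    server.on("/heavy-task", HTTP_GET, [](AsyncWebServerRequest *req) {
        uint32_t id = jobs.submit();
        if (xQueueSend(jobQueue, &id, 0) != pdTRUE) {  // queue full: do not block
            jobs.discard(id);
            req->send(503, "text/plain", "Busy");
            return;
        }
        req->send(202, "text/plain", String(id));  // client polls /result?id=<id>
    });

    server.on("/result", HTTP_GET, [](AsyncWebServerRequest *req) {
        if (!req->hasParam("id")) { req->send(400, "text/plain", "Missing id"); return; }
        uint32_t id = (uint32_t) req->getParam("id")->value().toInt();
        std::string result;
        switch (jobs.poll(id, result)) {
            case JobState::Done:    req->send(200, "text/plain", result.c_str()); break;
            case JobState::Pending: req->send(202, "text/plain", "Pending");      break;
            default:                req->send(404, "text/plain", "Unknown job");  break;
        }
    });
    server.begin();
}

void loop() { vTaskDelay(pdMS_TO_TICKS(1000)); }
#endif
```

Both handlers return almost immediately, and the request pointer is never stored past the handler call. The sketch never evicts finished jobs, so a real device would need to cap or expire entries to bound memory use.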

Why Juniors Miss It

  • The Pointer Trap: Juniors often assume that because a pointer (AsyncWebServerRequest*) is valid during the function call, it remains valid indefinitely. They fail to realize that the lifetime of the request object is managed by the server library, not by their code, and that the object can be destroyed after the handler returns, for example once the response has been sent or the client disconnects.
  • Synchronous Bias: Coming from standard Arduino tutorials, juniors are used to a single-threaded “everything happens in loop()” mindset. They struggle to grasp the non-deterministic timing of asynchronous event-driven systems.
  • Symptom vs. Cause: A junior sees a “Watchdog Reset” and tries to increase the watchdog timeout or add more delay() calls, rather than addressing the architectural flaw of blocking an asynchronous callback.
