Spring Boot/WebFlux APIs on Kubernetes Restarted After Failed Health Checks

Summary

Spring Boot/WebFlux APIs running on Kubernetes stopped responding to health checks during load testing, so Kubernetes restarted the pods. Under high concurrency the request-handling threads were exhausted, which left no thread free to serve the health check endpoints.

Root Cause

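Thread pool exhaustion: WebFlux's default thread pool was too small to serve both application requests and health checks under load. With Reactor Netty, the event loop defaults to one thread per available CPU core (with a minimum of four), so a few occupied threads are enough to stall everything else.
Kubernetes probes timing out: Liveness and readiness probes failed after 45 seconds. Failed liveness probes triggered the pod restarts; failed readiness probes additionally remove a pod from Service endpoints.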
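
To reason about how long an unresponsive pod survives, the probe settings can be turned into a rough time budget. The sketch below is an approximation, not kubelet's exact scheduling: it assumes timeoutSeconds is no larger than periodSeconds and counts from the start of the first failing probe. The class name ProbeBudget and the values in main are hypothetical and are not taken from the incident.

```java
class ProbeBudget {

    /**
     * Approximate seconds between the start of the first failing probe and the
     * moment the failure threshold is reached (assumes timeout <= period).
     */
    static int secondsUntilAction(int periodSeconds, int timeoutSeconds, int failureThreshold) {
        if (periodSeconds <= 0 || timeoutSeconds <= 0 || failureThreshold <= 0) {
            throw new IllegalArgumentException("all probe settings must be positive");
        }
        // The first (failureThreshold - 1) failures are each followed by a full period;
        // the last one only needs to run into its timeout.
        return (failureThreshold - 1) * periodSeconds + timeoutSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical settings: periodSeconds=10, timeoutSeconds=5, failureThreshold=3
        System.out.println(secondsUntilAction(10, 5, 3));
    }
}
```

If the application can stall for longer than this budget under load, the liveness probe will restart it even though it would have recovered on its own.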

Why This Happens in Real Systems

Non-blocking I/O assumptions: WebFlux is reactive, but blocking operations or slow downstream calls that run on its event-loop threads can still saturate them.
Default configurations: Spring Boot and Kubernetes defaults are not optimized for high-concurrency scenarios.
Resource contention: Limited CPU/memory in Kubernetes pods exacerbates thread pool limitations.
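
The saturation described above can be reproduced without any framework. The following is a minimal sketch using plain JDK executors as a stand-in for the request-handling threads; it is not WebFlux code, and the class name PoolExhaustionDemo and all pool sizes and durations are illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class PoolExhaustionDemo {

    /**
     * Fills a fixed pool with slow tasks, then submits a trivial "health check"
     * to the same pool. Returns true if the check answers within the probe timeout.
     */
    static boolean healthCheckAnswers(int poolSize, int blockingRequests,
                                      long blockMs, long probeTimeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            for (int i = 0; i < blockingRequests; i++) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(blockMs); // stands in for a blocking downstream call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The health check queues behind the slow tasks on the same threads.
            Future<String> health = pool.submit(() -> "UP");
            return "UP".equals(health.get(probeTimeoutMs, TimeUnit.MILLISECONDS));
        } catch (Exception e) {
            // Timeout, interruption, or execution failure all count as a failed probe.
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle pool answers:      " + healthCheckAnswers(4, 0, 0, 200));
        System.out.println("saturated pool answers: " + healthCheckAnswers(4, 8, 1000, 200));
    }
}
```

With every thread parked in a slow call, the health check itself is cheap but never gets scheduled before the probe gives up, which is the same failure mode a probe sees from outside the pod.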

Real-World Impact

Service downtime: Pods restarted unnecessarily, leading to intermittent API unavailability.
Degraded user experience: Requests failed while pods restarted, and the resulting load test failures masked the application's actual performance issues.
Operational overhead: Frequent restarts increased resource consumption and monitoring alerts.

Example or Code

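import org.springframework.context.annotation.Bean;
import org.springframework.http.client.reactive.ReactorResourceFactory; // org.springframework.http.client in Spring Framework 6.1+
import reactor.netty.resources.LoopResources;

@Bean
public ReactorResourceFactory resourceFactory() {
    ReactorResourceFactory factory = new ReactorResourceFactory();
    // Use dedicated resources instead of Reactor Netty's global ones
    factory.setUseGlobalResources(false);
    // Size the event loop explicitly: 100 daemon worker threads ("webflux-http" is an arbitrary name prefix)
    factory.setLoopResourcesSupplier(() -> LoopResources.create("webflux-http", 100, true));
    return factory;
}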

How Senior Engineers Fix It

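Increase thread pool size: Configure ReactorResourceFactory with a larger LoopResources worker count for WebFlux, and keep blocking calls off the event-loop threads so the extra threads are not simply consumed again.
Optimize Kubernetes probes: Adjust timeoutSeconds, periodSeconds, and failureThreshold so that probe tolerance matches how the application behaves under load.
Monitor thread usage: Track thread and executor metrics (for example, the JVM thread metrics exposed through Spring Boot Actuator) to detect thread pool saturation before probes start failing.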
Tune Kubernetes resources: Allocate more CPU/memory to pods to handle higher concurrency.
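
One way to picture the first and third fixes together is to give blocking work its own bounded pool and watch that pool's saturation. The sketch below uses plain JDK executors as a stand-in and is not the project's actual configuration; the class name IsolatedPoolsDemo, the two-thread request pool, and all sizes and durations are assumptions chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class IsolatedPoolsDemo {

    /** Snapshot taken while the slow calls are still in flight. */
    static final class Snapshot {
        final boolean healthAnswered;
        final int busyBlockingThreads;
        final int queuedBlockingTasks;

        Snapshot(boolean healthAnswered, int busyBlockingThreads, int queuedBlockingTasks) {
            this.healthAnswered = healthAnswered;
            this.busyBlockingThreads = busyBlockingThreads;
            this.queuedBlockingTasks = queuedBlockingTasks;
        }
    }

    static Snapshot run(int requests, int blockingPoolSize, long blockMs, long probeTimeoutMs) {
        // Small pool standing in for the request-handling (event-loop) threads.
        ExecutorService requestPool = Executors.newFixedThreadPool(2);
        // Separate pool that absorbs the blocking work; excess tasks wait in its queue.
        ThreadPoolExecutor blockingPool = new ThreadPoolExecutor(
                blockingPoolSize, blockingPoolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        CountDownLatch handedOff = new CountDownLatch(requests);
        CountDownLatch running = new CountDownLatch(Math.min(requests, blockingPoolSize));
        try {
            for (int i = 0; i < requests; i++) {
                requestPool.execute(() -> {
                    // The request thread only hands the slow call off and returns at once.
                    blockingPool.execute(() -> {
                        running.countDown();
                        sleepQuietly(blockMs);
                    });
                    handedOff.countDown();
                });
            }
            boolean healthAnswered = "UP".equals(
                    requestPool.submit(() -> "UP").get(probeTimeoutMs, TimeUnit.MILLISECONDS));
            handedOff.await(5, TimeUnit.SECONDS);
            running.await(5, TimeUnit.SECONDS);
            // getActiveCount() is documented as approximate; it is used here as a saturation gauge.
            return new Snapshot(healthAnswered,
                    blockingPool.getActiveCount(), blockingPool.getQueue().size());
        } catch (Exception e) {
            return new Snapshot(false,
                    blockingPool.getActiveCount(), blockingPool.getQueue().size());
        } finally {
            requestPool.shutdownNow();
            blockingPool.shutdownNow();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis); // stands in for a blocking downstream call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Snapshot s = run(8, 4, 1000, 500);
        System.out.println("health answered: " + s.healthAnswered);
        System.out.println("busy blocking threads: " + s.busyBlockingThreads);
        System.out.println("queued blocking tasks: " + s.queuedBlockingTasks);
    }
}
```

In a WebFlux application the analogous move is usually to wrap the blocking call, for example with Mono.fromCallable(...) subscribed on Schedulers.boundedElastic(), rather than managing executors by hand. The busy and queued counts are the kind of signal worth exporting as metrics, since a growing queue shows saturation well before a probe fails.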

Why Juniors Miss It

Assumption of reactivity: Juniors often assume WebFlux handles all concurrency on its own and that threads never need to be managed.
Overlooking defaults: They do not review and adjust default configurations for production workloads.
Lack of load testing experience: They run too little load testing to uncover failures that appear only under high concurrency.