Summary
An intermittent production failure occurred in a Java 21 microservice: POST requests containing JSON payloads failed with java.io.IOException: HTTP/2 stream was reset. While the service functioned correctly under low load, the error surfaced under specific concurrency patterns and network conditions, leading to failed downstream transactions and increased error rates.
Root Cause
The failure is rooted in the behavior of the HTTP/2 protocol implementation within the Java HttpClient when interacting with certain middleboxes (proxies, load balancers, or WAFs).
- HTTP/2 Stream Multiplexing: Unlike HTTP/1.1, where each TCP connection carries one request at a time and concurrency requires several connections, HTTP/2 multiplexes many concurrent requests (streams) over a single connection.
- RST_STREAM Frames: The error occurs when the remote peer (or an intermediate proxy) sends an RST_STREAM frame. This tells the client that the specific stream has been terminated, often due to:
  - Idle Timeouts and Policy Violations: The underlying TCP connection is kept alive, but a proxy kills the specific stream because it exceeded a timeout or violated a policy.
  - Flow Control Issues: The remote server sends a reset if the flow-control window for the stream is violated.
  - Header/Payload Mismatches: Intermediate layers often reset streams if they detect an anomaly in the request body or if the connection has become “stale” due to a silent TCP timeout.
- The “Stale Connection” Problem: The HttpClient attempts to reuse an existing HTTP/2 connection that the server or a load balancer has already decided to close, leading to an immediate reset when the request is dispatched.
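To confirm which of these causes applies, it helps to see the RST_STREAM frame itself. The sketch below turns on the JDK client's built-in logging through the jdk.httpclient.HttpClient.log system property. The class name Http2FrameLogging is hypothetical, and the sketch assumes the property is set before the HTTP client classes are first used (passing it as a -D flag at JVM startup is the safer route).

```java
import java.net.http.HttpClient;

final class Http2FrameLogging {
    // Logging property of the java.net.http module. It is assumed here to be
    // read once, when the client classes initialise, so set it early.
    static final String LOG_PROPERTY = "jdk.httpclient.HttpClient.log";

    static HttpClient newLoggingClient() {
        // "errors" logs exceptions; "frames:control" logs HTTP/2 control
        // frames, the category that RST_STREAM is expected to fall under.
        System.setProperty(LOG_PROPERTY, "errors,frames:control");
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .build();
    }
}
```

This kind of logging is verbose, so it is better suited to a short diagnostic window than to permanent production use.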
Why This Happens in Real Systems
In local development or low-traffic environments, connections are frequently closed and reopened, hiding the issue. In production, the following factors exacerbate it:
- Connection Pooling: The HttpClient maintains a pool of connections to optimize performance. If a connection sits idle in the pool for longer than the load balancer’s idle timeout, the next request sent over that connection will trigger a reset.
- L4 vs L7 Load Balancers: Layer 4 load balancers may drop TCP connections silently, while Layer 7 proxies (like Nginx or Envoy) might explicitly send RST_STREAM if they reach a limit on concurrent streams or request sizes.
- Concurrency Spikes: A sudden burst of requests might attempt to use a connection that is currently being terminated by the server’s “Keep-Alive” policy.
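One way to reduce the pooling problem is to make the client drop idle connections before the load balancer does. The sketch below uses the jdk.httpclient.keepalive.timeout and jdk.httpclient.keepalive.timeout.h2 system properties (values in seconds). The class name PoolIdleTimeout, the 10-second safety margin, and the idea that the load balancer's idle timeout is known are all assumptions for illustration. The properties are assumed to be read once at client initialisation, so they should be set early or passed as -D flags.

```java
import java.net.http.HttpClient;
import java.time.Duration;

final class PoolIdleTimeout {
    static final String KEEPALIVE = "jdk.httpclient.keepalive.timeout";
    static final String KEEPALIVE_H2 = "jdk.httpclient.keepalive.timeout.h2";

    // Client-side idle timeout: the load balancer's timeout minus a margin,
    // never less than one second.
    static long clientIdleSeconds(Duration lbIdleTimeout, Duration margin) {
        return Math.max(1, lbIdleTimeout.minus(margin).toSeconds());
    }

    static HttpClient newClient(Duration lbIdleTimeout) {
        // The 10-second margin is an illustrative choice, not a recommendation.
        String value = Long.toString(
                clientIdleSeconds(lbIdleTimeout, Duration.ofSeconds(10)));
        System.setProperty(KEEPALIVE, value);    // HTTP/1.1 and HTTP/2
        System.setProperty(KEEPALIVE_H2, value); // HTTP/2 override
        return HttpClient.newHttpClient();
    }
}
```

With a load balancer that closes idle connections after 60 seconds, newClient(Duration.ofSeconds(60)) would set both properties to 50.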
Real-World Impact
- Data Inconsistency: If a POST request is reset after the server has processed it but before the client receives the response, the client treats the request as failed, potentially leading to duplicate writes upon retry.
- Increased Latency: Frequent retries and connection renegotiations increase the tail latency (P99) of the service.
- Cascading Failures: If not handled with backoff, rapid retries can overwhelm the third-party API or the local thread pool.
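The duplicate-write risk can be reduced by attaching an idempotency key to each logical operation and reusing it on every retry. The sketch below is a minimal illustration, assuming the upstream API honours an Idempotency-Key header. The class name IdempotentPost and its method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

final class IdempotentPost {
    // One key per logical operation; generate it once, outside the retry loop.
    static String newKey() {
        return UUID.randomUUID().toString();
    }

    // HttpRequest is immutable, so resending the same instance on each
    // retry carries the same key to the server.
    static HttpRequest build(URI uri, String json, String idempotencyKey) {
        return HttpRequest.newBuilder()
                .uri(uri)
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .header("Idempotency-Key", idempotencyKey)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

The key only helps if the server deduplicates on it. Without that support, retrying a POST after a reset remains unsafe.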
Example or Code
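To fix this, we must implement a retry mechanism that targets only reset-related IOExceptions, and consider falling back to HTTP/1.1 if HTTP/2 stability cannot be guaranteed by the provider. Because the request is a POST, retries are only safe if the endpoint tolerates duplicates.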
import java.io.IOException;
import java.net.URI;
import java.net.http.*;
import java.time.Duration;
import java.util.Locale;

public class ReliableApiClient {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            // Falling back to HTTP/1.1 can resolve many RST_STREAM issues
            // if the environment/proxy is unstable with HTTP/2
            .version(HttpClient.Version.HTTP_1_1)
            .build();

    // maxRetries is the total number of attempts, including the first one
    public static HttpResponse<String> sendWithRetry(HttpRequest request, int maxRetries)
            throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            try {
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            } catch (IOException e) {
                attempts++;
                if (attempts >= maxRetries || !isRetryable(e)) {
                    throw e;
                }
                // Exponential backoff (200 ms, 400 ms, ...) to prevent hammering the server
                Thread.sleep((long) Math.pow(2, attempts) * 100);
            }
        }
    }

    private static boolean isRetryable(IOException e) {
        // Check if the error is an HTTP/2 stream reset or a connection reset.
        // getMessage() can be null, and the comparison is case-insensitive
        // because the message casing varies (e.g. "Connection reset").
        String message = e.getMessage();
        if (message == null) {
            return false;
        }
        String lower = message.toLowerCase(Locale.ROOT);
        return lower.contains("stream was reset") || lower.contains("connection reset");
    }

    public static void main(String[] args) throws Exception {
        String json = """
                {"userId":123,"action":"ping","timestamp":"2026-03-04T10:00:00Z"}
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/v1/events"))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer REDACTED")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> response = sendWithRetry(request, 3);
        System.out.println("Status: " + response.statusCode());
    }
}
How Senior Engineers Fix It
- Protocol Downgrade: If the third-party endpoint is known to have unstable HTTP/2 implementations, we explicitly set .version(HttpClient.Version.HTTP_1_1). The performance trade-off is usually negligible compared to the cost of failure.
- Idempotency Keys: To safely retry POST requests, we ensure the API supports Idempotency-Key headers. This prevents the “duplicate record” problem during retries.
- Custom Retry Policy: Instead of a blind try-catch, we implement a formal Retry Policy using libraries like Resilience4j, incorporating exponential backoff and jitter.
- Observability: We add specific metrics to track HTTP_2_RESET occurrences. If the rate exceeds a threshold, it triggers an alert to investigate changes in the network topology or the provider’s infrastructure.
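To illustrate what "exponential backoff and jitter" means in practice, here is a dependency-free sketch of a capped exponential delay with full jitter. It is not Resilience4j's API. The class name Backoff, the method names, and the base and cap values in the usage note are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    // Upper bound for the given attempt (1-based): base * 2^(attempt - 1),
    // never more than cap. Assumes non-negative base and cap.
    static Duration ceiling(int attempt, Duration base, Duration cap) {
        int shift = Math.min(Math.max(attempt - 1, 0), 62);
        long baseMs = base.toMillis();
        long capMs = cap.toMillis();
        // Compare before shifting so the multiplication cannot overflow.
        if (baseMs > (capMs >> shift)) {
            return cap;
        }
        return Duration.ofMillis(baseMs << shift);
    }

    // Full jitter: a uniformly random delay between 0 and the ceiling,
    // so that many clients retrying at once do not stay synchronised.
    static Duration withFullJitter(int attempt, Duration base, Duration cap) {
        long maxMs = ceiling(attempt, base, cap).toMillis();
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(maxMs + 1));
    }
}
```

With a 100 ms base and a 5 s cap, the ceilings for attempts 1, 2 and 3 are 100 ms, 200 ms and 400 ms, and every attempt from the seventh onward is capped at 5 s. A retry loop would sleep for withFullJitter(attempt, base, cap) between attempts.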
Why Juniors Miss It
- Focusing only on Logic: Juniors often assume that if the code works on localhost, it will work in production. They fail to account for network intermediaries (Proxies/WAFs).
- Ignoring Protocol Nuances: Most developers treat HTTP as a single entity. They don’t realize that HTTP/2 multiplexing introduces a whole new category of failure modes compared to the linear nature of HTTP/1.1.
- Naive Retries: A junior might implement a simple for loop retry without exponential backoff, which can turn a minor network hiccup into a self-inflicted Denial of Service (DoS) attack against the upstream service.