Why does a Kubernetes TCP readiness probe initially return “connection refused” for a Kafka broker?

Summary

The Kubernetes TCP readiness probe initially returns a “connection refused” error for a Kafka broker because the kubelet tries to open a TCP connection to the broker port before the broker has started listening on it. This is not specific to Kafka; any server that takes longer to start than the probe's initial delay shows the same behavior.

Root Cause

The root cause of this issue is:

  • The Kafka broker takes some time to fully start and listen on the specified port (9092 in this case)
  • The TCP readiness probe succeeds only if the kubelet can open a TCP connection to the port; it verifies that something accepts connections, not that the broker is healthy
  • If the probe runs before the Kafka broker is listening, the connection attempt is rejected and the probe fails with a “connection refused” error

Why This Happens in Real Systems

This happens in real systems because:

  • Container startup times vary with system resources and the complexity of the startup process; a JVM-based broker such as Kafka can take many seconds before it listens on its port
  • A readiness probe starts running once initialDelaySeconds has elapsed, whether or not the application is listening yet; nothing ties the first probe to the application's actual startup progress
  • TCP requires a process to be listening on the port to complete the handshake; until then, the kernel rejects the connection attempt, which the client reports as “connection refused”

Real-World Impact

The real-world impact of this issue is:

  • Delayed readiness: The pod stays NotReady and is kept out of Service endpoints until a probe succeeds; the failed probes do not slow the container's own startup
  • Noisy events: Each failed probe is recorded as a warning event on the pod, which can look alarming even though it is expected during startup
  • Stalled rollouts: Readiness failures do not restart the container (that is the job of liveness and startup probes), but if the pod never becomes ready, a rollout can stall or fail

Example or Code

apiVersion: v1
kind: Pod
metadata:
  name: kafka-broker
spec:
  containers:
  - name: kafka-broker
    image: confluentinc/cp-kafka:5.4.3
    ports:
    - containerPort: 9092
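    readinessProbe:
      # Succeeds only when the kubelet can open a TCP connection to the port.
      tcpSocket:
        port: 9092
      # The first probe runs about 15s after the container starts; if the
      # broker is not listening yet, it fails with "connection refused".
      initialDelaySeconds: 15
      # The probe is then repeated every 5s until it succeeds.
      periodSeconds: 5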

How Senior Engineers Fix It

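Senior engineers fix this issue by:

  • Increasing initialDelaySeconds so that the first probe runs after the broker has normally finished starting
  • Adjusting periodSeconds to reduce the number of probes that run, and fail, during startup
  • Adding a startupProbe, which holds back the readiness and liveness probes until the application has started
  • Using a probe that checks the actual status of the broker, such as an exec probe; an HTTP probe only helps if the container exposes an HTTP health endpoint, which a Kafka broker does not do by default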
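As a minimal sketch of these options, the container spec from the example above could be extended as follows. The numbers (a startup budget of 30 attempts at 5 seconds each) are illustrative assumptions, not values from a measured system, and the exec command in the comments assumes that the image has the kafka-broker-api-versions tool on its PATH and that the broker accepts connections on localhost:9092; verify both for your image and listener configuration before relying on it.

```yaml
  containers:
  - name: kafka-broker
    image: confluentinc/cp-kafka:5.4.3
    ports:
    - containerPort: 9092
    # Assumed startup budget: up to 30 attempts x 5s = 150s for the broker
    # to begin listening. Readiness probing starts only after this succeeds.
    startupProbe:
      tcpSocket:
        port: 9092
      periodSeconds: 5
      failureThreshold: 30
    # Once the broker is up, keep checking that the port still accepts
    # connections.
    readinessProbe:
      tcpSocket:
        port: 9092
      periodSeconds: 10
    # Hypothetical alternative to the tcpSocket readiness check: ask the
    # broker to answer a real request instead of only accepting a connection.
    #   exec:
    #     command: ["sh", "-c", "kafka-broker-api-versions --bootstrap-server localhost:9092 > /dev/null"]
    #   timeoutSeconds: 10
```

One trade-off to keep in mind: if the startup probe exhausts its failureThreshold, the kubelet restarts the container, so the budget should comfortably exceed the slowest startup you expect. The exec variant also launches a JVM on every probe, which is why the sketch pairs it with a longer timeout.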

Why Juniors Miss It

Juniors may miss this issue because of:

  • A lack of understanding of how TCP sockets and readiness probes work in Kubernetes
  • Insufficient experience with container startup times and probe configurations
  • Overreliance on default probe settings, which may not suit slow-starting applications
