Envoy data plane container not coming up: startup probe failing, gRPC errors, and xDS connection unsuccessful

Summary

A pod running Envoy 1.6.2 (data plane) in Kubernetes failed to become ready because the startup probe could not connect to the Envoy admin interface (:19003/ready). Envoy logs showed a persistent gRPC connection failure and DNS resolution issues while trying to establish the xDS connection to the control plane (envoy-gateway.envoy-gateway-system.svc.cluster.local). The root cause was a Kubernetes DNS configuration mismatch: the data plane Pod was using the host node’s DNS resolver (via hostNetwork: true or a custom dnsConfig), which has no access to the cluster’s internal DNS service (CoreDNS) and therefore cannot resolve internal ClusterIP Services.

Root Cause

The fundamental issue was a DNS resolution failure for the control plane service within the Envoy container. Because the data plane pod could not resolve the internal Kubernetes service name envoy-gateway.envoy-gateway-system.svc.cluster.local to a ClusterIP, the xDS gRPC stream could not be established.

The specific sequence of failure was:

  • Configuration Mismatch: The data plane pod was running with hostNetwork: true or a custom dnsConfig that pointed to external resolvers (e.g., 8.8.8.8).
  • DNS Query Failure: Envoy issued a DNS query for the control plane service. The external DNS server returned NOERROR (successful query but empty answer) because it has no record of internal Kubernetes services.
  • gRPC Connection Failure: With no address resolved, the gRPC channel to the xDS server could never be established; Envoy’s connection attempts timed out and entered a retry/backoff loop.
  • Startup Probe Failure: Envoy’s main thread was blocked waiting for the xDS configuration. Consequently, the admin server on port 19003 was not fully initialized or accepting connections, causing the Kubernetes startup probe to fail with “connection refused.”

Why This Happens in Real Systems

This is a common anti-pattern in Service Mesh and CNI configurations where L7 proxies (like Envoy) interact with L4 networking constraints.

  • Service Discovery Dependency: Envoy relies entirely on the control plane to bootstrap its configuration. The control plane is almost always an internal ClusterIP Service.
  • DNS as the Bootstrap Mechanism: The very first step of Envoy’s startup is resolving the address of the xDS server. If the DNS resolver in the container cannot see the internal Kubernetes DNS (coredns), the system deadlocks: no config means no server, and no server means the health check fails.
  • HostNetwork Isolation: When hostNetwork: true is used (often for performance or CNI reasons), the Pod inherits the Node’s /etc/resolv.conf. If the Node’s DNS is not configured to forward queries for cluster.local to the Cluster DNS IP, internal service names will never resolve.
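The deadlock above can be detected before Envoy ever starts. Below is a minimal, illustrative preflight check in Python (the helper name and the failure message are assumptions, not part of any Envoy tooling); it surfaces the distinction that matters here: a resolver that is reachable but returns no records is still a resolution failure.

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the container's configured resolver yields at least one address record."""
    try:
        # getaddrinfo reads the same /etc/resolv.conf the container sees,
        # so it approximates what Envoy's own resolver will observe.
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised both for NXDOMAIN and for "NOERROR with an empty answer section".
        return False

# Preflight: warn early instead of deadlocking on the startup probe.
# The hostname is the control plane Service from this incident.
xds_host = "envoy-gateway.envoy-gateway-system.svc.cluster.local"
if not can_resolve(xds_host):
    print(f"DNS preflight failed: {xds_host} does not resolve; check dnsPolicy / resolv.conf")
```

Running this from inside the failing container (e.g. via kubectl exec) reproduces exactly the lookup the xDS bootstrap depends on.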

Real-World Impact

  • Service Mesh Outage: If this occurs during an upgrade or deployment, the data plane nodes effectively go dark, dropping all traffic passing through that node.
  • Control Plane Blackhole: The control plane cannot push updates to the data plane, leaving Envoy stuck with stale configuration.
  • Debugging Complexity: The logs show “DNS success” (because the external server responded) but “no data,” which can be misleading. Engineers often look for network blocks or firewall issues rather than DNS configuration mismatches.

Example or Code

Below is the configuration required to reproduce or fix this issue. The critical fix is ensuring dnsPolicy is set to ClusterFirst (if not using hostNetwork) or configuring dnsConfig to point to the internal CoreDNS service IP if hostNetwork is mandatory.

The Problematic Configuration (Host Network):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: envoy-proxy
spec:
  template:
    spec:
      hostNetwork: true # Inherits Node DNS, losing access to coredns
      containers:
      - name: envoy
        image: envoy:distroless
        # ...

The Corrected Configuration (Standard Cluster DNS):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: envoy-proxy
spec:
  template:
    spec:
      # 1. Remove hostNetwork: true or ensure dnsPolicy is correct
      hostNetwork: false
      dnsPolicy: ClusterFirst

      containers:
      - name: envoy
        image: envoy:distroless
        startupProbe:
          httpGet:
            path: /ready
            port: 19003
          failureThreshold: 30
          periodSeconds: 2
        # ...

Alternative Fix (If HostNetwork is Mandatory):
The simplest option is dnsPolicy: ClusterFirstWithHostNet, which keeps cluster DNS resolution while on the host network. If you need full control instead, set dnsPolicy: "None" and point dnsConfig at the cluster DNS Service IP (conventionally the .10 address of the service CIDR, e.g. 10.96.0.10; confirm with kubectl get svc -n kube-system kube-dns).

spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - "10.96.0.10" # Example CoreDNS Cluster IP
        searches:
          - "envoy-gateway-system.svc.cluster.local"
          - "svc.cluster.local"
          - "cluster.local"
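To make this failure mode fail fast and loudly instead of deadlocking in the startup probe, a common hardening pattern is an init container that blocks until the control plane name resolves. A sketch (the busybox image and 2-second retry interval are illustrative choices, not requirements):

```yaml
spec:
  template:
    spec:
      initContainers:
      - name: wait-for-xds-dns
        image: busybox:1.36   # illustrative; any image that ships nslookup works
        command:
        - sh
        - -c
        - |
          until nslookup envoy-gateway.envoy-gateway-system.svc.cluster.local; do
            echo "waiting for control plane DNS..."
            sleep 2
          done
```

With this in place, a DNS misconfiguration shows up as an Init:0/1 pod with an obvious log line, rather than a startup-probe "connection refused".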

How Senior Engineers Fix It

  1. Verify DNS Resolution Inside the Container: Immediately kubectl exec into the failing pod and attempt to resolve the control plane service using nslookup or dig.

    kubectl exec -it <pod-name> -c envoy -- nslookup envoy-gateway.envoy-gateway-system.svc.cluster.local

    If this fails, the problem is DNS configuration, not Envoy.

  2. Inspect resolv.conf: Check the /etc/resolv.conf file inside the container.

    kubectl exec -it <pod-name> -c envoy -- cat /etc/resolv.conf

    Compare this with the host node’s /etc/resolv.conf and the Cluster DNS Service IP. If they match the node (and the node cannot resolve internal names), the dnsPolicy is wrong.

  3. Check Pod Spec: Review the dnsPolicy and hostNetwork settings. The standard fix is dnsPolicy: ClusterFirst (the default without hostNetwork). If hostNetwork: true is strictly required, set dnsPolicy: ClusterFirstWithHostNet, or use dnsPolicy: "None" and manually inject the Cluster DNS IP into dnsConfig.

  4. Check xDS Connectivity: If DNS works, verify that the Envoy container can reach the resolved control plane address and port; even an HTTP error response proves L4 connectivity (the xDS stream itself is gRPC).

    kubectl exec -it <pod-name> -c envoy -- curl -v http://<control-plane-ip>:18000/v1/discovery:listeners
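For the resolv.conf comparison in step 2, the tell is usually visible in the file contents themselves. The values below are illustrative, but the shape is typical: a ClusterFirst pod points at the cluster DNS Service IP and carries cluster.local search domains, while a hostNetwork pod carries the node’s upstream resolvers:

```text
# ClusterFirst pod (healthy):
nameserver 10.96.0.10
search envoy-gateway-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# hostNetwork pod inheriting the node (broken for cluster-internal names):
nameserver 8.8.8.8
nameserver 1.1.1.1
search example.internal
```

If the second shape appears in a pod that must resolve cluster Services, the dnsPolicy is wrong regardless of what Envoy’s logs say.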

Why Juniors Miss It

  • Focus on Envoy Configuration: Juniors often assume the error grpc ERROR or upstream reset is caused by Envoy YAML (Bootstrap config) or xDS RBAC issues. They spend hours analyzing control plane logs instead of checking basic network connectivity.
  • Misinterpreting “DNS Success”: The Envoy log [2026-01-17...] dns resolution for envoy-gateway... completed with status 0x0000: "cares_success:Successful completion" is deceptive. It looks like DNS is working. Juniors often don’t realize that NOERROR (empty answer) is a failure for A-record lookups.
  • Overlooking hostNetwork Side Effects: Engineers enable hostNetwork: true to solve a CNI or port-conflict issue but fail to recognize that the Pod now inherits the node’s /etc/resolv.conf (unless dnsPolicy: ClusterFirstWithHostNet is set), silently breaking internal service discovery.