Summary
A pod running Envoy 1.6.2 (data plane) in Kubernetes failed to become ready because the startup probe could not connect to the Envoy admin interface (:19003/ready). Envoy logs indicated a persistent gRPC connection failure and DNS resolution issues when trying to establish the xDS connection to the control plane (envoy-gateway.envoy-gateway-system.svc.cluster.local). The root cause was identified as a Kubernetes DNS configuration mismatch: the data plane Pod was configured to use the host node’s DNS resolver (hostNetwork: true or custom dnsConfig), which lacked access to the cluster’s internal DNS service (coredns), preventing resolution of internal ClusterIP services.
Root Cause
The fundamental issue was a DNS resolution failure for the control plane service within the Envoy container. Because the data plane pod could not resolve the internal Kubernetes service name envoy-gateway.envoy-gateway-system.svc.cluster.local to a ClusterIP, the xDS gRPC stream could not be established.
The specific sequence of failure was:
- Configuration Mismatch: The data plane pod was running with `hostNetwork: true` or a custom `dnsConfig` that pointed to external resolvers (e.g., `8.8.8.8`).
- DNS Query Failure: Envoy issued a DNS query for the control plane service. The external DNS server returned `NOERROR` (a successful query with an empty answer) because it has no records for internal Kubernetes services.
- gRPC Connection Timeout: Since no IP was resolved, Envoy had no address to connect to and timed out trying to establish the xDS stream.
- Startup Probe Failure: Envoy’s main thread was blocked waiting for the xDS configuration. Consequently, the admin server on port `19003` was not fully initialized or accepting connections, causing the Kubernetes startup probe to fail with “connection refused.”
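The second step above is the subtle one: a resolver can report “success” while returning nothing usable. The sketch below (hypothetical names `DnsReply` and `is_resolved`, not Envoy’s actual resolver code) shows why a lookup must check both the response code and the answer section:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DnsReply:
    rcode: str          # "NOERROR", "NXDOMAIN", "SERVFAIL", ...
    answers: List[str]  # resolved A/AAAA records

def is_resolved(reply: DnsReply) -> bool:
    # A lookup only succeeds if the server answered AND returned records.
    # NOERROR with an empty answer section means "query understood, but
    # no A record for that name" -- which is what the external resolver
    # returned for the internal service name in this incident.
    return reply.rcode == "NOERROR" and len(reply.answers) > 0

# External resolver answering for an internal service name:
external = DnsReply(rcode="NOERROR", answers=[])
# CoreDNS answering for the same name (illustrative ClusterIP):
internal = DnsReply(rcode="NOERROR", answers=["10.96.113.5"])

assert not is_resolved(external)   # "success" status, but no usable IP
assert is_resolved(internal)
```

This is exactly the trap discussed later: logging the response code alone makes the failed lookup look healthy.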
Why This Happens in Real Systems
This is a common anti-pattern in Service Mesh and CNI configurations where L7 proxies (like Envoy) interact with L4 networking constraints.
- Service Discovery Dependency: Envoy relies entirely on the control plane to bootstrap its configuration. The control plane is almost always an internal ClusterIP Service.
- DNS as the Bootstrap Mechanism: The very first step of Envoy’s startup is resolving the address of the xDS server. If the DNS resolver in the container cannot see the internal Kubernetes DNS (`coredns`), the system deadlocks: no config means no server, and no server means the health check fails.
- HostNetwork Isolation: When `hostNetwork: true` is used (often for performance or CNI reasons), the Pod inherits the Node’s `/etc/resolv.conf`. If the Node’s DNS is not configured to forward queries for `cluster.local` to the Cluster DNS IP, internal service names will never resolve.
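The interaction between `dnsPolicy` and `hostNetwork` can be summarized as a small decision table. This simplified sketch follows the documented Kubernetes `dnsPolicy` semantics (including `ClusterFirstWithHostNet`, which is the policy designed for host-network pods):

```python
def effective_resolver(dns_policy: str, host_network: bool) -> str:
    """Which resolv.conf a pod ends up with, per Kubernetes dnsPolicy
    semantics (simplified sketch of the documented behavior)."""
    if dns_policy == "None":
        return "dnsConfig"            # fully manual, from the pod spec
    if dns_policy == "ClusterFirstWithHostNet":
        return "cluster-dns"          # cluster DNS even with hostNetwork
    if dns_policy == "ClusterFirst" and not host_network:
        return "cluster-dns"
    # "Default", or "ClusterFirst" combined with hostNetwork: true,
    # silently inherits the node's /etc/resolv.conf.
    return "node-resolv-conf"

# The failure mode in this incident:
assert effective_resolver("ClusterFirst", host_network=True) == "node-resolv-conf"
# Two working configurations:
assert effective_resolver("ClusterFirst", host_network=False) == "cluster-dns"
assert effective_resolver("ClusterFirstWithHostNet", host_network=True) == "cluster-dns"
```

The key row is the first assertion: `ClusterFirst` plus `hostNetwork: true` quietly degrades to the node’s resolver, which is the deadlock described above.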
Real-World Impact
- Service Mesh Outage: If this occurs during an upgrade or deployment, the data plane pods on the affected nodes effectively go dark, dropping all traffic passing through those nodes.
- Control Plane Blackhole: The control plane cannot push updates to the data plane, leaving Envoy stuck with stale configuration.
- Debugging Complexity: The logs show “DNS success” (because the external server responded) but “no data,” which can be misleading. Engineers often look for network blocks or firewall issues rather than DNS configuration mismatches.
Example or Code
Below is the configuration required to reproduce or fix this issue. The critical fix is ensuring `dnsPolicy` is set to `ClusterFirst` (if not using `hostNetwork`) or configuring `dnsConfig` to point to the internal CoreDNS service IP if `hostNetwork` is mandatory.
The Problematic Configuration (Host Network):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: envoy-proxy
spec:
  template:
    spec:
      hostNetwork: true   # Inherits Node DNS, losing access to coredns
      containers:
        - name: envoy
          image: envoy:distroless
          # ...
```
The Corrected Configuration (Standard Cluster DNS):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: envoy-proxy
spec:
  template:
    spec:
      # 1. Remove hostNetwork: true or ensure dnsPolicy is correct
      hostNetwork: false
      dnsPolicy: ClusterFirst
      containers:
        - name: envoy
          image: envoy:distroless
          startupProbe:
            httpGet:
              path: /ready
              port: 19003
            failureThreshold: 30
            periodSeconds: 2
          # ...
```
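To prevent this class of regression in CI, the check is simple enough to script. The sketch below is a hypothetical lint helper (not a real kubectl plugin) that flags the exact mismatch from this incident when given a pod spec as a dict:

```python
def lint_pod_dns(pod_spec: dict) -> list:
    """Flag DNS configurations that cannot reach cluster DNS."""
    warnings = []
    host_net = pod_spec.get("hostNetwork", False)
    policy = pod_spec.get("dnsPolicy", "ClusterFirst")
    if host_net and policy == "ClusterFirst":
        warnings.append(
            "hostNetwork: true makes ClusterFirst inherit the node's "
            "resolv.conf; use ClusterFirstWithHostNet, or dnsPolicy: None "
            "with an explicit dnsConfig"
        )
    if policy == "None" and not pod_spec.get("dnsConfig", {}).get("nameservers"):
        warnings.append("dnsPolicy: None requires dnsConfig.nameservers")
    return warnings

broken = {"hostNetwork": True, "dnsPolicy": "ClusterFirst"}
fixed = {"hostNetwork": False, "dnsPolicy": "ClusterFirst"}
assert lint_pod_dns(broken)        # flags the incident's configuration
assert not lint_pod_dns(fixed)     # the corrected manifest passes
```

Running such a check against rendered manifests catches the mismatch before a rollout, rather than at startup-probe time.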
Alternative Fix (If HostNetwork is Mandatory):
You must explicitly define `dnsConfig` to use the Cluster DNS service (usually `.10` or `.100` of the service CIDR).
```yaml
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - "10.96.0.10"   # Example CoreDNS Cluster IP
        searches:
          - "envoy-gateway-system.svc.cluster.local"
          - "svc.cluster.local"
          - "cluster.local"
```
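With `dnsPolicy: "None"`, kubelet writes the pod’s `/etc/resolv.conf` directly from these fields. A minimal sketch of that rendering (hypothetical helper, mirroring the Kubernetes `dnsConfig` schema) makes it easy to see what the container will actually get:

```python
def render_resolv_conf(dns_config: dict) -> str:
    """Render a resolv.conf from a Kubernetes dnsConfig block."""
    lines = ["nameserver " + ns for ns in dns_config.get("nameservers", [])]
    searches = dns_config.get("searches", [])
    if searches:
        lines.append("search " + " ".join(searches))
    for opt in dns_config.get("options", []):
        value = opt.get("value")
        lines.append("options " + opt["name"] + ((":" + value) if value else ""))
    return "\n".join(lines) + "\n"

cfg = {
    "nameservers": ["10.96.0.10"],
    "searches": [
        "envoy-gateway-system.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
    ],
}
print(render_resolv_conf(cfg))
```

The `search` line is what lets the bootstrap config refer to `envoy-gateway.envoy-gateway-system` (or even the bare service name) and still resolve.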
How Senior Engineers Fix It
- Verify DNS Resolution Inside the Container: Immediately `kubectl exec` into the failing pod and attempt to resolve the control plane service using `nslookup` or `dig`: `kubectl exec -it <pod-name> -c envoy -- nslookup envoy-gateway.envoy-gateway-system.svc.cluster.local`. If this fails, the problem is DNS configuration, not Envoy.
- Inspect `resolv.conf`: Check the `/etc/resolv.conf` file inside the container: `kubectl exec -it <pod-name> -c envoy -- cat /etc/resolv.conf`. Compare it with the host node’s `/etc/resolv.conf` and the Cluster DNS Service IP. If it matches the node (and the node cannot resolve internal names), the `dnsPolicy` is wrong.
- Check Pod Spec: Review the `dnsPolicy` and `hostNetwork` settings. The standard fix is to set `dnsPolicy: ClusterFirst` (the default). If `hostNetwork: true` is strictly required, use `dnsPolicy: "None"` and manually inject the Cluster DNS IP into `dnsConfig`.
- Check xDS Connectivity: If DNS works, check whether the Envoy process can reach the IP provided by DNS: `kubectl exec -it <pod-name> -c envoy -- curl -v http://<control-plane-ip>:18000/v1/discovery:listeners`.
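When reading the `resolv.conf` in step 2, remember that the `search` and `ndots` settings change what a short name like `envoy-gateway.envoy-gateway-system` actually queries. A simplified sketch of the resolver’s search-list expansion (glibc-style heuristic) shows the candidate FQDNs it will try:

```python
def candidate_fqdns(name: str, searches: list, ndots: int = 5) -> list:
    """Order of lookups a resolver attempts for a given name (sketch)."""
    if name.endswith("."):
        return [name]                           # already fully qualified
    absolute = name + "."
    expanded = [name + "." + s + "." for s in searches]
    # Few dots -> try search domains first; many dots -> try as-is first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]

searches = [
    "envoy-gateway-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
print(candidate_fqdns("envoy-gateway", searches))
```

This is why a bare `nslookup envoy-gateway` works inside a correctly configured pod but fails on the node: without the cluster search domains, the short name never expands to the internal FQDN.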
Why Juniors Miss It
- Focus on Envoy Configuration: Juniors often assume an error like `grpc ERROR` or `upstream reset` is caused by Envoy YAML (bootstrap config) or xDS RBAC issues. They spend hours analyzing control plane logs instead of checking basic network connectivity.
- Misinterpreting “DNS Success”: The Envoy log line `[2026-01-17...] dns resolution for envoy-gateway... completed with status 0x0000: "cares_success:Successful completion"` is deceptive. It looks like DNS is working. Juniors often don’t realize that `NOERROR` with an empty answer is a failure for A-record lookups.
- Overlooking `hostNetwork` Side Effects: Engineers enable `hostNetwork: true` to solve a CNI or port-conflict issue but fail to recognize that this silently negates `dnsPolicy: ClusterFirst`, breaking internal service discovery.