Certain pods resolving everything to 15.197.172.60

Summary

Some pods in K3s clusters resolve every hostname to 15.197.172.60, an address whose reverse lookup points to AWS Global Accelerator. As a result, ArgoCD cannot reach github.com, and other services fail during the TLS handshake because they are connecting to the wrong host.

Root Cause

The root cause is DNS misconfiguration on the nodes, which the pods inherit. Key factors include:

  • Wildcard entries on the DHCP server, which can cause names that should not exist (for example, a hostname with a search domain appended) to resolve to a catch-all address
  • Complex resolv.conf files with several search domains or nameservers, whose settings are passed from the node to the cluster DNS and to the pods
  • Inconsistent DNS settings across nodes in a multi-node cluster, which would explain why only pods scheduled on certain nodes resolve hostnames incorrectly
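
The interaction between a search list and a wildcard entry can be sketched in shell. The function below is a simplified model, not the real resolver: it reads only the search and options ndots:N lines of a resolv.conf and prints, in order, the names a glibc-style stub resolver would try. The function name expand_candidates and the lan.example search domain are hypothetical; ndots:5 is the value Kubernetes normally writes into pod resolv.conf files.

```shell
# Print, in order, the names a glibc-style stub resolver would try for
# hostname $2 given the resolv.conf at path $1.
# Simplified sketch: honours only "search" and "options ndots:N".
expand_candidates() {
    conf=$1
    name=$2

    # A trailing dot marks the name as fully qualified: no search expansion.
    case $name in
        *.) printf '%s\n' "$name"; return 0 ;;
    esac

    # The last "search" line wins.
    search=$(awk '$1 == "search" { $1 = ""; s = $0 } END { print s }' "$conf")

    # ndots defaults to 1 when the option is absent.
    ndots=$(awk '
        $1 == "options" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ndots:/) { split($i, a, ":"); n = a[2] }
        }
        END { print (n == "" ? 1 : n) }
    ' "$conf")

    # Count the dots in the queried name.
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    dots=$((dots + 0))

    # Enough dots: the name is tried as-is before the search list.
    if [ "$dots" -ge "$ndots" ]; then
        printf '%s.\n' "$name"
    fi
    for domain in $search; do
        printf '%s.%s.\n' "$name" "$domain"
    done
    # Too few dots: the name is tried as-is only after the search list.
    if [ "$dots" -lt "$ndots" ]; then
        printf '%s.\n' "$name"
    fi
}

# Usage (path is an example): expand_candidates /etc/resolv.conf github.com
```

With a pod resolv.conf whose search list ends in a node-provided domain such as lan.example, github.com has fewer than five dots, so github.com.lan.example. is tried before github.com. itself. If a wildcard answers for that domain, the pod never reaches the real name.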

Why This Happens in Real Systems

This class of failure is common in real systems because of:

  • Inherited DNS configuration: Nodes often take their resolver settings from DHCP, and those settings propagate into the cluster without anyone reviewing them
  • Network complexity: Multi-node clusters stack several resolvers (pod, cluster DNS, node, upstream), and each layer is another place where a query can be rewritten or misrouted
  • Dependency on external infrastructure: The returned address belongs to AWS Global Accelerator, a shared third-party service, so the wrong answer looks like a legitimate endpoint and is harder to recognize as a DNS fault

Real-World Impact

The impact includes:

  • Service disruptions: Workloads cannot connect to external services because their hostnames resolve to the wrong address
  • Security risks: Traffic is sent to an unintended host. The TLS handshake failures and unrecognized-name errors are what stop the connection, so a client that skips certificate verification could exchange data with the wrong server
  • Debugging challenges: The symptoms surface as TLS or connectivity errors rather than DNS errors, and only some pods are affected, which makes the root cause hard to trace

Example or Code

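# Reverse lookup: who owns the address that every name resolves to?
dig -x 15.197.172.60 +short
# Output: a63452c77db78f54b.awsglobalaccelerator.com.

# Expose the cluster DNS service on local port 1053 (keep this running in a separate terminal)
kubectl port-forward -n kube-system svc/kube-dns 1053:53
# Forwarding from 127.0.0.1:1053 -> 53
# Forwarding from [::1]:1053 -> 53

# Query the cluster DNS directly; +tcp is required because port-forward carries TCP only
dig @127.0.0.1 +tcp -p1053 apple.com +short
# Output: 17.253.144.10 (an Apple address, so the cluster DNS answers this query correctly)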
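
To confirm the symptom from inside an affected pod, it can help to resolve several unrelated names and check whether they collapse to one address. The helper below is a minimal sketch; the name shared_addresses and the default threshold of three are assumptions made for illustration. It reads "hostname address" pairs on standard input and prints every address that at least that many distinct hostnames resolved to.

```shell
# Read "hostname address" pairs on stdin and print each address shared by
# at least $1 distinct hostnames (default 3), followed by the hostname count.
shared_addresses() {
    threshold=${1:-3}
    awk -v min="$threshold" '
        # Count each hostname/address pair once, ignoring malformed lines.
        NF >= 2 && !seen[$1 SUBSEP $2]++ { count[$2]++ }
        END {
            for (addr in count)
                if (count[addr] >= min) print addr, count[addr]
        }
    ' | sort
}

# Usage inside a pod, assuming the image ships getent:
# for h in github.com apple.com example.org; do
#     printf '%s %s\n' "$h" "$(getent hosts "$h" | awk '{ print $1; exit }')"
# done | shared_addresses
```

Any output at all is suspicious: unrelated public hostnames should not share a single address. An empty result on one node and a hit on another points to node-specific DNS settings.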

How Senior Engineers Fix It

Senior engineers typically fix this by:

  • Simplifying resolv.conf files: Using a minimal resolv.conf that lists only reliable nameservers such as 1.1.1.1 and 8.8.8.8, without unnecessary search domains
  • Eliminating wildcard entries: Removing the wildcard entries from the DHCP server so that nonexistent names fail to resolve instead of returning a catch-all address
  • Ensuring consistent DNS settings: Applying the same DNS configuration to every node so that resolution does not depend on where a pod is scheduled
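
The first and third fixes can be sketched as two small shell helpers. Both function names and the /etc/k3s-resolv.conf path are hypothetical. The --resolv-conf option mentioned in the comments is, to my knowledge, the K3s flag for handing the kubelet an alternative resolv.conf; verify it against the documentation for your K3s version before relying on it.

```shell
# Write a resolv.conf that contains only nameserver lines. With no search or
# domain directive, no DHCP-provided suffix is appended to queries.
# $1 is the target path; the remaining arguments are nameserver addresses.
write_minimal_resolv_conf() {
    target=$1
    shift
    : > "$target" || return 1
    for ns in "$@"; do
        printf 'nameserver %s\n' "$ns" >> "$target"
    done
}

# Exit 0 when the file at $1 still carries a search or domain directive,
# non-zero otherwise. Useful for auditing every node for consistency.
has_search_domain() {
    grep -Eq '^[[:space:]]*(search|domain)[[:space:]]' "$1"
}

# Usage (hypothetical path; run the same steps on every node):
# write_minimal_resolv_conf /etc/k3s-resolv.conf 1.1.1.1 8.8.8.8
# has_search_domain /etc/k3s-resolv.conf && echo "search domain still present"
# Then start K3s with: --resolv-conf /etc/k3s-resolv.conf  (assumed flag, see above)
```

Generating the file from one script, rather than editing each node by hand, keeps the nodes identical and makes drift easy to detect.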

Why Juniors Miss It

Junior engineers may miss this because of:

  • Limited understanding of DNS configuration: They are unfamiliar with how search domains, resolv.conf options, and upstream resolvers interact
  • Limited experience with complex networks: They have rarely debugged multi-node clusters where behavior differs from node to node
  • Overlooking critical details: They chase the TLS errors and do not notice the wildcard entries or the inconsistent DNS settings behind them