How to Diagnose Kubernetes Pod Network Issues: Tools, Models, and Real‑World Cases
This article introduces a systematic approach for troubleshooting Kubernetes pod network problems, covering common failure models, essential diagnostic tools such as tcpdump, nsenter, paping and mtr, and detailed case studies that illustrate step‑by‑step analysis and resolution techniques.
1. Pod Network Anomalies
Network anomalies can be classified into several categories:
Network unreachable – ping fails, caused by firewall rules, incorrect routing, high system load, or link failures.
Port unreachable – ping works but telnet fails, caused by firewall restrictions, high load, or the application not listening.
DNS resolution failure – domain names cannot be resolved while IP connectivity works, caused by incorrect pod DNS settings, DNS service issues, or communication problems with the DNS service.
Large packet loss – small packets succeed but large packets are dropped, often due to MTU mismatches; you can test with ping -s.
CNI plugin issues – node can communicate but pods cannot reach cluster addresses, often due to kube‑proxy or CIDR exhaustion.
The overall classification is illustrated in the following diagram:
In summary, the most common pod network failures are network unreachable, port unreachable, DNS resolution errors, and large‑packet loss.
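For the large-packet case, the arithmetic behind a ping -s probe looks like the sketch below. The 28-byte overhead is standard ICMP-over-IPv4 (20-byte IP header plus 8-byte ICMP header); the target address is a placeholder, not from the article.

```shell
# Largest unfragmented ICMP payload on a 1500-byte MTU link:
# MTU minus 20 bytes IP header minus 8 bytes ICMP header.
MTU=1500
PAYLOAD=$((MTU - 28))
echo "$PAYLOAD"   # -> 1472

# Actual probe (placeholder target; -M do forbids fragmentation,
# so an oversized packet fails instead of being silently split):
# ping -M do -s "$PAYLOAD" -c 3 10.0.0.1
```

If ping succeeds at this size but fails one byte higher, the path MTU is confirmed.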
2. Common Network Diagnostic Tools
tcpdump
tcpdump is a powerful packet capture tool. Installation commands:
Ubuntu/Debian:

```shell
apt-get install -y tcpdump
```

CentOS/Fedora:

```shell
yum install -y tcpdump
```

Alpine:

```shell
apk add tcpdump --no-cache
```

Typical usage examples:
```shell
tcpdump -D                    # list available capture interfaces
tcpdump host 1.1.1.1          # traffic to or from a host
tcpdump src 1.1.1.1           # by source address (use dst for destination)
tcpdump net 1.2.3.0/24        # traffic for a subnet
tcpdump -c 1 -X icmp          # capture one ICMP packet, print hex/ASCII
tcpdump port 3389             # a single port
tcpdump portrange 21-23       # a port range
tcpdump less 32               # packets of 32 bytes or less
tcpdump greater 64            # packets of 64 bytes or more
tcpdump -w capture_file       # write raw packets to a file
```

Logical operators can be combined, e.g.:
```shell
tcpdump -i eth0 -nn host 220.181.57.216 and 10.0.0.1
tcpdump -i eth0 -nn host 220.181.57.216 or 10.0.0.1
tcpdump -i eth0 -nn host 10.0.0.1 and \(10.0.0.9 or 10.0.0.3\)
```

Note that parentheses must be escaped (or the whole filter quoted) so the shell does not interpret them. TCP flag filters (RST, SYN, ACK, etc.) are also supported.
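The flag filters use pcap's tcp[tcpflags] expression syntax. A hedged sketch of building such a capture command (eth0 and the packet count are placeholders, not from the article):

```shell
# Capture only SYN packets; single-quoting the filter keeps the shell
# from interpreting the brackets and the & operator.
IFACE=eth0
FILTER='tcp[tcpflags] & tcp-syn != 0'
CMD="tcpdump -i $IFACE -nn -c 100 '$FILTER'"
echo "$CMD"
# An RST-only filter ('tcp[tcpflags] & tcp-rst != 0') is handy when
# connections appear to be actively refused or reset.
```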
nsenter
nsenter allows entering a process’s network namespace. Example syntax:
```shell
nsenter -t <pid> -n <command>
```

To inspect a pod’s network from the host:
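Before entering a namespace it can help to confirm which one a process actually belongs to. On Linux, every process exposes its network namespace at /proc/&lt;pid&gt;/ns/net, and two processes in the same namespace resolve to the same inode. A minimal sketch (PID 30858 follows the article's example and is a placeholder):

```shell
# This shell's own network namespace identity, e.g. net:[4026531992]
readlink /proc/$$/ns/net

# Compare with the pod's process before entering it:
# readlink /proc/30858/ns/net
# nsenter -t 30858 -n ip addr
```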
```shell
# Find the pod’s main process PID
ps -ef | grep tail
# Enter its network namespace and inspect interfaces
nsenter -t 30858 -n ifconfig
```

paping
paping continuously pings a target TCP port, useful for testing connectivity and packet loss.
```shell
paping -p 80 -c 10 example.com
```

Installation dependencies vary by OS (e.g., libstdc++.i686 on RHEL/CentOS).
mtr
mtr combines traceroute and ping, providing loss percentage, latency statistics, and more.
```shell
mtr google.com          # interactive trace with live statistics
mtr -n google.com       # skip reverse DNS resolution
mtr -b google.com       # show both hostnames and IP addresses
mtr -c 5 google.com     # send 5 probes per hop, then stop
```

Key columns: Loss%, Snt, Last, Avg, Best, Wrst, StDev. Loss% > 0 indicates a possible issue; a high StDev suggests unstable latency.
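Reading the Loss% column can be scripted. The report text below is a made-up sample in mtr -rn style (not real data), used to show the parsing:

```shell
# Illustrative mtr report; hop 2 shows 40% loss.
report='HOST: node1         Loss%  Snt  Last  Avg  Best  Wrst StDev
 1.|-- 10.0.0.1            0.0%    5   0.4   0.5   0.4   0.6   0.1
 2.|-- 172.16.0.1         40.0%    5  10.2  11.0  10.1  12.3   0.9
 3.|-- 8.8.8.8             0.0%    5  15.0  15.2  14.8  15.9   0.4'

# Print any hop whose Loss% column is above zero.
echo "$report" | awk 'NR > 1 && $3+0 > 0 {print $2, $3}'
```

Note that loss at an intermediate hop with 0% at the destination usually means that router rate-limits ICMP replies rather than genuine packet loss; only loss that persists through to the final hop indicates a real problem.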
Tips: For more network tools, refer to additional resources.
3. Pod Network Troubleshooting Workflow
The troubleshooting process follows the diagram below:
Pod network troubleshooting idea
4. Case Studies
Node Expansion – Service Unreachable
After adding a new worker node, the node could not reach the ClusterIP of a registry service, while other nodes worked fine.
Investigation steps:
Verified CNI plugin (flannel vxlan) and kube‑proxy (iptables) were functioning.
Confirmed the registry pod itself was reachable.
Checked iptables NAT rules – they were correct.
Examined routing tables; the problematic node had two IP addresses on the same NIC (static + DHCP), causing IP conflict.
Resolution: Removed the DHCP configuration (set BOOTPROTO="none"), then restarted Docker and kubelet.
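The duplicate-address symptom in this case can be spotted mechanically. A sketch that counts IPv4 addresses per interface; the sample text mimics `ip -4 -o addr show` output and is illustrative, not from the real cluster:

```shell
# Illustrative output: eth0 carries two IPv4 addresses (static + DHCP).
sample='2: eth0    inet 10.0.0.5/24 scope global eth0
2: eth0    inet 10.0.0.117/24 scope global secondary eth0
3: eth1    inet 192.168.1.10/24 scope global eth1'

# Count addresses per interface; print any NIC with more than one.
echo "$sample" | awk '{count[$2]++} END {for (d in count) if (count[d] > 1) print d}'
```

On a live node, pipe `ip -4 -o addr show` into the same awk program instead of the sample variable.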
External Cloud Host – Timeout
A cloud VM could telnet to a NodePort service but HTTP POST requests timed out.
Analysis revealed large packets (>1400 bytes) were repeatedly retransmitted due to MTU mismatch (host MTU 1500 vs. Calico tunnel MTU 1440).
Fix: Align the two MTU values, either by lowering the host NIC MTU to 1440 or by raising Calico's tunnel MTU to 1500 (the latter only works if the underlying network can carry the extra encapsulation overhead).
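The arithmetic behind the ~1400-byte threshold observed here: with a tunnel MTU of 1440, the largest TCP payload that fits in one segment is 1440 minus the 20-byte IP header and the 20-byte TCP header.

```shell
# Max TCP payload through a 1440-byte tunnel:
# 1440 (tunnel MTU) - 20 (IP header) - 20 (TCP header)
TUNNEL_MTU=1440
MAX_TCP_PAYLOAD=$((TUNNEL_MTU - 20 - 20))
echo "$MAX_TCP_PAYLOAD"   # -> 1400
```

Any POST body larger than this, sent by a host that believes the path MTU is 1500, produces oversized segments that the tunnel drops, which is what showed up as repeated retransmissions in the capture.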
Pod Accessing Object Storage – DNS Timeout
Pods could reach the storage IP but failed DNS resolution for the storage domain.
Root cause: kube‑proxy pods on newly added nodes were pending because they lacked the highest priority class; when resources were scarce, kube‑proxy was evicted, breaking service/DNS access.
Solution: Assign the system-node-critical priority class to kube‑proxy and add readiness probes for dependent pods.
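A minimal sketch of what this fix looks like in the kube-proxy DaemonSet spec; system-node-critical is one of Kubernetes' built-in priority classes, which prevents the pod from being evicted or left pending under resource pressure:

```yaml
# Fragment of the kube-proxy DaemonSet pod template:
spec:
  template:
    spec:
      priorityClassName: system-node-critical
```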
Source: https://www.cnblogs.com/Cylon/p/16611503.html
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely-read original technical articles. We focus on the transformation of operations work and aim to accompany you throughout your operations career.