Analyzing Amazon EKS Network Call Chains with Amazon Q CLI and Optimizing Common Issues
This article demonstrates how to use Amazon Q CLI's natural‑language interface to dissect the full network path from an ALB through EC2 node iptables to a pod in an Amazon EKS cluster, then examines four typical networking problems and provides concrete mitigation strategies.
Amazon EKS and ALB Ingress architecture
Amazon EKS consists of a managed control plane and EC2 worker nodes. The VPC CNI plugin assigns a VPC IP to each pod. The AWS Load Balancer Controller creates an Application Load Balancer (ALB) for Ingress resources.
Network modes
Instance mode: ALB forwards traffic to the node’s NodePort. iptables DNAT rewrites the destination to the pod IP, and the source IP is SNAT-ed to the node IP.
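IP mode: ALB forwards directly to the pod IP, bypassing kube-proxy and NodePort, so no node-level DNAT or SNAT is applied. Because the ALB is a layer 7 proxy, the pod sees the ALB’s address as the source, and the original client IP is carried in the X-Forwarded-For header.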
Call‑chain analysis (Instance mode)
Client resolves the ALB DNS name and completes a TCP three‑way handshake.
ALB forwards the request to the target EC2 instance’s NodePort.
Node iptables PREROUTING jumps to the KUBE‑SERVICES chain. KUBE‑SERVICES identifies the traffic as a NodePort and forwards it to KUBE‑NODEPORTS. KUBE‑NODEPORTS matches the port and jumps to a service‑specific chain (e.g., KUBE‑EXT‑LV7ZA3PWSGGQGJTO).
The service chain marks the packet for SNAT and forwards it to KUBE‑SVC‑LV7ZA3PWSGGQGJTO, where a probabilistic rule selects a backend pod.
In the selected pod’s endpoint chain, DNAT rewrites the destination to the pod IP and port. POSTROUTING performs SNAT so that the response can be routed back through the ALB.
The Linux conntrack module records the NAT state, enabling reverse translation for the response.
Common network issues and mitigation
Pod load imbalance (iptables mode)
iptables mode uses connection‑level random selection, which can concentrate traffic on a few pods under high concurrency.
Switch kube-proxy to IPVS mode to use hash tables and algorithms such as Round‑Robin or Weighted Round‑Robin.
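The per-rule probabilities behind that random selection can be worked out directly. The sketch below is illustrative only: `sep_probabilities` is a hypothetical helper, and it assumes the usual kube-proxy layout in which rule i of n endpoints carries probability 1/(n - i + 1) and the last rule is unconditional.

```shell
# Hypothetical helper: print the probability attached to each endpoint rule
# and the resulting overall share of new connections for n endpoints.
sep_probabilities() {
  awk -v n="$1" 'BEGIN {
    remaining = 1.0                 # share not yet claimed by an earlier rule
    for (i = 1; i <= n; i++) {
      p = 1 / (n - i + 1)           # conditional probability on rule i
      printf "rule %d: probability %.4f, overall share %.4f\n", i, p, remaining * p
      remaining *= (1 - p)
    }
  }'
}

sep_probabilities 3
```

For two endpoints the first rule gets probability 0.5 and the second is unconditional. The expected share per pod is equal, so any skew comes from the per-connection randomness (and from long-lived connections), not from the rule weights themselves.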
conntrack table overflow
When the number of tracked connections exceeds net.netfilter.nf_conntrack_max, new connections are dropped and “nf_conntrack: table full” messages appear.
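To judge how close a node is to that limit, the current entry count can be compared with the configured maximum. This is a minimal sketch: `conntrack_usage_pct` is a hypothetical helper, and the /proc paths are assumed to be present, which requires the nf_conntrack module to be loaded.

```shell
# Hypothetical helper: conntrack table usage as a percentage.
# $1 = current entry count, $2 = configured maximum
conntrack_usage_pct() {
  awk -v c="$1" -v m="$2" 'BEGIN {
    if (m <= 0) exit 1              # avoid dividing by zero on bad input
    printf "%.1f\n", 100 * c / m
  }'
}

# Read the live counters only if they exist on this host.
if [ -r /proc/sys/net/netfilter/nf_conntrack_count ]; then
  conntrack_usage_pct \
    "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
    "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
fi
```

A value that stays close to 100 under normal load suggests the table is undersized for the node’s traffic.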
Diagnose current usage:
# Show current conntrack table usage
sudo sysctl net.netfilter.nf_conntrack_count
# Show the maximum conntrack table size
sudo sysctl net.netfilter.nf_conntrack_max
# Check the kernel log for overflow messages
sudo dmesg | grep conntrack
Increase the table size (temporary):
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
sudo sysctl -w net.netfilter.nf_conntrack_buckets=262144
Persist the change:
echo 'net.netfilter.nf_conntrack_max=1048576' | sudo tee -a /etc/sysctl.conf
echo 'net.netfilter.nf_conntrack_buckets=262144' | sudo tee -a /etc/sysctl.conf
Adjust timeouts, e.g.:
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
Periodically clean entries for a specific service port:
sudo conntrack -D -p tcp --dport <service-port>
5xx errors during rolling updates
When a pod is terminated, the ALB may still route requests to it because the “draining” state propagation is delayed, causing 502/504 responses.
Add a preStop hook to delay container exit:
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 30"]
Increase terminationGracePeriodSeconds to give the pod more time to finish in‑flight requests.
Reduce the ALB deregistration delay, e.g.:
annotations:
  alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
Enable Pod Readiness Gates so the pod is marked ready only after successful registration with the ALB.
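These settings interact: the time spent in the preStop hook counts against terminationGracePeriodSeconds, so with the Kubernetes default of 30 seconds a 30-second sleep leaves no time for the application to shut down. The sketch below checks that budget. `check_grace_budget` is a hypothetical helper, and the application shutdown time is an estimate the operator has to supply.

```shell
# Hypothetical helper: check that the grace period covers the preStop sleep
# plus the time the application needs to finish in-flight requests.
# $1 = terminationGracePeriodSeconds, $2 = preStop sleep, $3 = app shutdown time
check_grace_budget() {
  grace=$1
  needed=$(( $2 + $3 ))
  if [ "$grace" -ge "$needed" ]; then
    echo "ok: ${grace}s covers ${needed}s"
  else
    echo "too short: ${grace}s < ${needed}s"
    return 1
  fi
}

check_grace_budget 60 30 20
```

Here a 60-second grace period covers a 30-second preStop sleep and an assumed 20 seconds of shutdown work.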
Service selector update lag
Updating a Service selector rewrites iptables rules, but existing conntrack entries continue to route traffic to the old pods until they expire.
Clear stale conntrack entries on the nodes.
Use a blue‑green deployment: create a new Service for the new deployment and switch the Ingress to the new Service, avoiding selector changes.
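For the first option, the entries to clear are those whose reply source is an old pod IP, because after DNAT the pod address appears only in the reply direction of a tracked connection. The sketch below only prints the commands so they can be reviewed before being run on a node. `stale_conntrack_cmds` is a hypothetical helper, the pod IPs are made up, and the use of the `--reply-src` filter of the conntrack tool is an assumption about how one might scope the deletion.

```shell
# Hypothetical helper: read old pod IPs (one per line) on stdin and print
# one conntrack delete command per IP. Nothing is executed.
stale_conntrack_cmds() {
  while IFS= read -r ip; do
    [ -n "$ip" ] || continue        # skip blank lines
    printf 'sudo conntrack -D -p tcp --reply-src %s\n' "$ip"
  done
}

# Example pod IPs, for illustration only.
printf '10.0.1.11\n10.0.1.12\n' | stale_conntrack_cmds
```

Deleting an entry resets the affected connections, so this is best done when the old pods are no longer expected to serve traffic.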
iptables chain walk‑through for a NodePort service
For a request arriving at NodePort 31564, the processing sequence is:
ALB sends traffic to the node’s port 31564. PREROUTING forwards to KUBE‑SERVICES. KUBE‑SERVICES identifies NodePort traffic and jumps to KUBE‑NODEPORTS. KUBE‑NODEPORTS matches port 31564 and jumps to KUBE‑EXT‑LV7ZA3PWSGGQGJTO.
The service chain marks the packet for SNAT and forwards to KUBE‑SVC‑LV7ZA3PWSGGQGJTO. KUBE‑SVC‑LV7ZA3PWSGGQGJTO selects a backend pod using a random statistic rule (probability 0.5) and jumps to the corresponding KUBE‑SEP‑… chain.
DNAT in the endpoint chain rewrites the destination to the pod IP and port. POSTROUTING performs SNAT so that the response returns via the ALB.
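The same walk can be reproduced mechanically from a rule dump. The sketch below follows the jumps from KUBE-NODEPORTS down to the DNAT targets. `nodeport_backends` is a hypothetical helper, the rules are a simplified illustration in iptables-save format (the KUBE-SEP chain names and pod addresses are made up), and real dumps contain additional match options that this parser ignores.

```shell
# Hypothetical helper: list the DNAT targets reachable from a NodePort.
# Reads iptables-save style nat rules on stdin; $1 = NodePort number.
nodeport_backends() {
  awk -v port="$1" '
    $1 == "-A" {
      chain = $2
      for (i = 3; i < NF; i++) {
        # Ignore NodePort rules for other ports.
        if (chain == "KUBE-NODEPORTS" && $i == "--dport" && $(i + 1) != port) next
        if ($i == "--to-destination") dnat[chain] = $(i + 1)
        if ($i == "-j" && $(i + 1) != "DNAT") {
          if (chain == "KUBE-NODEPORTS") start = $(i + 1)
          else jumps[chain] = jumps[chain] " " $(i + 1)
        }
      }
    }
    END { walk(start) }
    # Depth-first walk over the recorded jumps, printing endpoint chains.
    function walk(c,    n, t, k) {
      if (c == "" || (c in seen)) return
      seen[c] = 1
      if (c in dnat) print c " -> " dnat[c]
      n = split(jumps[c], t, " ")
      for (k = 1; k <= n; k++) walk(t[k])
    }
  '
}

# Simplified, illustrative rule set.
rules=$(cat <<'EOF'
-A KUBE-NODEPORTS -p tcp -m tcp --dport 31564 -j KUBE-EXT-LV7ZA3PWSGGQGJTO
-A KUBE-EXT-LV7ZA3PWSGGQGJTO -j KUBE-MARK-MASQ
-A KUBE-EXT-LV7ZA3PWSGGQGJTO -j KUBE-SVC-LV7ZA3PWSGGQGJTO
-A KUBE-SVC-LV7ZA3PWSGGQGJTO -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SVC-LV7ZA3PWSGGQGJTO -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m tcp -j DNAT --to-destination 10.0.1.11:8080
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -m tcp -j DNAT --to-destination 10.0.1.12:8080
EOF
)
printf '%s\n' "$rules" | nodeport_backends 31564
```

On a node, the input would come from `sudo iptables-save -t nat` instead of the inline sample.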
conntrack role in Kubernetes Service networking
conntrack maintains a state table for every connection, enabling:
Tracking of source/destination IPs, ports, protocol, and connection state (NEW, ESTABLISHED, RELATED, INVALID).
Support for NAT: iptables DNAT for inbound traffic is recorded, and conntrack automatically performs reverse NAT for outbound packets.
Session affinity: repeated requests from the same client can be directed to the same pod.
Challenges include table capacity limits, stale connections after Service selector changes, and performance overhead under heavy load.
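The NAT bookkeeping is visible in the table itself: each entry lists the original tuple followed by the reply tuple, and after DNAT the reply source is the pod that actually serves the connection. The sketch below extracts that mapping. `conntrack_dnat_map` is a hypothetical helper, and the sample entry (addresses, ports, timer) is made up to resemble the output of `conntrack -L`.

```shell
# Hypothetical helper: for each conntrack entry on stdin, print
# "original destination -> reply source", i.e. the DNAT mapping.
conntrack_dnat_map() {
  awk '{
    nsrc = ndst = nsp = ndp = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^src=/)   src[++nsrc] = substr($i, 5)
      if ($i ~ /^dst=/)   dst[++ndst] = substr($i, 5)
      if ($i ~ /^sport=/) sp[++nsp]   = substr($i, 7)
      if ($i ~ /^dport=/) dp[++ndp]   = substr($i, 7)
    }
    # First dst/dport: original direction. Second src/sport: reply direction.
    if (nsrc >= 2 && nsp >= 2 && ndst >= 1 && ndp >= 1)
      printf "%s:%s -> %s:%s\n", dst[1], dp[1], src[2], sp[2]
  }'
}

# Illustrative entry: NodePort 31564 on the node, DNAT-ed to a pod on 8080.
entry='tcp 6 86399 ESTABLISHED src=10.0.0.5 dst=10.0.2.10 sport=41234 dport=31564 src=10.0.1.11 dst=10.0.2.10 sport=8080 dport=41234 [ASSURED] mark=0 use=1'
printf '%s\n' "$entry" | conntrack_dnat_map
```

On a node, the input could come from `sudo conntrack -L -p tcp --dport <service-port>`.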
Summary of mitigation actions
Switch kube-proxy to IPVS for more deterministic load balancing.
Tune conntrack parameters (nf_conntrack_max, timeouts) and periodically purge stale entries.
Use preStop hooks, increase terminationGracePeriodSeconds, reduce ALB deregistration delay, and enable Pod Readiness Gates to avoid 5xx spikes during rolling updates.
Prefer blue‑green Service/Ingress changes to eliminate selector‑lag issues.
Amazon Cloud Developers
Official technical community of Amazon Cloud. Shares practical AI/ML, big data, database, modern app development, IoT content, offers comprehensive learning resources, hosts regular developer events, and continuously empowers developers.
