Why Your Golang Service Misses System DNS Cache and How to Fix It
This article explains why a Golang service running on AWS EC2 failed to use the system‑level DNS cache provided by nscd, causing excessive DNS queries that triggered request timeouts, and describes the investigation and optimization steps that resolved the issue.
Introduction
The Golang service deployed on an AWS EC2 instance did not benefit from the system‑level DNS cache (nscd), leading to an abnormally high DNS query rate and request timeouts.
Background
In a real‑world scenario, Business A's EC2 instance pushes data to Business B's EC2 via a domain name that resolves to a load balancer. During peak traffic, the Golang client occasionally reports request timeouts.
Investigation Scope
Requests from Business A to Business B timed out before reaching the load balancer, indicating the problem originated on Business A's EC2 server.
Abnormal Metrics
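Server-side CPU, memory, and network bandwidth appeared normal, but the linklocal_allowance_exceeded metric kept climbing. This counter records packets dropped because traffic to link-local services, such as instance metadata requests and DNS resolution, exceeded the per-network-interface limit (metadata requests + DNS queries > 1024 per second).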
Root Cause Confirmation
Network captures showed that although the server's nscd service was caching DNS results, the Golang process bypassed the system resolver and sent its own DNS queries, which nothing cached or limited. Further captures showed DNS queries exceeding 1024 per second, confirming DNS rate throttling as the root cause.
Problem Analysis
Two issues were identified:
Why the nscd DNS cache was ineffective.
Why Golang’s default connection reuse was not applied.
Why DNS Cache Was Ineffective
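On Linux, Go's net package normally resolves names with its own pure-Go resolver: it reads /etc/resolv.conf directly and sends queries straight to the configured nameservers instead of calling the C library's resolver functions. nscd only caches lookups that go through the C library, so its cache is never consulted.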
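The behaviour described above can be observed with a small sketch. It assumes a Unix-like host whose resolver configuration includes DNS; the helper countResolverDials and the hostname service-b.example.com are hypothetical and not taken from the original service. Setting PreferGo selects Go's built-in resolver explicitly, and the Dial hook counts (and refuses) every connection that resolver tries to open to a nameserver, so the sketch sends no real DNS traffic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

// countResolverDials resolves host with Go's built-in resolver and returns
// how many times that resolver tried to open its own connection to a
// nameserver. The Dial hook refuses every attempt, so no packet leaves the
// machine and the lookup itself is expected to fail.
func countResolverDials(host string) int64 {
	var dials int64
	r := &net.Resolver{
		PreferGo: true, // use the pure-Go resolver rather than the C library
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			// A and AAAA queries may run concurrently, hence the atomic add.
			atomic.AddInt64(&dials, 1)
			return nil, errors.New("dial refused by sketch")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The error is ignored on purpose: only the dial count matters here.
	_, _ = r.LookupHost(ctx, host)
	return atomic.LoadInt64(&dials)
}

func main() {
	// Hypothetical hostname, assumed not to be listed in /etc/hosts.
	n := countResolverDials("service-b.example.com")
	fmt.Println("nameserver dial attempts made by the Go resolver:", n)
}
```

A non-zero count suggests that the process contacts the nameserver itself for each lookup, which is consistent with a C-library cache such as nscd never seeing those queries. The exact count depends on the host's search domains and retry settings.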
Why Connection Reuse Was Disabled
The Go HTTP transport's DisableKeepAlives field (default false) controls connection reuse. The business code set DisableKeepAlives=true, which disabled reuse, so every request opened a new connection and triggered a fresh DNS lookup.
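The effect of this field can be illustrated with a minimal sketch, which is not the original business code. It assumes a throwaway local httptest server in place of Business B's endpoint, and the helper name countNewConns is hypothetical. It uses httptrace to count how many requests had to open a new connection rather than reuse an idle one.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// countNewConns sends n sequential GET requests to a local test server and
// returns how many of them had to open a new TCP connection.
func countNewConns(disableKeepAlives bool, n int) int {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	tr := &http.Transport{DisableKeepAlives: disableKeepAlives}
	defer tr.CloseIdleConnections()
	client := &http.Client{Transport: tr}

	newConns := 0
	trace := &httptrace.ClientTrace{
		// GotConn fires once per request; Reused is false for a fresh connection.
		GotConn: func(info httptrace.GotConnInfo) {
			if !info.Reused {
				newConns++
			}
		},
	}

	for i := 0; i < n; i++ {
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		// Drain and close the body so the connection can return to the idle pool.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return newConns
}

func main() {
	fmt.Println("keep-alives disabled, new connections:", countNewConns(true, 5))
	fmt.Println("keep-alives enabled, new connections:", countNewConns(false, 5))
}
```

With keep-alives disabled, each of the five requests opens its own connection; with them enabled, a single connection serves all five. Against a real hostname, every new connection also requires a name lookup, which is why the flag multiplied the DNS query rate.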
Solution
Two possible fixes were considered:
Force Golang to use the system DNS resolver by setting the environment variable GODEBUG=netdns=cgo (judged risky: it changes resolver behaviour for the whole process and only takes effect in a cgo-enabled build).
Enable connection reuse by setting DisableKeepAlives=false or removing the flag (preferred).
Effect
After enabling connection reuse, DNS query volume dropped by about 90%, the linklocal_allowance_exceeded metric stayed within limits, and request timeouts disappeared.
Appendix
References:
Monitoring linklocal_allowance_exceeded: https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html
AWS DNS rate‑limit documentation: https://docs.aws.amazon.com/zh_cn/vpc/latest/userguide/AmazonDNS-concepts.html
Golang DNS client implementation: https://go.dev/src/net/dnsclient_unix.go
Golang DisableKeepAlives parameter: https://pkg.go.dev/net/http#Transport.DisableKeepAlives
37 Interactive Technology Team
37 Interactive Technology Center