Why Your Domain Suddenly Fails to Resolve: A Practical DNS Troubleshooting Guide
This guide walks you through a systematic, multi‑stage process for diagnosing and fixing DNS resolution failures, covering symptom identification, tool preparation, local resolver checks, authoritative server analysis, common root causes, advanced diagnostics, and post‑fix validation with concrete commands and examples.
Background and Symptoms
DNS (Domain Name System) is a core Internet infrastructure, and domain resolution failures are encountered daily by operations engineers. The symptoms are diverse and often misleading:
Users report that the website cannot be opened, yet the server answers ping by IP – the host is healthy and name resolution (or a DNS record) is the likely fault.
Browser shows DNS_PROBE_FINISHED_NXDOMAIN but curl to the IP succeeds – the resolver is returning NXDOMAIN for the name.
Mobile app works while PC reports network errors – indicates DNS cache or resolver inconsistency.
Some regions succeed while others timeout – points to incomplete DNS propagation or CDN scheduling.
After a service migration the new IP is active, yet some users still resolve the old IP – classic DNS cache problem.
Email sending fails with "Connection timed out" or "Host not found" – MX or A records are missing.
Micro‑services suddenly get Unknown host errors – internal DNS service failure.
Ping to a domain alternates between different IPs – CDN smart routing or DNS hijacking.
Core difficulty: DNS is a hierarchical cache and distributed system; the root cause may lie in the local resolver, recursive DNS, authoritative DNS, registrar, CDN, or any intermediate step. Because of caching, the same misconfiguration can manifest differently across locations and times.
The article’s goal is to build a systematic DNS troubleshooting path covering root‑cause identification, temporary fixes, verification methods, and preventive measures.
Tool Preparation
dig – detailed DNS queries (recommended; part of bind-utils on RHEL/CentOS, dnsutils on Debian/Ubuntu).
nslookup – simple DNS queries (built‑in on Windows; on Linux shipped in the same package as dig).
host – simplified DNS queries (same package as dig).
resolvectl – query and control systemd‑resolved.
/etc/resolv.conf – view local resolver configuration.
resolvectl statistics – DNS cache statistics (systemd-resolve --statistics on older systemd releases).
rndc – control the BIND/named DNS service.
tcpdump – capture DNS traffic.
whois – query domain registration information.
which dig nslookup host resolvectl
dig -v
nslookup -version
Phase 1 – Confirm It Is a DNS Problem
1.1 Use ping and nslookup for a quick distinction
DNS problems affect only name resolution; direct IP connections should still work.
# Ping the domain name first
ping -c 4 www.example.com
# If ping fails, try pinging the resolved IP directly
nslookup www.example.com # assume it returns 93.184.216.34
ping -c 4 93.184.216.34
# Interpretation
# • IP reachable but domain name not → DNS issue.
# • Both unreachable → possible network or server problem.
Note: ping itself performs a DNS lookup. If it reports "ping: unknown host" (or "Name or service not known" on newer systems), local resolution failed completely. If it shows an IP but the service port is blocked, DNS works and the issue lies at the network or application layer.
1.2 Classify common DNS failure types
Both nslookup and dig reveal the failure category directly:
nslookup www.example.com 8.8.8.8
# Type 1: SERVFAIL (shown as "Server failed" by some nslookup versions) – the resolver could not obtain a valid answer
# Type 2: NXDOMAIN – the name does not exist
# Type 3: Connection timed out – DNS server unreachable
# Type 4: NOERROR with empty Answer – the name exists but has no record of the queried type
Quick reference:
SERVFAIL – the resolver could not obtain a valid answer (DNSSEC validation failure, authoritative NS down or misconfigured).
NXDOMAIN – the name does not exist (expired or unregistered domain, typo, record never created or not yet delegated).
REFUSED – query rejected (ACL, recursion policy, or firewall).
TIMEOUT – no response (network path or server down).
NOERROR with empty Answer – no record of the queried type (e.g., querying AAAA when only A exists).
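The quick reference above can be turned into a small triage helper. The following is a minimal sketch, not a definitive tool: the function name classify_dig_status is hypothetical, and it assumes the standard dig header layout (a "status:" field in the HEADER line and an "ANSWER:" count in the flags line).

```shell
# classify_dig_status (hypothetical helper): read full `dig` output on stdin
# and print the response code with a first place to look.
classify_dig_status() {
    local out status answers
    out=$(cat)
    # Header line: ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1234"
    status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    # Flags line: ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ..."
    answers=$(printf '%s\n' "$out" | sed -n 's/.*ANSWER: \([0-9]*\),.*/\1/p' | head -n 1)

    case "$status" in
        NOERROR)
            if [ "${answers:-0}" -eq 0 ]; then
                echo "NODATA: name exists, but has no record of the queried type"
            else
                echo "NOERROR: answer returned"
            fi ;;
        NXDOMAIN) echo "NXDOMAIN: check registration, delegation and record spelling" ;;
        SERVFAIL) echo "SERVFAIL: check DNSSEC and authoritative server reachability" ;;
        REFUSED)  echo "REFUSED: check ACLs and recursion policy" ;;
        "")       echo "TIMEOUT: no header parsed, check network path and port 53" ;;
        *)        echo "$status: uncommon response code" ;;
    esac
}

# Usage:
#   dig @8.8.8.8 www.example.com | classify_dig_status
```

Reading from stdin keeps the helper independent of which server or record type was queried.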
Phase 2 – Local DNS Resolver Investigation
2.1 Examine /etc/resolv.conf
# Typical output
; generated by /usr/sbin/resolvconf, ifupdown
nameserver 10.0.0.2
nameserver 8.8.8.8
search example.com
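options timeout:2 attempts:3 rotate
Key points:
nameserver – servers are tried in the listed order; the next one is used only when the previous one times out.
search – domain suffixes appended when a short (unqualified) hostname is queried.
options timeout – seconds to wait per query before trying the next server (default 5 s).
options attempts – number of passes over the server list before giving up (default 2).
options rotate – round‑robin across the listed nameservers instead of always starting with the first.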
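To see the nameserver order the resolver will actually use, a short helper can read the file directly. This is a sketch with hypothetical function names; the one assumption beyond the text above is that 127.0.0.53 is the systemd-resolved stub listener, in which case the real upstream servers are shown by resolvectl status rather than by this file.

```shell
# list_nameservers [FILE]: print nameserver addresses in the order they are
# listed. FILE defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_local_stub [FILE]: succeed when the file points at the
# systemd-resolved stub (127.0.0.53) instead of a real upstream server.
uses_local_stub() {
    list_nameservers "$1" | grep -qx '127\.0\.0\.53'
}

# Usage:
#   list_nameservers
#   uses_local_stub && echo "upstream servers are managed by systemd-resolved"
```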
Note: /etc/resolv.conf is often managed by NetworkManager or dhclient. Direct edits may be overwritten on network restart. To make permanent changes, adjust the network interface configuration (e.g., PEERDNS=no on RHEL/CentOS or edit /etc/systemd/resolved.conf on Ubuntu).
# RHEL/CentOS – edit /etc/sysconfig/network-scripts/ifcfg-eth0
PEERDNS=no
DNS1=8.8.8.8
DNS2=1.1.1.1
# Debian/Ubuntu – edit /etc/network/interfaces or /etc/systemd/resolved.conf
# Then restart the resolver service (e.g., sudo systemctl restart systemd-resolved)
2.2 Check systemd‑resolved status (Ubuntu 18.04+, CentOS 8+)
# Show resolver status
resolvectl status
# Sample output (truncated)
Global
LLMNR setting: yes
MulticastDNS setting: yes
DNSSEC setting: allow-downgrade
DNSSEC supported: yes
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 8.8.8.8 1.1.1.1
Link ens33
Current Scopes: DNS LLMNR/IPv4
DNSSEC supported: yes
DNS Servers: 10.0.0.2 8.8.8.8
DNS Domain: example.com
Cache statistics help identify whether the resolver is hitting the cache or performing external queries:
# Cache statistics
resolvectl statistics
# Sample output
Cache: 500 hits, 1200 misses
Cache size: 10000 entries (max 10000)
Current cache size: 1234 entries
Negative cache size: 234 entries
# A high failure count in the statistics indicates many failed DNS queries.
2.3 Test a specific public DNS server
# Query Google DNS directly
dig @8.8.8.8 www.example.com +short
# Query Cloudflare DNS
dig @1.1.1.1 www.example.com +short
# Query Alibaba DNS (common in China)
dig @223.5.5.5 www.example.com +short
# Query specific record types
dig @8.8.8.8 example.com MX +short
dig @8.8.8.8 example.com NS +short
dig @8.8.8.8 example.com TXT +short
Decision rule:
If the public DNS returns the correct answer but the local resolver does not, the problem is in the local resolver configuration or network path.
If both fail, the issue lies with the authoritative DNS or the network itself.
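The decision rule can be checked mechanically by comparing two answers as sets, since round‑robin records often come back in a different order on each query. A minimal sketch follows; same_answer_set is a hypothetical name, and 10.0.0.2 is the example local resolver used in this guide.

```shell
# same_answer_set A B: succeed when two `dig +short` outputs contain the
# same set of records, ignoring order and duplicates.
same_answer_set() {
    local a b
    a=$(printf '%s\n' "$1" | sort -u)
    b=$(printf '%s\n' "$2" | sort -u)
    [ "$a" = "$b" ]
}

# Usage:
#   local_ans=$(dig +short @10.0.0.2 www.example.com)
#   public_ans=$(dig +short @8.8.8.8 www.example.com)
#   same_answer_set "$local_ans" "$public_ans" || echo "local and public answers differ"
```

A difference is only a hint: names served by a CDN can legitimately return different IPs to different resolvers.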
2.4 Trace the full recursive resolution path
# Show each step from root to authoritative server
dig @8.8.8.8 www.example.com +trace
# Sample output (truncated)
. 518400 IN NS a.root-servers.net.
. 518400 IN NS b.root-servers.net.
;; Received 528 bytes
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
;; Received 548 bytes
example.com. 172800 IN NS a.iana-servers.net.
example.com. 172800 IN NS b.iana-servers.net.
;; Received 89 bytes
www.example.com. 300 IN A 93.184.216.34
# If the trace stalls at the TLD step, the delegation is the problem: check the NS records (and glue) registered for the domain at the registrar.
Phase 3 – Authoritative DNS Server Investigation
3.1 Verify NS records
# List NS records for the domain
dig example.com NS
# Verify the A records of the NS servers themselves
dig @8.8.8.8 ns1.example.com A
dig @8.8.8.8 ns2.example.com A
# Cross‑check with registrar information
whois example.com | grep -E "Name Server|NS-Name"
Common mistakes:
Registrar lists ns1.example.com but the NS server’s A record is missing or points to the wrong IP.
After a domain migration, the NS records were updated but the old NS servers are still live, causing mixed answers.
3.2 Check domain expiration or hold status
# If dig returns NXDOMAIN but you know the domain is valid:
whois example.com | grep -E "Domain Status|Expir|Domain Expir|Registrar|Registrant"
# Typical hold statuses
# clientHold – the registrar has suspended the domain
# serverHold – the registry has suspended the domain
# redemptionPeriod – expired and deleted, but still restorable (typically 30 days)
# inactive – no name servers are delegated for the domain
# pendingDelete – will be deleted soon (usually 5 days)
If the status is clientHold or serverHold, the domain must be re‑activated through the registrar; DNS configuration is not the cause.
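Scanning whois output for these statuses is easy to script. The sketch below uses a hypothetical function name and assumes the registry prints EPP statuses on lines containing "Status:", which is common for gTLDs but not universal.

```shell
# find_blocking_status (hypothetical helper): read `whois` output on stdin,
# print each status that stops the domain from resolving, and return 1 if
# none is present.
find_blocking_status() {
    local hits
    hits=$(grep -i 'status:' \
        | grep -ioE 'clientHold|serverHold|redemptionPeriod|pendingDelete|inactive' \
        | sort -u)
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
    else
        return 1
    fi
}

# Usage:
#   whois example.com | find_blocking_status && echo "fix the registration first"
```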
3.3 Diagnose DNSSEC misconfiguration
DNSSEC validation failures produce SERVFAIL.
# Ask the validating resolver to skip DNSSEC validation (+cd = checking disabled)
dig @8.8.8.8 www.example.com +cd
# If this succeeds while the plain query returns SERVFAIL, DNSSEC signatures are the problem.
# Inspect DS and DNSKEY records
dig @8.8.8.8 example.com DS +short
dig @8.8.8.8 example.com DNSKEY +short
Typical causes:
The DS record published at the registrar does not match any DNSKEY in the zone. (A missing DS record alone makes resolvers treat the zone as unsigned; it does not cause SERVFAIL.)
Signing was disabled on the authoritative server while a DS record is still published in the parent zone.
KSK rollover without proper key transition.
Fix: either correct the DS/DNSKEY chain at the registrar or temporarily disable DNSSEC (remove the DS record at the registrar) until the chain is restored.
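One way to confirm that DNSSEC is the cause is to compare the response code of a normal query with one sent with dig's +cd flag (checking disabled), which asks a validating resolver to skip validation. The sketch below uses hypothetical helper names and only interprets two status strings.

```shell
# rcode_of: extract the status field from `dig` output on stdin.
rcode_of() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# dnssec_verdict VALIDATED_RCODE CD_RCODE: interpret the pair of results.
dnssec_verdict() {
    if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
        echo "dnssec: fails only when validation is on, inspect DS/DNSKEY/RRSIG"
    elif [ "$1" = "SERVFAIL" ]; then
        echo "upstream: fails even with validation off, check the authoritative servers"
    else
        echo "ok: no validation failure observed"
    fi
}

# Usage:
#   v=$(dig @8.8.8.8 www.example.com | rcode_of)
#   c=$(dig @8.8.8.8 www.example.com +cd | rcode_of)
#   dnssec_verdict "$v" "$c"
```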
3.4 Review SOA record and refresh timers
# Retrieve the SOA record
dig example.com SOA
# Important fields:
# serial – version number, used to detect updates
# refresh – how often secondary servers query the primary (seconds)
# retry – retry interval after a failed refresh
# expire – how long a secondary keeps data without a successful refresh
# minimum – negative‑caching TTL: how long resolvers may cache NXDOMAIN/no‑data answers (RFC 2308)
# Example SOA
example.com. 86400 IN SOA ns1.example.com. admin.example.com. (
2026052501 ; Serial (YYYYMMDDNN)
7200 ; Refresh (2 h)
3600 ; Retry (1 h)
1209600 ; Expire (14 d)
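86400 ) ; Minimum / negative‑caching TTL (1 d)
If the serial does not increase after a change, secondary servers will keep serving stale records. Note that minimum does not override record‑level TTLs: it controls how long resolvers cache negative answers, so a high value delays the visibility of a newly created record that was queried before it existed.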
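Comparing the serial on every authoritative server is a quick way to spot a secondary that has not picked up a change. A minimal sketch, with hypothetical helper names, that parses the single‑line form printed by dig +short SOA:

```shell
# soa_serial: read `dig +short SOA` output on stdin and print the serial,
# which is the third of the seven fields.
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# serials_in_sync: read one serial per line on stdin; succeed only when at
# least one serial was read and all of them are identical.
serials_in_sync() {
    [ "$(sort -u | grep -c .)" -eq 1 ]
}

# Usage:
#   for ns in $(dig +short example.com NS); do
#       dig +short @"$ns" example.com SOA | soa_serial
#   done | serials_in_sync || echo "authoritative servers disagree on the serial"
```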
3.5 Check DNS propagation status
# Query several public resolvers and compare results
for ns in 8.8.8.8 1.1.1.1 223.5.5.5 114.114.114.114; do
echo -n "DNS $ns: ";
dig +short @$ns www.example.com;
done
# Use online tools such as https://www.whatsmydns.net/#A/www.example.com
# If results differ, propagation is still in progress (the wait is bounded by the old record's TTL, often under 10 min with a short TTL; a DNSSEC DS change may need 24‑48 h).
Phase 4 – Common Root Causes and Remediation
Root Cause 1: Local DNS cache serving stale IP
Typical scenario: After moving a service to a new IP, some users still receive the old IP and see 502/504 errors.
# Compare authoritative answer with local resolver
dig @ns1.example.com www.example.com +short # authoritative
dig @10.0.0.2 www.example.com +short # local resolver
dig @8.8.8.8 www.example.com +short # public DNS
# If the local answer differs, the local cache is stale.
Fixes:
Clear the resolver cache (e.g., sudo resolvectl flush-caches, or sudo systemd-resolve --flush-caches on older releases; sudo rndc flush for BIND).
Lower the domain TTL to 300 s (5 min) at least 48 h before migration, then raise it back after the change.
Perform an IP‑level smooth migration (keep old and new A records simultaneously, then drop the old one after verification).
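The lead time for lowering the TTL follows from simple arithmetic: a cache that fetched the record just before the TTL was lowered may keep it for the full old TTL. The sketch below uses hypothetical function names and Unix timestamps, and assumes resolvers honor TTLs.

```shell
# latest_ttl_change CUTOVER_EPOCH OLD_TTL: latest time at which the TTL can
# be lowered so that every cache holds the short TTL by cutover.
latest_ttl_change() {
    echo $(( $1 - $2 ))
}

# old_ip_gone_by CHANGE_EPOCH TTL_AT_CHANGE: time after which no
# TTL-honoring cache should still return the old IP.
old_ip_gone_by() {
    echo $(( $1 + $2 ))
}

# Usage: with an old TTL of 86400 s, the TTL must be lowered at least one
# day before cutover; with TTL = 300 s at cutover, old answers should be
# gone five minutes after the record changes.
#   latest_ttl_change "$(date +%s)" 86400
```

The 48 h recommendation therefore covers any previous TTL of up to two days.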
Root Cause 2: DNS record configuration errors
Typical mistakes include:
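A record points to a private IP (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) – external users cannot reach the service.
Mixing CNAME and A records incorrectly – a CNAME cannot coexist with other records at the same name, so the zone apex (which must carry SOA and NS) cannot be a CNAME (RFC 1034, RFC 1912).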
MX records pointing to an IP instead of a hostname.
TTL set to 0 – some resolvers mishandle zero TTL, causing stale caches.
Uneven multiple A records – load‑balancing expectations broken.
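The private‑address mistake in the list above is easy to catch automatically. A minimal sketch with a hypothetical function name, covering the three RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16):

```shell
# is_private_ipv4 ADDR: succeed when ADDR falls in an RFC 1918 range.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   for ip in $(dig +short @8.8.8.8 www.example.com A); do
#       is_private_ipv4 "$ip" && echo "public record points at private IP $ip"
#   done
```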
# Example of a correct apex record
example.com. 300 IN A 93.184.216.34
# Correct CNAME for a sub‑domain
www.example.com. 300 IN CNAME example.com.
# Proper MX configuration
example.com. 300 IN MX 10 mail.example.com.
mail.example.com. 300 IN A 93.184.216.34
Root Cause 3: DNS propagation delay
After updating records, some resolvers still return the old data.
# Verify each NS server returns the same answer
for ns in $(dig example.com NS +short); do
echo "NS $ns:";
dig @$ns www.example.com +short;
done
# If any NS returns a different IP, the change has not reached every authoritative server yet.
Remediation steps:
Confirm the SOA serial has incremented.
Ensure the DNS provider’s console shows "configuration effective".
If DNSSEC is used, verify the DS record is correctly published.
If after 48 h some users still see old data, contact the registrar.
Root Cause 4: NS server unavailability
Typical symptoms: registrar lists NS servers that are unreachable or mis‑configured.
# List NS records
dig example.com NS
# Test each NS for reachability (a UDP probe with nc -u is unreliable, so send a real query)
for ns in $(dig example.com NS +short); do
echo -n "Testing $ns (port 53): ";
dig @"$ns" example.com SOA +norec +short +time=2 +tries=1 | head -1;
done
# Verify authoritative answers
dig @ns1.example.com www.example.com +short
dig @ns2.example.com www.example.com +short
Common issues:
NS hostname has no A record (the server cannot be reached).
Firewall blocks UDP/TCP 53.
NS server overloaded (e.g., during DDoS).
Fixes include correcting the NS A records at the registrar, ensuring port 53 is open, or switching to a reliable third‑party DNS provider.
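Because UDP is connectionless, a port probe cannot reliably tell an open port from a filtered one; sending a real query over each transport is more dependable. The sketch below uses hypothetical function names and treats any dig output line that does not start with ";" as a real answer, which is an assumption about how dig reports timeouts.

```shell
# answers_query NS ZONE FLAG: succeed when the server returns at least one
# line that is not a dig diagnostic (diagnostics start with ';').
answers_query() {
    dig @"$1" "$2" SOA +norec +short "$3" +time=2 +tries=1 2>/dev/null \
        | grep -q '^[^;]'
}

# probe_ns NS ZONE: query the zone's SOA over UDP, then TCP, and print one
# summary line per server.
probe_ns() {
    local udp tcp
    udp=fail
    tcp=fail
    if answers_query "$1" "$2" +notcp; then udp=ok; fi
    if answers_query "$1" "$2" +tcp; then tcp=ok; fi
    echo "$1 udp=$udp tcp=$tcp"
}

# Usage:
#   for ns in $(dig +short example.com NS); do probe_ns "$ns" example.com; done
```

A line such as "udp=fail tcp=ok" points at UDP 53 filtering on the path, while "udp=fail tcp=fail" suggests the server itself is down or unreachable.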
Root Cause 5: Network‑level blocking of DNS (UDP 53)
Symptoms: DNS works on some networks but fails behind a corporate firewall, VPN, or container network.
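# Test UDP with a real query (nc -vzu cannot reliably detect a blocked UDP port)
dig @8.8.8.8 www.example.com +notcp +time=2 +tries=1
# If UDP fails, try TCP (DNS servers must also answer over TCP)
dig @8.8.8.8 +tcp www.example.com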
# Capture traffic to see if packets are dropped
sudo tcpdump -i eth0 -n port 53 -c 20
# Trigger a query while capturing
dig @8.8.8.8 www.example.com
If UDP is blocked but TCP works, a firewall on the path is filtering UDP 53. Consider using DoH or DoT as a bypass:
# Cloudflare DoH via dig (requires dig from BIND 9.18 or later)
dig @1.1.1.1 www.example.com +https
# Cloudflare DoT (requires systemd‑resolved configuration)
# Edit /etc/systemd/resolved.conf, under the [Resolve] section
DNS=1.1.1.1
DNSOverTLS=yes
sudo systemctl restart systemd-resolved
Root Cause 6: Search‑engine cache serving old IP
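Even when DNS is correct, search‑engine crawlers may keep fetching the site from the old IP for a while, because they cache resolution results on their own schedule.
ping shows the old IP while dig returns the new IP. (ping goes through the system resolver, including /etc/hosts and local caches, whereas dig queries DNS directly, so rule out a local override first.)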
Test from multiple regions; some see the new IP, others the old.
Resolution:
Submit a cache‑purge request in Baidu Search Console.
Submit a removal request in Google Search Console.
Wait 24‑72 h for the engine to recrawl.
Phase 5 – Advanced Diagnostic Tools
5.1 Capture DNS traffic with tcpdump
# Capture DNS packets (UDP 53) to a file
sudo tcpdump -i eth0 -n port 53 -w /tmp/dns_capture.pcap
# In another terminal, trigger a query
dig @8.8.8.8 www.example.com
# Stop capture (Ctrl+C) and analyse
tcpdump -r /tmp/dns_capture.pcap -n -v | head -50
# Or use tshark for a structured view
sudo tshark -i eth0 -f "port 53" -Y "dns.flags.response == 1" -T fields \
-e dns.qry.name -e dns.a -e dns.flags.rcode
Key things to look for:
Whether the request packet is sent (correct source IP).
Whether a response packet arrives (correct destination IP).
Response code (0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN).
Retransmissions – indicate network loss or slow server.
Round‑trip time – >1 s suggests a slow resolver.
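When a capture contains many responses, counting them by response code shows at a glance which failure dominates. The helper names below are hypothetical; the numeric codes are the standard DNS RCODE values.

```shell
# rcode_name N: map a numeric DNS response code to its name.
rcode_name() {
    case "$1" in
        0) echo NOERROR ;;
        1) echo FORMERR ;;
        2) echo SERVFAIL ;;
        3) echo NXDOMAIN ;;
        4) echo NOTIMP ;;
        5) echo REFUSED ;;
        *) echo "RCODE$1" ;;
    esac
}

# summarize_rcodes: read one numeric rcode per line on stdin and print
# "NAME COUNT" per code, in ascending code order.
summarize_rcodes() {
    grep -E '^[0-9]+$' | sort -n | uniq -c | while read -r count code; do
        printf '%s %s\n' "$(rcode_name "$code")" "$count"
    done
}

# Usage:
#   tshark -r /tmp/dns_capture.pcap -Y "dns.flags.response == 1" \
#       -T fields -e dns.flags.rcode | summarize_rcodes
```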
5.2 Inspect DNS server logs (BIND example)
# View recent BIND logs
sudo journalctl -u named -n 50 --no-pager
# Enable query logging if not already on
sudo rndc querylog on
# Find where query logs are written (look for the queries category in the logging block; often /var/log/named/query.log)
grep -n -A3 "category queries" /etc/named.conf
sudo tail -f /var/log/named/query.log
# Typical error messages:
# "too many queries" – rate limit exceeded; review the rate‑limit settings (max‑cache‑size controls cache memory, not query rate).
# "network unreachable" – recursive lookup cannot reach upstream.
# "SERVFAIL" – DNSSEC validation failure or upstream NS not responding.
5.3 Use online DNS probing services
DNSPod diagnostic: https://www.dnspod.cn/services/dns%20%E6%A3%80%E6%B5%8B
Alibaba Cloud DNS check: https://www.alidns.com/diagnose
Global propagation checker: https://www.whatsmydns.net/#A/www.example.com
Programmatic check with curl:
curl -s "https://www.dnschecker.com/?query=www.example.com&rtype=A" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
Phase 6 – Post‑Fix Validation
After applying a fix, verify the resolution from multiple angles:
# 1. Authoritative DNS (bypass cache)
dig @ns1.example.com www.example.com +short
# 2. Public resolvers
dig @8.8.8.8 www.example.com +short
dig @1.1.1.1 www.example.com +short
dig @223.5.5.5 www.example.com +short
# 3. Local resolver (if you have one)
dig @10.0.0.2 www.example.com +short
# 4. Cross‑tool verification
nslookup www.example.com
host www.example.com
resolvectl query www.example.com
# 5. Ensure all NS servers agree
for ns in $(dig example.com NS +short); do
echo "NS $ns:";
dig @$ns www.example.com +short;
done
# 6. If the service is HTTP, check accessibility
curl -I https://www.example.com \
-w "
HTTP Code: %{http_code}
DNS Lookup: %{time_namelookup}s
Total: %{time_total}s
" \
--connect-timeout 5 --max-time 10
# 7. For mail services, verify MX and SMTP connectivity
dig @8.8.8.8 example.com MX +short
nc -vz mail.example.com 25
# 8. If DNSSEC is enabled, validate the chain
dig @8.8.8.8 www.example.com +dnssec +short
Production DNS Operational Guidelines
Preparation before any DNS change
Back up the current configuration: screenshot all records in the registrar console or export the zone file (for BIND, copy the zone file and validate the copy with named-checkzone; sudo rndc zonestatus example.com shows the currently loaded serial).
Choose a low‑traffic window: typically late night or early morning; DNS changes may take hours to propagate globally.
Define a rollback plan: keep the original zone file and a one‑click rollback script.
Notify stakeholders: DNS changes can affect multiple services.
Risk‑control measures
Add a test record first: create test.example.com, verify it resolves, then modify the production record.
Add‑then‑remove principle: add the new IP while keeping the old one, verify traffic, then delete the old IP.
TTL strategy:
Normal operation: TTL = 3600 s (1 h).
Pre‑migration (48 h window): TTL = 300 s (5 min) to accelerate cache expiry.
Post‑migration: raise TTL back to 3600 s.
Prohibited actions in production DNS
Deleting NS records without fully understanding impact.
Setting TTL to 0 unless you are certain all resolvers handle it correctly.
Having different answers on different NS servers – they must stay consistent.
Overwriting live zone files with a text editor instead of applying changes through rndc (validate, then reload) or the provider's API.
Bulk modifying many records at once; change in batches and verify each batch.
Failure‑Postmortem Template
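[Postmortem] DNS Resolution Failure
Affected domain:
Time detected:
Impact (which users, which services):
Duration:
Symptoms:
- dig/nslookup results:
- HTTP access results:
- Comparison of answers from different DNS servers:
Investigation:
1. [command] Confirm the failure type: [NXDOMAIN / SERVFAIL / timeout / inconsistent IPs]
2. [command] Rule out local problems: [result, e.g., query via @8.8.8.8 succeeds]
3. [command] Check authoritative DNS: [result]
4. [command] Check whois domain status: [status]
5. [command] Check DNS propagation: [comparison of answers from each NS]
6. [command] Check DNSSEC: [result]
Root cause: [details]
Remediation:
1. [specific action]
2. [specific action]
Verification:
- Authoritative DNS:
- Public DNS:
- HTTP access:
Preventive measures:
1. Routine DNS health checks (monitor NS availability and SOA serial changes)
2. Domain expiry reminders
3. DNSSEC expiry reminders
4. ...
CDN‑Specific DNS Issues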
When a CDN is in use, DNS does more than map a name to an IP – it also performs geo‑based scheduling. Problems can arise from mis‑configured CNAMEs or origin settings.
CDN scheduling failure causing slow access
Typical symptom: users experience very slow page loads, yet dig returns a valid IP. The returned IP may belong to a distant CDN node or an unavailable edge.
# Compare resolution from different locations
dig @ns.bjtelecom.net www.example.com +short # Beijing DNS
dig @ns.shtelecom.com www.example.com +short # Shanghai DNS
dig @8.8.8.8 www.example.com +short # Global DNS
# For Cloudflare, edge IPs commonly fall in ranges such as 104.16.0.0/13 and 172.64.0.0/13 (see Cloudflare's published IP list).
# For Alibaba CDN, IPs belong to Alibaba public ranges.
# Verify the CNAME points to the CDN‑provided domain
dig www.example.com CNAME +short
# Example of a correct CDN CNAME
# www.example.com CNAME www.example.com.w.kunluncan.com.
Fixes include correcting the CNAME to the CDN‑provided hostname and ensuring the origin IP/hostname is reachable.
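Checking that the CNAME target really ends in the CDN's domain can be scripted. A minimal sketch with a hypothetical function name; the suffix to expect comes from your CDN console, and kunluncan.com is used here only because it appears in the example above.

```shell
# cname_has_suffix TARGET SUFFIX: succeed when TARGET equals SUFFIX or is a
# subdomain of it. Case and a trailing root dot are ignored.
cname_has_suffix() {
    local target suffix
    target=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/\.$//')
    suffix=$(printf '%s\n' "$2" | tr 'A-Z' 'a-z' | sed 's/\.$//')
    case "$target" in
        "$suffix"|*."$suffix") return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   target=$(dig +short www.example.com CNAME)
#   cname_has_suffix "$target" kunluncan.com || echo "CNAME does not point at the CDN"
```

Matching on a leading dot avoids accepting look‑alike domains that merely end with the same characters.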
DNSSEC deep‑dive troubleshooting
DNSSEC ensures the authenticity of DNS data. Misconfiguration leads to SERVFAIL on validating resolvers. A step‑by‑step verification:
# 1. Does the domain have DNSSEC?
dig @8.8.8.8 example.com DNSKEY +short
# 2. Retrieve DS record from the parent zone
dig @8.8.8.8 example.com DS +short
# 3. Verify RRSIG signatures
dig @8.8.8.8 +dnssec www.example.com A
# 4. Use drill or delv for full validation
drill -D www.example.com
delv @8.8.8.8 www.example.com A
# 5. Check that the resolver trusts the root DNSKEY (usually pre‑installed).
# For custom resolvers, ensure trust anchors are configured (trust-anchors in BIND 9.16+, trusted-keys or managed-keys in older releases).
grep -rE "trust-anchors|trusted-keys|managed-keys" /etc/named.conf
If DNSSEC is the culprit, the quickest mitigation is to disable DNSSEC at the registrar (remove the DS record), verify service restoration, then correctly re‑publish the DS record and DNSKEY chain.
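After the chain is repaired, a validating resolver signals success by setting the AD (authenticated data) flag in its response header. The sketch below uses a hypothetical function name and assumes the usual ";; flags:" header line in dig output.

```shell
# has_ad_flag: read `dig` output on stdin and succeed when the header flags
# include "ad", meaning the resolver validated the answer.
has_ad_flag() {
    sed -n 's/^;; flags: \([a-z ]*\);.*/\1/p' \
        | head -n 1 | tr ' ' '\n' | grep -qx 'ad'
}

# Usage:
#   dig @8.8.8.8 www.example.com A +dnssec | has_ad_flag \
#       && echo "answer validated" || echo "no AD flag"
```

A missing AD flag is not proof of a broken chain: unsigned zones and non‑validating resolvers also answer without it, so read it together with the response code.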
Summary of the DNS Troubleshooting Path
Ping/IP direct test → dig/nslookup (type, error code, answer) →
Query different DNS servers (local vs public) → dig +trace (full recursion) →
whois (domain hold, expiration) → tcpdump (packet flow) →
Root‑cause remediation (cache, config error, propagation, network block) →
Multi‑point verification (authoritative, public, local, multiple tools)
Key DNS failure types and quick checks:
NXDOMAIN – check registrar status with whois.
SERVFAIL – query with validation disabled (dig +cd) to test for a DNSSEC failure, then inspect server logs.
REFUSED – verify ACLs and firewall rules.
TIMEOUT – test UDP and TCP connectivity with real queries (dig +notcp, then dig +tcp).
IP mismatch across regions – investigate propagation and NS consistency.
Partial user impact – use global probing tools to locate regional DNS or CDN issues.
By following a layered approach—local resolver, recursive path, authoritative servers, registrar, DNSSEC, and network—you can pinpoint most DNS problems within 5–15 minutes.
