
Why Your Domain Suddenly Fails to Resolve: A Practical DNS Troubleshooting Guide

This guide walks you through a systematic, multi‑stage process for diagnosing and fixing DNS resolution failures, covering symptom identification, tool preparation, local resolver checks, authoritative server analysis, common root causes, advanced diagnostics, and post‑fix validation with concrete commands and examples.

MaGe Linux Operations

Background and Symptoms

DNS (Domain Name System) is a core Internet infrastructure, and domain resolution failures are encountered daily by operations engineers. The symptoms are diverse and often misleading:

User reports "website cannot be opened" while the server still answers ping by IP address – typically the server is healthy and the fault lies in the DNS records or configuration.

Browser shows DNS_PROBE_FINISHED_NXDOMAIN but curl can reach the IP – the authoritative DNS returns NXDOMAIN.

Mobile app works while PC reports network errors – indicates DNS cache or resolver inconsistency.

Some regions succeed while others time out – points to incomplete DNS propagation or CDN scheduling.

After a service migration the new IP is active, yet some users still resolve the old IP – classic DNS cache problem.

Email sending fails with "Connection timed out" or "Host not found" – MX or A records are missing.

Micro‑services suddenly get Unknown host errors – internal DNS service failure.

Ping to a domain alternates between different IPs – CDN smart routing or DNS hijacking.

Core difficulty: DNS is a hierarchical cache and distributed system; the root cause may lie in the local resolver, recursive DNS, authoritative DNS, registrar, CDN, or any intermediate step. Because of caching, the same misconfiguration can manifest differently across locations and times.

The article’s goal is to build a systematic DNS troubleshooting path covering root‑cause identification, temporary fixes, verification methods, and preventive measures.

Tool Preparation

dig – detailed DNS queries (recommended; part of bind-utils on RHEL/CentOS, dnsutils on Debian/Ubuntu).

nslookup – simple DNS queries (usually preinstalled; on Linux it ships in the same package as dig).

host – simplified DNS queries (same package as dig).

resolvectl – query and control systemd‑resolved.

/etc/resolv.conf – view local resolver configuration.

resolvectl statistics – DNS cache statistics (systemd-resolve --statistics on older systemd releases).

rndc – control BIND/named DNS service.

tcpdump – capture DNS traffic.

whois – query domain registration information.

which dig nslookup host resolvectl

dig -v

nslookup -version

Phase 1 – Confirm It Is a DNS Problem

1.1 Use ping and nslookup for a quick distinction

DNS problems affect only name resolution; direct IP connections should still work.

# Ping the domain name first
ping -c 4 www.example.com

# If ping fails, try pinging the resolved IP directly
nslookup www.example.com   # assume it returns 93.184.216.34
ping -c 4 93.184.216.34

# Interpretation
# • IP reachable but domain name not → DNS issue.
# • Both unreachable → possible network or server problem.

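Note: ping itself performs a DNS lookup. If it reports ping: unknown host (or "Name or service not known" on newer systems), local resolution failed completely. If it resolves an IP but the service is still unreachable, DNS is working and the issue lies in the network path, a firewall, or the application itself.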

1.2 Classify common DNS failure types

Both nslookup and dig reveal the failure category directly:

nslookup www.example.com 8.8.8.8
# Type 1: SERVFAIL – the resolver could not obtain a valid answer (authoritative servers down, DNSSEC validation failure)
# Type 2: NXDOMAIN – the name does not exist (expired, unregistered, or record missing)
# Type 3: Connection timed out – DNS server unreachable
# Type 4: NOERROR with empty Answer – the name exists but has no record of the requested type

Quick reference:

SERVFAIL – the resolver could not get a usable answer from the authoritative servers (DNSSEC validation failure, NS down or misbehaving).

NXDOMAIN – domain does not exist (expired, mis‑configured, or not yet propagated).

REFUSED – query rejected (ACL or firewall).

TIMEOUT – no response (network or server down).

NOERROR but empty Answer – the name exists but has no record of the requested type (e.g., querying AAAA when only A exists).
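The categories above can be read straight from the header of a dig response. The following is a minimal sketch, assuming standard dig output; classify_dig_status is a hypothetical helper name, not part of dig, and NODATA and NO_RESPONSE are labels chosen here for illustration.

```shell
# classify_dig_status: hypothetical helper (not part of dig).
# Reads full dig output on stdin and prints one failure category.
classify_dig_status() (
  out=$(cat)
  # The header line looks like: ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1"
  status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
  # The flags line carries the answer count: "...; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ..."
  answers=$(printf '%s\n' "$out" | sed -n 's/.*ANSWER: \([0-9]*\),.*/\1/p' | head -n 1)
  case "$status" in
    NOERROR)
      if [ "${answers:-0}" -eq 0 ]; then echo "NODATA"; else echo "OK"; fi ;;
    NXDOMAIN|SERVFAIL|REFUSED) echo "$status" ;;
    "") echo "NO_RESPONSE" ;;   # no header at all: timeout or unreachable server
    *) echo "OTHER:$status" ;;
  esac
)
```

Typical use would be dig @8.8.8.8 www.example.com | classify_dig_status, which makes the result easy to branch on in a monitoring script.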

Phase 2 – Local DNS Resolver Investigation

2.1 Examine /etc/resolv.conf

cat /etc/resolv.conf
# Typical output
; generated by /usr/sbin/resolvconf, ifupdown
nameserver 10.0.0.2
nameserver 8.8.8.8
search example.com
options timeout:2 attempts:3 rotate

Key points:

nameserver – the first server is tried first; if it fails, the next one is used.

search – appends the domain when a bare hostname is queried.

options timeout – wait time per query in seconds (default 5 s); a lower value fails over faster.

attempts – number of tries per server list (default 2).

rotate – round‑robin usage of the listed nameservers.
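To see the order in which the stub resolver will try servers, the file can be parsed directly. This is a small sketch, assuming a glibc-based system (glibc honours at most three nameserver lines); list_nameservers is a hypothetical helper name.

```shell
# list_nameservers: hypothetical helper that prints nameservers in the
# order the stub resolver tries them. Commented-out lines are skipped.
list_nameservers() (
  file=${1:-/etc/resolv.conf}
  awk '$1 == "nameserver" {
         n++
         # glibc only uses the first three entries
         note = (n > 3) ? " (ignored: glibc uses at most 3)" : ""
         printf "%d %s%s\n", n, $2, note
       }' "$file"
)
```

Calling list_nameservers with no argument reads /etc/resolv.conf; passing a path lets you check a candidate file before putting it in place.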

Note: /etc/resolv.conf is often managed by NetworkManager or dhclient. Direct edits may be overwritten on network restart. To make permanent changes, adjust the network interface configuration (e.g., PEERDNS=no on RHEL/CentOS or edit /etc/systemd/resolved.conf on Ubuntu).

# RHEL/CentOS – edit /etc/sysconfig/network-scripts/ifcfg-eth0
PEERDNS=no
DNS1=8.8.8.8
DNS2=1.1.1.1

# Debian/Ubuntu – edit /etc/network/interfaces or /etc/systemd/resolved.conf
# Then restart the resolver service

2.2 Check systemd‑resolved status (Ubuntu 18.04+, CentOS 8+)

# Show resolver status
resolvectl status

# Sample output (truncated)
Global
  LLMNR setting: yes
  MulticastDNS setting: yes
  DNSSEC setting: allow-downgrade
  DNSSEC supported: yes
  Current DNS Server: 10.0.0.2
  DNS Servers: 10.0.0.2 8.8.8.8 1.1.1.1

Link ens33
  Current Scopes: DNS LLMNR/IPv4
  DNSSEC supported: yes
  DNS Servers: 10.0.0.2 8.8.8.8
  DNS Domain: example.com

Cache statistics help identify whether the resolver is hitting the cache or performing external queries:

# Cache statistics
resolvectl statistics

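# Sample output (abridged; exact fields vary by systemd version)
Transactions
Current Transactions: 0
  Total Transactions: 1200

Cache
  Current Cache Size: 1234
          Cache Hits: 500
        Cache Misses: 1200

# A low hit ratio means most lookups go upstream.
# Where your systemd version reports failure counters, a high value there means many queries are failing.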
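The hit and miss counters are easier to judge as a ratio. A minimal sketch of that arithmetic follows; cache_hit_ratio is a hypothetical helper that takes the two counters as arguments, so it does not depend on the exact output layout of your systemd version.

```shell
# cache_hit_ratio: hypothetical helper.
# Usage: cache_hit_ratio <hits> <misses>  -> integer percentage of cache hits
cache_hit_ratio() (
  hits=$1
  misses=$2
  total=$((hits + misses))
  if [ "$total" -eq 0 ]; then
    echo "n/a"                      # no lookups recorded yet
  else
    echo "$((hits * 100 / total))%" # integer division, rounds down
  fi
)
```

With the sample counters, cache_hit_ratio 500 1200 prints 29%, meaning roughly seven out of ten lookups went to an upstream server.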

2.3 Test a specific public DNS server

# Query Google DNS directly
 dig @8.8.8.8 www.example.com +short

# Query Cloudflare DNS
 dig @1.1.1.1 www.example.com +short

# Query Alibaba DNS (common in China)
 dig @223.5.5.5 www.example.com +short

# Query specific record types
 dig @8.8.8.8 example.com MX +short
 dig @8.8.8.8 example.com NS +short
 dig @8.8.8.8 example.com TXT +short

Decision rule:

If the public DNS returns the correct answer but the local resolver does not, the problem is in the local resolver configuration or network path.

If both fail, the issue lies with the authoritative DNS or the network itself.
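The decision rule can be written down as a small function. This is a sketch only; diagnose_scope is a hypothetical name, and the third branch (answers that differ) is an added assumption that goes slightly beyond the two cases stated above.

```shell
# diagnose_scope: hypothetical helper encoding the decision rule.
# Usage: diagnose_scope "<local answer>" "<public answer>"
# An empty string means the resolver returned no answer.
diagnose_scope() (
  local_ans=$1
  public_ans=$2
  if [ -z "$local_ans" ] && [ -n "$public_ans" ]; then
    echo "local resolver configuration or network path"
  elif [ -z "$local_ans" ] && [ -z "$public_ans" ]; then
    echo "authoritative DNS or the network itself"
  elif [ "$local_ans" != "$public_ans" ]; then
    echo "answers differ: compare against the authoritative server"
  else
    echo "both resolvers agree"
  fi
)
```

In practice the arguments would come from dig, sorted so that record order does not matter, for example diagnose_scope "$(dig +short www.example.com | sort)" "$(dig +short @8.8.8.8 www.example.com | sort)".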

2.4 Trace the full recursive resolution path

# Show each step from root to authoritative server
 dig @8.8.8.8 www.example.com +trace

# Sample output (truncated)
.                     518400 IN NS a.root-servers.net.
.                     518400 IN NS b.root-servers.net.
;; Received 528 bytes

com.                  172800 IN NS a.gtld-servers.net.
com.                  172800 IN NS b.gtld-servers.net.
;; Received 548 bytes

example.com.          172800 IN NS a.iana-servers.net.
example.com.          172800 IN NS b.iana-servers.net.
;; Received 89 bytes

www.example.com.      300 IN A 93.184.216.34

# If the trace stalls at the TLD step, the delegation at the registry (NS or glue records) is missing, wrong, or not yet published.
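When a trace is long, it helps to list only the servers that actually answered at each step. The sketch below assumes the full "Received N bytes from IP#53(name) in X ms" lines that dig prints (the sample above shows them truncated); trace_hops is a hypothetical helper name.

```shell
# trace_hops: hypothetical helper.
# Reads `dig +trace` output on stdin and prints the name of the server
# that answered each step, one per line, in order.
trace_hops() {
  sed -n 's/^;; Received [0-9]* bytes from [^(]*(\([^)]*\)).*/\1/p'
}
```

For example, dig www.example.com +trace | trace_hops | tail -n 1 shows the last server that responded, which is where to look when the trace stops short of the final answer.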

Phase 3 – Authoritative DNS Server Investigation

3.1 Verify NS records

# List NS records for the domain
 dig example.com NS

# Verify the A records of the NS servers themselves
 dig @8.8.8.8 ns1.example.com A
 dig @8.8.8.8 ns2.example.com A

# Cross‑check with registrar information
 whois example.com | grep -E "Name Server|NS-Name"

Common mistakes:

Registrar lists ns1.example.com but the NS server’s A record is missing or points to the wrong IP.

After a domain migration, the NS records were updated but the old NS servers are still live, causing mixed answers.

3.2 Check domain expiration or hold status

# If dig returns NXDOMAIN but you know the domain is valid:
 whois example.com | grep -E "Domain Status|Expir|Domain Expir|Registrar|Registrant"

# Typical hold statuses
#   clientHold – registrar has suspended the domain
#   serverHold – registry has suspended the domain
#   redemptionPeriod – expired domain awaiting restore or deletion (typically 30 days)
#   inactive – no name servers are delegated for the domain
#   pendingDelete – will be deleted soon (usually 5 days)

If the status is clientHold or serverHold, the domain must be re‑activated with the registrar; DNS configuration is not the cause.
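The status check is easy to automate. A minimal sketch, assuming the registry's whois output uses the common "Domain Status:" label (some TLDs format this differently); find_blocking_status is a hypothetical helper name.

```shell
# find_blocking_status: hypothetical helper.
# Reads whois output on stdin and prints any status that stops the
# domain from resolving. Prints nothing if none is present.
find_blocking_status() {
  grep -ioE 'Domain Status: *(clientHold|serverHold|redemptionPeriod|pendingDelete|inactive)' \
    | sed 's/.*: *//' \
    | sort -u
}
```

A typical call is whois example.com | find_blocking_status; empty output means none of the blocking statuses were found and the investigation should move on to the DNS configuration.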

3.3 Diagnose DNSSEC misconfiguration

DNSSEC validation failures produce SERVFAIL.

# Ask the resolver to skip DNSSEC validation (+cd sets the "checking disabled" bit)
 dig @8.8.8.8 www.example.com +cd

# If this works while the same query without +cd returns SERVFAIL, DNSSEC signatures are the problem.

# Inspect DS and DNSKEY records
 dig @8.8.8.8 example.com DS +short
 dig @8.8.8.8 example.com DNSKEY +short

Typical causes:

The DS record published at the registrar does not match the zone's current DNSKEY.

DNSSEC was disabled on the server but the DS record is still published at the parent.

KSK rollover without proper key transition.

Fix: either correct the DS/DNSKEY chain at the registrar, or temporarily remove the DS record there so resolvers treat the zone as unsigned until the chain is restored.
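A common way to confirm a validation failure is to compare the response status with and without dig's +cd option, which sets the "checking disabled" bit so the resolver skips DNSSEC validation. The following is a sketch under that assumption; dig_status and dnssec_verdict are hypothetical helper names.

```shell
# dig_status: hypothetical helper, prints the status field of a dig response.
# Usage: dig_status @8.8.8.8 www.example.com [+cd]
dig_status() {
  dig "$@" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# dnssec_verdict: hypothetical helper comparing the two statuses.
# Usage: dnssec_verdict <status with validation> <status with +cd>
dnssec_verdict() (
  normal=$1
  unchecked=$2
  if [ "$normal" = "SERVFAIL" ] && [ "$unchecked" = "NOERROR" ]; then
    echo "likely DNSSEC validation failure"
  elif [ "$normal" = "$unchecked" ]; then
    echo "DNSSEC validation not implicated (both $normal)"
  else
    echo "inconclusive ($normal vs $unchecked)"
  fi
)
```

Usage would look like dnssec_verdict "$(dig_status @8.8.8.8 www.example.com)" "$(dig_status @8.8.8.8 www.example.com +cd)".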

3.4 Review SOA record and refresh timers

# Retrieve the SOA record
 dig example.com SOA

# Important fields:
# serial – version number, used to detect updates
# refresh – how often secondary servers query the primary (seconds)
# retry – retry interval after a failed refresh
# expire – how long a secondary keeps data without a successful refresh
# minimum – negative-caching TTL: how long resolvers cache NXDOMAIN/empty answers (RFC 2308)

# Example SOA
example.com. 86400 IN SOA ns1.example.com. admin.example.com. (
        2026052501 ; Serial (YYYYMMDDNN)
        7200       ; Refresh (2 h)
        3600       ; Retry (1 h)
        1209600    ; Expire (14 d)
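        86400 )    ; Negative-cache TTL (1 d)

If the serial does not increase after a migration, secondary servers will not transfer the zone and will keep serving stale records. In modern resolvers the minimum field does not override record‑level TTLs; it sets how long NXDOMAIN and empty answers are cached, so a high value makes a newly added record appear missing for longer.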
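Stale secondaries show up as differing serials across the name servers. The sketch below assumes serials only ever increase (it ignores serial wrap-around); check_serials is a hypothetical helper that reads "nameserver serial" pairs.

```shell
# check_serials: hypothetical helper.
# Reads lines of "<nameserver> <serial>" on stdin, reports servers that
# lag behind the highest serial, and exits non-zero if any do.
check_serials() {
  awk '{ serial[$1] = $2; if ($2 + 0 > max + 0) { max = $2 } }
       END {
         bad = 0
         for (ns in serial)
           if (serial[ns] + 0 != max + 0) {
             printf "%s lags at serial %s (newest is %s)\n", ns, serial[ns], max
             bad = 1
           }
         if (!bad) printf "all servers at serial %s\n", max
         exit bad
       }'
}
```

The input can be built from dig, where the serial is the third field of the short SOA answer: for ns in $(dig example.com NS +short); do echo "$ns $(dig @"$ns" example.com SOA +short | awk '{print $3}')"; done | check_serials.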

3.5 Check DNS propagation status

# Query several public resolvers and compare results
 for ns in 8.8.8.8 1.1.1.1 223.5.5.5 114.114.114.114; do
   echo -n "DNS $ns: ";
   dig +short @$ns www.example.com;
 done

# Use online tools such as https://www.whatsmydns.net/#A/www.example.com
# If results differ, some resolvers are still serving cached data; they converge once the old TTL expires (minutes with a low TTL; DNSSEC DS changes may need 24‑48 h).

Phase 4 – Common Root Causes and Remediation

Root Cause 1: Local DNS cache serving stale IP

Typical scenario: After moving a service to a new IP, some users still receive the old IP and see 502/504 errors.

# Compare authoritative answer with local resolver
 dig @ns1.example.com www.example.com +short   # authoritative
 dig @10.0.0.2 www.example.com +short        # local resolver
 dig @8.8.8.8 www.example.com +short         # public DNS

# If local answer differs, the local cache is stale.

Fixes:

Clear the resolver cache (e.g., sudo resolvectl flush-caches, or sudo systemd-resolve --flush-caches on older systemd releases; sudo rndc flush for BIND).

Lower the domain TTL to 300 s (5 min) at least 48 h before migration, then raise it back after the change.

Perform an IP‑level smooth migration (keep old and new A records simultaneously, then drop the old one after verification).
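When a cache cannot be flushed (for example, an ISP resolver), the remaining TTL in its answer tells you how long the stale record can survive there. A minimal sketch, assuming dig's standard answer-section layout (name, TTL, class, type, data); max_remaining_ttl is a hypothetical helper name.

```shell
# max_remaining_ttl: hypothetical helper.
# Reads dig answer lines on stdin and prints the largest remaining TTL
# in seconds, i.e. the longest this resolver may keep serving the answer.
max_remaining_ttl() {
  awk '$3 == "IN" && $2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}
```

For example, dig @10.0.0.2 www.example.com +noall +answer | max_remaining_ttl prints the number of seconds until that resolver has to fetch a fresh answer.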

Root Cause 2: DNS record configuration errors

Typical mistakes include:

A record points to a private IP (10/172/192.168) – external users cannot reach the service.

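Mixing CNAME and A records incorrectly – a CNAME cannot coexist with other records at the same name (RFC 1034), so the apex domain, which always carries SOA and NS records, cannot be a CNAME.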

MX records pointing to an IP instead of a hostname.

TTL set to 0 – some resolvers mishandle zero TTL, causing stale caches.

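Inconsistent multiple A records – if one of several A records points to a dead or outdated address, round‑robin sends a share of clients there and load‑balancing expectations break.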

# Example of a correct apex record
example.com. 300 IN A 93.184.216.34

# Correct CNAME for a sub‑domain
www.example.com. 300 IN CNAME example.com.

# Proper MX configuration
example.com. 300 IN MX 10 mail.example.com.
mail.example.com. 300 IN A 93.184.216.34
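The first mistake in the list, a public record pointing at a private address, is simple to detect mechanically. This sketch checks an IPv4 address against the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16); is_private_ipv4 is a hypothetical helper name.

```shell
# is_private_ipv4: hypothetical helper.
# Returns 0 (true) if the IPv4 address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    # 172.16.0.0/12 covers 172.16.x.x through 172.31.x.x
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It can be applied to every address a public resolver returns, for example: for ip in $(dig @8.8.8.8 www.example.com +short); do is_private_ipv4 "$ip" && echo "private address published: $ip"; done.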

Root Cause 3: DNS propagation delay

After updating records, some resolvers still return the old data.

# Verify each NS server returns the same answer
 for ns in $(dig example.com NS +short); do
   echo "NS $ns:";
   dig @$ns www.example.com +short;
 done

# If any NS returns a different IP, propagation is incomplete.

Remediation steps:

Confirm the SOA serial has incremented.

Ensure the DNS provider’s console shows "configuration effective".

If DNSSEC is used, verify the DS record is correctly published.

If after 48 h some users still see old data, contact the registrar.

Root Cause 4: NS server unavailability

Typical symptoms: registrar lists NS servers that are unreachable or mis‑configured.

# List NS records
 dig example.com NS

# Test each NS with a real query (more reliable than a UDP port probe)
 for ns in $(dig example.com NS +short); do
   echo -n "Testing $ns (port 53): ";
   dig @"$ns" example.com SOA +norecurse +time=2 +tries=1 +short || echo "no response";
 done

# Verify authoritative answers
 dig @ns1.example.com www.example.com +short
 dig @ns2.example.com www.example.com +short

Common issues:

NS hostname has no A record (the server cannot be reached).

Firewall blocks UDP/TCP 53.

NS server overloaded (e.g., during DDoS).

Fixes include correcting the NS A records at the registrar, ensuring port 53 is open, or switching to a reliable third‑party DNS provider.

Root Cause 5: Network‑level blocking of DNS (UDP 53)

Symptoms: DNS works on some networks but fails behind a corporate firewall, VPN, or container network.

# Test UDP with a real query (nc -u can report success even when packets are dropped)
 dig @8.8.8.8 www.example.com +time=2 +tries=1

# If UDP fails, try TCP (DNS must support TCP)
 dig @8.8.8.8 +tcp www.example.com

# Capture traffic to see if packets are dropped
 sudo tcpdump -i eth0 -n port 53 -c 20

# Trigger a query while capturing
 dig @8.8.8.8 www.example.com

If UDP is blocked but TCP works, the firewall is the culprit. Consider using DoH or DoT as a bypass:

# Cloudflare DoH via dig (the +https option requires BIND 9.18 or later)
 dig @1.1.1.1 www.example.com +https

# Cloudflare DoT via systemd‑resolved
# Edit /etc/systemd/resolved.conf and set, under [Resolve]:
DNS=1.1.1.1
DNSOverTLS=yes
# Then restart the resolver
sudo systemctl restart systemd-resolved

Root Cause 6: Search‑engine cache serving old IP

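Even when DNS is correct, search‑engine crawlers and cached result pages may keep pointing at the old IP until the site is recrawled. This is not a DNS fault, so rule out local caches first.

ping shows the old IP while dig returns the new IP – ping goes through the system resolver (/etc/hosts, nscd, systemd‑resolved) while dig queries DNS directly, so check those local sources before blaming a search engine.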

Test from multiple regions; some see the new IP, others the old.

Resolution:

Submit a cache‑purge request in Baidu Search Console.

Submit a removal request in Google Search Console.

Wait 24‑72 h for the engine to recrawl.

Phase 5 – Advanced Diagnostic Tools

5.1 Capture DNS traffic with tcpdump

# Capture DNS packets (UDP 53) to a file
 sudo tcpdump -i eth0 -n port 53 -w /tmp/dns_capture.pcap

# In another terminal, trigger a query
 dig @8.8.8.8 www.example.com

# Stop capture (Ctrl+C) and analyse
 tcpdump -r /tmp/dns_capture.pcap -n -v | head -50

# Or use tshark for a structured view
 sudo tshark -i eth0 -f "port 53" -Y "dns.flags.response == 1" -T fields \
   -e dns.qry.name -e dns.a -e dns.flags.rcode

Key things to look for:

Whether the request packet is sent (correct source IP).

Whether a response packet arrives (correct destination IP).

Response code (0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN).

Retransmissions – indicate network loss or slow server.

Round‑trip time – >1 s suggests a slow resolver.
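Captures and tshark field output show the response code as a number. A small lookup makes them readable; the values are the standard DNS RCODEs, and rcode_name is a hypothetical helper name.

```shell
# rcode_name: hypothetical helper mapping a numeric DNS RCODE to its name.
rcode_name() {
  case "$1" in
    0) echo NOERROR ;;
    1) echo FORMERR ;;
    2) echo SERVFAIL ;;
    3) echo NXDOMAIN ;;
    4) echo NOTIMP ;;
    5) echo REFUSED ;;
    *) echo "RCODE$1" ;;   # less common codes are printed as-is
  esac
}
```

For instance, rcode_name 5 prints REFUSED, which points at an ACL or firewall rule rather than a missing record.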

5.2 Inspect DNS server logs (BIND example)

# View recent BIND logs
 sudo journalctl -u named -n 50 --no-pager

# Enable query logging if not already on
 sudo rndc querylog on

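# Find the log file location (often /var/log/named/query.log); look for a logging channel and "category queries"
 grep -nE "channel|category queries|file " /etc/named.conf
 sudo tail -f /var/log/named/query.log

# Typical error messages:
#   "too many queries" – a rate or quota limit was exceeded; review the rate-limit and recursive-clients settings.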
#   "network unreachable" – recursive lookup cannot reach upstream.
#   "SERVFAIL" – DNSSEC validation failure or upstream NS not responding.

5.3 Use online DNS probing services

DNSPod diagnostic: https://www.dnspod.cn/services/dns%20%E6%A3%80%E6%B5%8B

Alibaba Cloud DNS check: https://www.alidns.com/diagnose

Global propagation checker: https://www.whatsmydns.net/#A/www.example.com

Programmatic check with curl:

curl -s "https://www.dnschecker.com/?query=www.example.com&rtype=A" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

Phase 6 – Post‑Fix Validation

After applying a fix, verify the resolution from multiple angles:

# 1. Authoritative DNS (bypass cache)
 dig @ns1.example.com www.example.com +short

# 2. Public resolvers
 dig @8.8.8.8 www.example.com +short
 dig @1.1.1.1 www.example.com +short
 dig @223.5.5.5 www.example.com +short

# 3. Local resolver (if you have one)
 dig @10.0.0.2 www.example.com +short

# 4. Cross‑tool verification
 nslookup www.example.com
 host www.example.com
 resolvectl query www.example.com

# 5. Ensure all NS servers agree
 for ns in $(dig example.com NS +short); do
   echo "NS $ns:";
   dig @$ns www.example.com +short;
 done

# 6. If the service is HTTP, check accessibility
 curl -I https://www.example.com \
   -w '\nHTTP Code: %{http_code}\nDNS Lookup: %{time_namelookup}s\nTotal: %{time_total}s\n' \
   --connect-timeout 5 --max-time 10

# 7. For mail services, verify MX and SMTP connectivity
 dig @8.8.8.8 example.com MX +short
 nc -vz mail.example.com 25

# 8. If DNSSEC is enabled, validate the chain (look for the "ad" flag and RRSIG records)
 dig @8.8.8.8 www.example.com +dnssec | grep -E "flags:|RRSIG"

Production DNS Operational Guidelines

Preparation before any DNS change

Backup current configuration : screenshot all records in the registrar console or export the zone file (on BIND, sudo rndc zonestatus example.com shows the zone file path and serial; copy that file and validate the copy with named-checkzone).

Choose a low‑traffic window : typically late night or early morning; DNS changes may take hours to propagate globally.

Define a rollback plan : keep the original zone file and a one‑click rollback script.

Notify stakeholders : DNS changes can affect multiple services.

Risk‑control measures

Add a test record first : create test.example.com, verify it resolves, then modify the production record.

Add‑then‑remove principle : add the new IP while keeping the old one, verify traffic, then delete the old IP.

TTL strategy :

Normal operation: TTL = 3600 s (1 h).

Pre‑migration (48 h window): TTL = 300 s (5 min) to accelerate cache expiry.

Post‑migration: raise TTL back to 3600 s.

Prohibited actions in production DNS

Deleting NS records without fully understanding impact.

Setting TTL to 0 unless you are certain all resolvers handle it correctly.

Having different answers on different NS servers – they must stay consistent.

Editing zone files directly with a text editor and overwriting them without using rndc or API.

Bulk modifying many records at once; change in batches and verify each batch.

Failure‑Postmortem Template

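[Postmortem] DNS resolution failure

Affected domain:
Time detected:
Impact (which users, which services):
Duration:

Symptoms:
- dig/nslookup results:
- HTTP access results:
- Comparison of answers from different DNS servers:

Investigation:
1. [command] Confirm the failure type: [NXDOMAIN / SERVFAIL / timeout / inconsistent IP]
2. [command] Rule out local problems: [test result, e.g. query via @8.8.8.8 is normal]
3. [command] Check authoritative DNS: [test result]
4. [command] Check whois domain status: [status]
5. [command] Check DNS propagation: [comparison of answers from each NS]
6. [command] Check DNSSEC: [test result]

Root cause: [specific description]

Remediation:
1. [specific action taken]
2. [specific action taken]

Verification:
- Authoritative DNS check:
- Public DNS check:
- HTTP access check:

Preventive measures:
1. Routine DNS health checks (monitor NS availability and SOA serial changes)
2. Reminder before domain expiry
3. Reminder before DNSSEC signature/key expiry
4. ...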

CDN‑Specific DNS Issues

When a CDN is in use, DNS does more than map a name to an IP – it also performs geo‑based scheduling. Problems can arise from mis‑configured CNAMEs or origin settings.

CDN scheduling failure causing slow access

Typical symptom: users experience very slow page loads, yet dig returns a valid IP. The returned IP may belong to a distant CDN node or an unavailable edge.

# Compare resolution from different locations
 dig @ns.bjtelecom.net www.example.com +short   # Beijing DNS
 dig @ns.shtelecom.com www.example.com +short   # Shanghai DNS
 dig @8.8.8.8 www.example.com +short           # Global DNS

# For Cloudflare, edge IPs commonly fall in ranges such as 104.16.0.0/13 and 172.64.0.0/13 (check the provider's published IP list).
# For Alibaba CDN, IPs belong to Alibaba public ranges.

# Verify the CNAME points to the CDN‑provided domain
 dig www.example.com CNAME +short

# Example of a correct CDN CNAME
# www.example.com CNAME www.example.com.w.kunluncan.com.

Fixes include correcting the CNAME to the CDN‑provided hostname and ensuring the origin IP/hostname is reachable.
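Checking that the CNAME really ends in the CDN's domain can be scripted. A minimal sketch follows; cname_has_suffix is a hypothetical helper, and the suffix you pass in (kunluncan.com in the example above) must come from your own CDN console.

```shell
# cname_has_suffix: hypothetical helper.
# Usage: cname_has_suffix <cname target> <expected CDN domain suffix>
# Returns 0 if the target is a subdomain of the suffix. Trailing dots
# on either argument are ignored.
cname_has_suffix() {
  case "${1%.}" in
    *".${2%.}") return 0 ;;
    *) return 1 ;;
  esac
}
```

Typical use: cname_has_suffix "$(dig www.example.com CNAME +short)" kunluncan.com || echo "CNAME does not point at the CDN".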

DNSSEC deep‑dive troubleshooting

DNSSEC ensures the authenticity of DNS data. Misconfiguration leads to SERVFAIL. Verify the chain step by step:

# 1. Does the domain have DNSSEC?
 dig @8.8.8.8 example.com DNSKEY +short

# 2. Retrieve DS record from the parent zone
 dig @8.8.8.8 example.com DS +short

# 3. Verify RRSIG signatures
 dig @8.8.8.8 +dnssec www.example.com A

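# 4. Use drill or delv for full validation
 drill -TD www.example.com          # trace from the root with DNSSEC records
 delv @8.8.8.8 www.example.com A    # prints "; fully validated" on success

# 5. Check that the resolver trusts the root DNSKEY (usually pre‑installed).
 # For custom resolvers, ensure trust anchors are configured
 # (trust-anchors in BIND 9.16+, trusted-keys or managed-keys in older releases).
 grep -rE "trust-anchors|trusted-keys|managed-keys" /etc/named.conf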

If DNSSEC is the culprit, the quickest mitigation is to remove the DS record at the registrar (which disables DNSSEC for the domain), verify service restoration, then correctly re‑publish the DS record and DNSKEY chain.
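After the chain is republished, a validating resolver signals success by setting the ad (authenticated data) flag in its response. The sketch below checks for that flag in dig output; has_ad_flag is a hypothetical helper name.

```shell
# has_ad_flag: hypothetical helper.
# Reads dig output on stdin; returns 0 if the response header carries
# the "ad" flag, meaning the resolver validated the answer with DNSSEC.
has_ad_flag() {
  sed -n 's/^;; flags: \([^;]*\);.*/\1/p' | head -n 1 | grep -qw 'ad'
}
```

For example: dig @8.8.8.8 www.example.com +dnssec | has_ad_flag && echo "validated". A missing flag on a signed zone usually means the DS record is not yet visible to that resolver.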

Summary of the DNS Troubleshooting Path

Ping/IP direct test → dig/nslookup (type, error code, answer) →
Query different DNS servers (local vs public) → dig +trace (full recursion) →
whois (domain hold, expiration) → tcpdump (packet flow) →
Root‑cause remediation (cache, config error, propagation, network block) →
Multi‑point verification (authoritative, public, local, multiple tools)

Key DNS failure types and quick checks:

NXDOMAIN – check registrar status with whois.

SERVFAIL – retry with dig +cd to rule out DNSSEC validation, then inspect server logs.

REFUSED – verify ACLs and firewall rules.

TIMEOUT – test UDP/TCP connectivity with dig and dig +tcp.

IP mismatch across regions – investigate propagation and NS consistency.

Partial user impact – use global probing tools to locate regional DNS or CDN issues.

By following a layered approach (local resolver, recursive path, authoritative servers, registrar, DNSSEC, and network), you can pinpoint most DNS problems within 5–15 minutes.

