Design and Evolution of Bilibili Intranet DNS Service
The article details Bilibili's internal DNS service evolution—from an initial BIND9 master‑slave setup to a multi‑level caching architecture that raises single‑instance QPS above 1.5 million—while describing comprehensive host, business, and client monitoring, key configuration pitfalls, and best‑practice recommendations for a low‑latency, reliable intranet DNS.
Domain Name System (DNS) acts as the Internet's address book, mapping easy‑to‑remember domain names to hard‑to‑memorize IP addresses and providing services such as load balancing. An internal (intranet) DNS service adds private domains, DNS hijacking for internal routing and security, specific business‑logic support, and ultra‑low latency with high throughput.
The article shares the practice of building Bilibili's internal DNS service.
Architecture Evolution
Initially, the team chose BIND9, the most widely used DNS implementation, and deployed two roles:
Authoritative Name Server (primary and secondary) for final domain resolution.
Caching Name Server (Resolver) to handle recursive queries, cache responses, and reduce load on authoritative servers.
The first‑generation architecture used a simple master‑slave model with VIP‑based load balancing across IDC data centers. As traffic grew, latency spikes and the limited scalability of secondary servers prompted a redesign.
The second‑generation architecture introduced multi‑level caching: dedicated Caching Name Servers for horizontal scalability, plus NSCD as a client‑side cache for high‑QPS services (e.g., big data, AI). BIND9 was upgraded to a newer version supporting SO_REUSEPORT and log buffering, raising single‑instance QPS from ~100k to over 1.5 million.
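A dedicated caching layer of this kind is configured through BIND's recursion and forwarding options. The fragment below is a minimal illustrative sketch, not Bilibili's actual configuration; all addresses, sizes, and network ranges are hypothetical (in BIND 9.16+, SO_REUSEPORT‑based distribution of queries across worker threads is handled automatically by the network manager):

```
// named.conf — illustrative caching resolver (all values hypothetical)
options {
    directory "/var/named";
    recursion yes;                          // act as a Caching Name Server
    allow-recursion { 10.0.0.0/8; };        // serve intranet clients only
    forwarders { 10.1.1.10; 10.1.1.11; };   // upstream authoritative servers
    forward first;                          // fall back to recursion if they fail
    max-cache-size 4096M;                   // cache sizing for high QPS
};
```

Scaling then becomes a matter of adding identical caching instances behind the VIP, since each holds only soft (re‑fetchable) state.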
DNS Service Monitoring
Monitoring is divided into three layers:
Host layer – CPU, memory, network, disk usage; alerts for single‑core CPU or single NIC overload.
Business layer – BIND internal metrics via statistics‑channels (custom exporter replaces bind_exporter), zone record change rate alerts, and BIND error‑log monitoring.
Client layer – Probes from multiple data centers simulate real user requests, checking availability, content correctness, latency, and packet loss; also monitor public DNS stability.
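The business‑layer metrics mentioned above come from BIND's statistics‑channels, which an exporter can scrape over HTTP. A minimal hedged sketch of enabling it (the address and port are hypothetical):

```
// named.conf — expose BIND internal statistics over HTTP
statistics-channels {
    // Bind to loopback only and restrict access, since the
    // endpoint reveals zone and query internals.
    inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};
```

A custom exporter can then poll the XML or JSON endpoint (e.g., `http://127.0.0.1:8053/json/v1` on recent BIND 9 releases) and translate counters such as query rates and cache hit ratios into time‑series metrics.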
Pitfalls and Best Practices
Ensure both UDP and TCP port 53 are reachable; a plain (non‑EDNS0) DNS response over UDP is limited to 512 bytes, so larger responses are truncated and must be retried over TCP.
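The UDP‑to‑TCP fallback is signaled by the TC (truncation) bit in the DNS header, which is why blocking TCP port 53 silently breaks large responses. A small self‑contained sketch of how a client detects truncation (the hand‑crafted headers are illustrative, not captured traffic):

```python
import struct

def is_truncated(dns_message: bytes) -> bool:
    """Return True if the DNS header's TC (truncation) bit is set.

    A resolver seeing TC=1 on a UDP response is expected to retry
    the query over TCP, so both transports must be reachable.
    """
    if len(dns_message) < 12:          # the DNS header is 12 bytes (RFC 1035)
        raise ValueError("message too short for a DNS header")
    (flags,) = struct.unpack_from("!H", dns_message, 2)  # bytes 2-3 are flags
    return bool(flags & 0x0200)        # TC is bit 9 of the flags word

# Hand-crafted 12-byte headers: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
# 0x8200 sets QR (response) and TC; 0x8000 sets QR only.
truncated_hdr = struct.pack("!HHHHHH", 0x1234, 0x8200, 1, 0, 0, 0)
normal_hdr    = struct.pack("!HHHHHH", 0x1234, 0x8000, 1, 0, 0, 0)
print(is_truncated(truncated_hdr))  # True
print(is_truncated(normal_hdr))     # False
```

Probing both transports (e.g., a query with and without forcing TCP) is a cheap client‑layer health check for exactly this failure mode.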
After any zone change, increment the SOA serial number; otherwise secondaries see an unchanged serial, skip the zone transfer, and master‑slave synchronization silently fails.
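Secondaries decide whether to transfer a zone purely by comparing serial numbers, so the serial is the field that must change. An illustrative zone fragment (the zone and host names are hypothetical) using the common date‑based serial convention:

```
; zone file for example.internal (hypothetical zone)
@   IN SOA ns1.example.internal. admin.example.internal. (
        2024061501 ; serial — bump on EVERY change, e.g. YYYYMMDDnn
        3600       ; refresh — how often secondaries poll the primary
        600        ; retry   — poll retry interval after a failure
        604800     ; expire  — secondaries stop serving after this
        300 )      ; minimum / negative-caching TTL
```

If a record is edited but the serial stays at `2024061501`, the secondaries consider themselves up to date and keep serving stale data until the zone expires.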
Avoid rndc flush, which drops the entire cache and triggers a surge of recursive queries to upstream servers; prefer rndc flushname or rndc flushtree for targeted invalidation.
Use wildcard records cautiously: adding any record (e.g., TXT) at a specific name under a wildcard excludes that name from wildcard synthesis, so lookups for other types (A/AAAA) at that name return NODATA instead of the wildcard answer and can break access unless an explicit A/AAAA or CNAME is also added.
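This shadowing behavior (per RFC 4592) is easiest to see in a zone fragment; the names and addresses below are hypothetical:

```
; hypothetical zone illustrating wildcard shadowing
*.svc.example.internal.   IN A   10.2.3.4
; Adding ANY record at a specific name removes that name from
; wildcard coverage for ALL record types:
api.svc.example.internal. IN TXT "owner=platform"
; An A query for api.svc.example.internal now returns NOERROR with an
; empty answer (NODATA), not 10.2.3.4 — clients that relied on the
; wildcard break until an explicit A record is added:
api.svc.example.internal. IN A   10.2.3.4
```

The safe habit is to always pair any specific record under a wildcard with the explicit A/AAAA or CNAME the clients depend on.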
Conclusion
Robust infrastructure services like internal DNS act as levers for business efficiency, reducing development and operational costs. Continuous evolution based on business needs ensures a stable, reliable, and easy‑to‑use DNS service.
References include the BIND 9 Administrator Reference Manual, RFC 1035, RFC 1912, and ISC knowledge‑base articles.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.