How JD.com Scales Network Monitoring for Massive Traffic Peaks
This article explains how JD.com’s network team continuously optimizes its large‑scale infrastructure, designs its monitoring strategy, and implements practical monitoring solutions. It also outlines future directions for improving network availability, fault detection, and operational efficiency across data centers and the internet backbone.
1. JD.com Network Status
JD.com’s traffic grew rapidly from 2014 to 2017. DCI (data center interconnect, carried over dedicated lines) traffic doubled during the 2017 618 promotion, driven by big‑data and log‑analysis workloads. Independent data centers for individual businesses have also emerged, with differing hardware, performance, and reliability requirements.
Key architectural upgrades include a nationwide 100 Gbps backbone spanning Beijing, Shanghai, and Guangzhou, a rebuilt internet access layer with dual‑core BGP, and a transition from a four‑core to a dual‑core DCN design to improve scalability and manageability.
2. Monitoring Design Considerations
2.1 Define Monitoring Goals
Determine what “good” network performance means.
Accurately detect anomalies on core metrics.
Rapidly classify issues and trigger appropriate responses.
2.2 Define “Good” Network Standards
Network health must be judged from the user’s perspective, focusing on service availability rather than merely device status.
2.3 Effective Perception Methods
Adopt black‑box monitoring that simulates user experience while still leveraging white‑box data, prioritizing the most severe and frequent faults.
2.4 Incident Handling and Decision Mechanism
Distinguish between self‑healing issues and those requiring manual intervention, and establish clear escalation procedures.
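The split between self‑healing issues and those needing manual intervention can be sketched as a small triage function. This is a minimal illustration, not JD.com's actual mechanism: the Alert fields, the 120-second grace window, the severity scale, and the action names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

GRACE_SECONDS = 120  # assumed window in which an alert may clear on its own


@dataclass
class Alert:
    name: str
    raised_at: float                    # seconds since epoch
    cleared_at: Optional[float] = None  # None means the alert is still active
    severity: int = 1                   # hypothetical scale: 1 (minor) to 3 (critical)


def triage(alert: Alert, now: float) -> str:
    """Return a hypothetical action label for an alert."""
    if alert.cleared_at is not None:
        # The alert cleared; decide whether it did so quickly enough to ignore.
        duration = alert.cleared_at - alert.raised_at
        return "self-healed" if duration <= GRACE_SECONDS else "post-review"
    if alert.severity >= 3:
        # Critical and still active: escalate immediately.
        return "page-on-call"
    if now - alert.raised_at > GRACE_SECONDS:
        # Still active after the grace window: needs a human.
        return "open-ticket"
    return "watch"
```

For example, a link flap that clears after 30 seconds is labeled "self-healed", while a minor alert still active after the grace window becomes "open-ticket". In practice the thresholds would be tuned per alert type.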
3. JD.com Monitoring Practices
3.1 Preparation
Deploy AAA for device access management, NTP for time synchronization, SNMP for data collection, and Syslog for post‑event analysis. Maintain a CMDB, including a manually curated inventory of critical interfaces.
3.2 Core Monitoring
Track real‑time traffic on internet exits, POD uplinks, and DCI links, along with 24‑hour peaks, traffic ratios, totals of Syslog messages, packet drops, and CRC errors, application performance alerts, and overall device health.
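As an illustration of how real‑time link traffic is typically derived from SNMP data, the sketch below turns two samples of an interface octet counter (such as the 64-bit ifHCInOctets from IF-MIB) into a utilization ratio. The function name, polling interval, and sample values are assumptions for the example; the article does not describe JD.com's collector.

```python
COUNTER64_MAX = 2 ** 64  # 64-bit SNMP counters wrap around at this value


def utilization(prev_octets: int, curr_octets: int, interval_s: float,
                link_bps: float, counter_max: int = COUNTER64_MAX) -> float:
    """Return link utilization (0.0 to 1.0) between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction handles a single counter wrap between samples.
    delta_octets = (curr_octets - prev_octets) % counter_max
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps


# Hypothetical sample: a 100 Gbps link polled 60 seconds apart.
ratio = utilization(0, 375_000_000_000, 60, 100e9)
print(f"utilization: {ratio:.0%}")
```

The 24‑hour peak is then simply the maximum of these per-interval ratios over the last day of samples.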
3.3 Internet Quality Cases
Examples show ISP‑specific outages, high utilization on specific internet exits, and spikes in Syslog alerts, illustrating how visual dashboards help pinpoint problems quickly.
3.4 DCN Quality Cases
Pingmesh‑style black‑box monitoring reveals packet loss and latency inside the data center, uncovering problems in parts of the network that were previously assumed to be stable.
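The core of a Pingmesh‑style check can be sketched as an aggregation over probe results between server pairs. This is a simplified illustration under assumed inputs: each probe is a (source, destination, RTT in milliseconds) tuple, with None marking a lost probe, and the thresholds are placeholders rather than values from the article.

```python
from collections import defaultdict
from statistics import median


def summarize(probes):
    """Group probes by (src, dst) and compute loss rate and median RTT."""
    buckets = defaultdict(list)
    for src, dst, rtt_ms in probes:
        buckets[(src, dst)].append(rtt_ms)
    summary = {}
    for pair, rtts in buckets.items():
        answered = [r for r in rtts if r is not None]
        summary[pair] = {
            "loss": 1 - len(answered) / len(rtts),
            "median_ms": median(answered) if answered else None,
        }
    return summary


def suspects(summary, max_loss=0.01, max_median_ms=1.0):
    """Return pairs whose loss or median latency exceeds the assumed limits."""
    bad = []
    for pair, stats in summary.items():
        too_slow = stats["median_ms"] is None or stats["median_ms"] > max_median_ms
        if stats["loss"] > max_loss or too_slow:
            bad.append(pair)
    return sorted(bad)


# Hypothetical probe results between two servers.
probes = [("a", "b", 0.2), ("a", "b", None), ("a", "b", 0.4),
          ("a", "b", 0.3), ("b", "a", 0.2)]
print(suspects(summarize(probes)))
```

Because every pair is probed continuously, a path that looks healthy from device counters can still surface here as a pair with nonzero loss.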
4. Future Outlook
Monitoring will evolve from simple fault detection to an automation‑enabling platform that frees engineers from repetitive analysis, improves network availability, and supports large‑scale operations. Emphasis will shift toward internet quality improvements and deeper insight into data‑center network health.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.