How JD.com Scales Network Monitoring for Massive Traffic Peaks

This article explains how JD.com's network team continuously optimizes its large-scale infrastructure, designs its monitoring strategy, and puts that strategy into practice. It also outlines future directions for improving network availability, fault detection, and operational efficiency across data centers and the internet backbone.

Efficient Ops

Nov 20, 2017

1. JD.com Network Status

JD.com's traffic grew rapidly from 2014 to 2017. DCI (data center interconnect) traffic over dedicated lines doubled during the 2017 618 promotion, driven by big-data and log-analysis workloads. Data centers dedicated to individual businesses have also emerged, each with its own hardware, performance, and reliability requirements.

Key architectural upgrades include a nationwide 100 Gbps backbone spanning Beijing, Shanghai, and Guangzhou, a rebuilt internet access layer with dual‑core BGP, and a transition from a four‑core to a dual‑core DCN design to improve scalability and manageability.

2. Monitoring Design Considerations

2.1 Define Monitoring Goals

Determine what “good” network performance means.

Accurately detect anomalies on core metrics.

Rapidly classify issues and trigger appropriate responses.

2.2 Define “Good” Network Standards

Network health must be judged from the user’s perspective, focusing on service availability rather than merely device status.

2.3 Effective Perception Methods

Adopt black‑box monitoring that simulates user experience while still leveraging white‑box data, prioritizing the most severe and frequent faults.
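As a rough illustration of the black-box idea, the sketch below measures what a client would experience (can a TCP connection be opened, and how long does it take) instead of reading device state. The function names `tcp_probe` and `availability` are hypothetical and not taken from JD.com's tooling; this is a minimal sketch that assumes a plain TCP connect is an acceptable stand-in for a user request.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Try to open a TCP connection the way a client would.

    Returns the connect time in milliseconds, or None if the attempt
    failed or timed out. (Hypothetical helper for illustration.)
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def availability(results):
    """Fraction of probes that succeeded; None entries count as failures."""
    if not results:
        raise ValueError("no probe results to summarize")
    succeeded = sum(1 for r in results if r is not None)
    return succeeded / len(results)
```

A scheduler would call `tcp_probe` against service endpoints at a fixed interval and feed the collected results to `availability`. White-box data such as interface counters can then be consulted to explain a drop that the probes have already detected.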

2.4 Incident Handling and Decision Mechanism

Distinguish between self‑healing issues and those requiring manual intervention, and establish clear escalation procedures.
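One way to picture such a decision mechanism is a small triage function. Everything in this sketch is an assumption made for illustration: the `Alert` fields, the three `Action` values, and the 15-minute escalation window are hypothetical and do not describe JD.com's actual rules.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OBSERVE = "observe"          # expected to self-heal; record and watch
    PAGE_ONCALL = "page_oncall"  # needs manual intervention
    ESCALATE = "escalate"        # manual handling has run too long


@dataclass(frozen=True)
class Alert:
    kind: str           # e.g. "link_flap" (hypothetical label)
    redundant: bool     # True if a backup path is carrying the traffic
    user_impact: bool   # True if black-box probes show users are affected
    minutes_open: int   # how long the alert has been active


def triage(alert, escalate_after_min=15):
    """Map an alert to an action using assumed, illustrative rules."""
    if alert.user_impact:
        # User-visible problems always need a human; escalate if they drag on.
        if alert.minutes_open >= escalate_after_min:
            return Action.ESCALATE
        return Action.PAGE_ONCALL
    if alert.redundant:
        # No user impact and redundancy absorbed the fault: watch only.
        return Action.OBSERVE
    return Action.PAGE_ONCALL
```

Keeping the rules in one function makes the escalation policy explicit and easy to review, which is the point of defining the procedure in advance.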

3. JD.com Monitoring Practices

3.1 Preparation

Deploy AAA for device access control, NTP for time synchronization, SNMP for data collection, and Syslog for post-event analysis. Maintain a CMDB that includes a manually curated inventory of critical interfaces.

3.2 Core Monitoring

Track real‑time traffic on internet exits, POD uplinks, and DCI links, as well as 24‑hour peaks, traffic ratios, Syslog/drop/CRC totals, application performance alerts, and overall device health.
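Link utilization and the 24-hour peak are typically derived from periodic reads of interface octet counters. The sketch below assumes 64-bit counters such as IF-MIB's `ifHCInOctets` collected over SNMP. The function names and the sampling scheme are illustrative assumptions, not a description of JD.com's collector.

```python
COUNTER64_MOD = 2 ** 64  # 64-bit SNMP counters wrap back to 0 at this value


def utilization(prev_octets, curr_octets, interval_s, speed_bps):
    """Fraction of link capacity used between two counter reads.

    The modulo handles a single counter wrap. A device reboot also
    resets the counter and would be misread as a wrap, so a real
    collector should check sysUpTime as well (not shown here).
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    delta_octets = (curr_octets - prev_octets) % COUNTER64_MOD
    return delta_octets * 8 / (interval_s * speed_bps)


def peak(samples):
    """Highest utilization in a window, e.g. the last 24 hours of samples."""
    return max(samples)
```

For example, 1,250,000,000 octets received in 10 seconds on a 10 Gbps link is 1 Gbps of traffic, which `utilization` reports as 0.1.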

3.3 Internet Quality Cases

Examples show ISP‑specific outages, high utilization on specific internet exits, and spikes in Syslog alerts, illustrating how visual dashboards help pinpoint problems quickly.

3.4 DCN Quality Cases

Pingmesh-style black-box monitoring measures packet loss and latency inside the data center, uncovering problems in parts of the network that were previously assumed to be stable.
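The aggregation step of a Pingmesh-style system can be sketched as follows: probe results between server pairs are grouped, and each pair gets a loss rate and a tail latency. The input format, the function names, and the thresholds (1% loss, 5 ms p99) are assumptions chosen for illustration only.

```python
from collections import defaultdict
from math import ceil


def aggregate(probes):
    """Summarize probes given as (src, dst, rtt_ms) tuples.

    rtt_ms is None when the probe was lost. Returns a dict keyed by
    (src, dst) with the loss rate and nearest-rank p99 latency.
    """
    sent = defaultdict(int)
    rtts = defaultdict(list)
    for src, dst, rtt in probes:
        sent[(src, dst)] += 1
        if rtt is not None:
            rtts[(src, dst)].append(rtt)

    report = {}
    for pair, n in sent.items():
        ok = sorted(rtts[pair])
        lost = n - len(ok)
        # Nearest-rank percentile over the probes that were answered.
        p99 = ok[ceil(0.99 * len(ok)) - 1] if ok else None
        report[pair] = {"loss": lost / n, "p99_ms": p99}
    return report


def unhealthy(report, max_loss=0.01, max_p99_ms=5.0):
    """Pairs whose loss or tail latency exceeds the assumed thresholds."""
    return sorted(
        pair
        for pair, s in report.items()
        if s["loss"] > max_loss or s["p99_ms"] is None or s["p99_ms"] > max_p99_ms
    )
```

Because every pair is measured the same way, a pair that stands out in this report points at a specific path inside the data center, even when no device has raised an alarm.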

4. Future Outlook

Monitoring will evolve from simple fault detection to an automation‑enabling platform that frees engineers from repetitive analysis, improves network availability, and supports large‑scale operations. Emphasis will shift toward internet quality improvements and deeper insight into data‑center network health.
