How JD Logistics Built a 3‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime
This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.
1. Challenges
Geographically dispersed warehouses – Over 700 warehouses nationwide and additional international sites require stringent real‑time monitoring.
Massive number of machines and applications – More than 3 million monitoring items across data centers and warehouse systems.
Complex network environment – Diverse regional and network conditions with intricate inter‑service calls.
Varied monitoring objects – Includes system resources, call chains, logs, and business metrics.
Frequent asset changes – Cloud‑native elastic scaling leads to rapid changes in assets and relationships.
Inconsistent deployment environments – Different tech stacks and languages cause heterogeneous configurations.
Platformization, Dataization, Intelligence
Platformization builds CMDB, monitoring, and DevOps foundations to shift from manual to automated operations. Dataization stores asset, monitoring, alarm, and incident data in a big‑data platform for mining and machine learning. Intelligence applies AI algorithms to extract value from the digitized data.
2. Large‑Scale Monitoring System Solution
Understanding Monitoring
Monitoring provides complete controllability of data centers and business systems, covering performance, capacity, delivery, change, efficiency, production, security, and fault control. A robust monitoring platform enables early fault detection, automated mitigation, and even predictive prevention.
Monitoring Operations Planning
The planning is divided into four layers: Resource layer (physical, virtual, applications, middleware, databases), Platform layer (CMDB, ITSM, event/issue management), Data layer (big‑data processing for trend analysis and reporting), and AIOps layer (problem discovery, resolution, and avoidance).
Reliable CMDB Construction
To cope with frequent asset changes, five mechanisms are employed: automatic discovery, business‑interface notifications, scheduled synchronization, workflow‑driven change processes, and manual adjustments for exceptional cases.
Technical Architecture
Monitoring data flows through four stages: collection (active protocol‑based and passive agent‑push), analysis, decision, and handling. Agents push data to a Kafka queue; a consumer writes to an internal distributed in‑memory queue (jimdb) to avoid Kafka partition I/O bottlenecks. A heartbeat service detects lost agents.
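The heartbeat service mentioned above can be sketched in a few lines: each agent periodically reports in, and any agent silent for longer than a timeout is flagged as lost. This is a minimal illustration, not JD's actual implementation; the class name and 30‑second timeout are assumptions.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat from each agent and flag those gone silent.

    Illustrative sketch: in the real system, lost-agent events would feed
    the alarm pipeline rather than be polled like this.
    """

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_seen = {}  # agent_id -> timestamp of last heartbeat

    def beat(self, agent_id, now=None):
        """Record a heartbeat from an agent (``now`` injectable for testing)."""
        self.last_seen[agent_id] = time.time() if now is None else now

    def lost_agents(self, now=None):
        """Return agents whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]
```

A push model like this is the mirror image of the passive agent‑push data path: data silence itself becomes a signal.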
Compatibility Design
All existing monitoring systems (log, Docker, MDC, DBS, UMP, tracing) are unified into a single data hub, enabling comprehensive analysis and root‑cause tracing.
Anomaly Detection
Five detection methods are discussed: comparison with the previous data point, day‑over‑day comparison, month‑over‑month comparison, baseline‑based detection, and prediction‑based detection using algorithms such as linear regression, LSTM, decision trees, random forests, neural networks, and Holt‑Winters triple exponential smoothing.
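Of the prediction algorithms listed, Holt‑Winters is compact enough to sketch in pure Python. The additive variant below maintains level, trend, and seasonal components; the smoothing parameters are illustrative defaults, not values from the article.

```python
def holt_winters_forecast(series, season_len, alpha=0.5, beta=0.3,
                          gamma=0.4, horizon=1):
    """Additive Holt-Winters triple exponential smoothing.

    Requires at least two full seasons of data for initialization.
    Returns ``horizon`` forecast values past the end of ``series``.
    """
    # Initialize level from the first season, trend from the average
    # per-step change between the first two seasons.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len])
             - sum(series[:season_len])) / season_len ** 2
    seasonals = [series[i] - level for i in range(season_len)]

    for i, x in enumerate(series):
        s = seasonals[i % season_len]
        last_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[i % season_len] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [level + (h + 1) * trend + seasonals[(n + h) % season_len]
            for h in range(horizon)]
```

In a monitoring context, an alert fires when the observed metric deviates from the forecast by more than a tolerance band.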
Call‑Chain Monitoring
Based on Pinpoint, the Jtrace distributed tracing system offers seven capabilities: distributed transaction tracing, automatic topology discovery, horizontal scalability, code‑level visibility, bytecode instrumentation, SQL advisory, and adaptive sampling.
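The article does not publish Jtrace's adaptive sampling policy, so the following is only a plausible sketch of the idea: sample every trace while traffic is low, then throttle toward a target rate as volume grows. The class name and target value are assumptions.

```python
import random
import time

class AdaptiveSampler:
    """Sample all traces under a per-second budget, then a shrinking fraction.

    Hypothetical sketch of adaptive sampling; Jtrace's real policy may
    differ (e.g. weighting by endpoint or error status).
    """

    def __init__(self, target_per_second=100):
        self.target = target_per_second
        self.window_start = 0.0
        self.count = 0

    def should_sample(self, now=None):
        now = time.time() if now is None else now
        if now - self.window_start >= 1.0:  # start a fresh one-second window
            self.window_start, self.count = now, 0
        self.count += 1
        if self.count <= self.target:
            return True
        # Over budget: keep a random fraction that shrinks as volume grows,
        # so sampled throughput stays near the target.
        return random.random() < self.target / self.count
```

The payoff is bounded storage and collector load even during traffic spikes, at the cost of losing some traces under heavy load.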
Event Processing Engine
Alarms are routed to a rule engine that matches learned or preset rules, then to an execution engine that handles events either manually (for high‑risk cases) or automatically, with permission checks and audit logging.
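The routing logic above can be sketched as a first‑match rule table with a manual/automatic split for high‑risk actions. The rule shape, action names, and risk list below are hypothetical, chosen only to illustrate the flow.

```python
# Actions considered too risky to execute without a human in the loop
# (illustrative list, not from the article).
HIGH_RISK_ACTIONS = {"failover", "restart-database"}

def route_alarm(alarm, rules):
    """Match an alarm against preset/learned rules and pick an execution mode.

    ``alarm`` is a dict of attributes; each rule is
    ``{"match": {attr: value, ...}, "action": name}``. First match wins.
    """
    for rule in rules:
        if all(alarm.get(k) == v for k, v in rule["match"].items()):
            action = rule["action"]
            mode = "manual" if action in HIGH_RISK_ACTIONS else "auto"
            return {"action": action, "mode": mode}
    # No rule matched: fall back to paging a human.
    return {"action": "notify-oncall", "mode": "manual"}
```

A production engine would add the permission checks and audit logging the article mentions around each execution.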
Intelligent Knowledge Base
Historical alarm data is continuously mined to build a self‑learning knowledge base that supports fault localization, automated handling, and serves as a knowledge source for operation bots.
Best Practices for AIOps‑Enabled Monitoring
Fault Snapshot
When an anomaly triggers an alarm, a snapshot captures CPU‑intensive threads, JVM heap stacks, network, and disk I/O information, plus metric trends before and after the event, stored locally to avoid network overload.
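The snapshot pattern is straightforward to sketch: run a set of diagnostic collectors, tolerate individual failures, and persist the result to local disk so the snapshot itself does not add network load. The collector names and file layout here are illustrative.

```python
import json
import time
from pathlib import Path

def take_snapshot(collectors, out_dir="."):
    """Run each diagnostic collector and write the results to a local file.

    ``collectors`` maps a name (e.g. "cpu_threads", "heap") to a zero-arg
    callable returning JSON-serializable data. A failing collector is
    recorded rather than allowed to abort the snapshot.
    """
    snap = {"ts": time.time()}
    for name, fn in collectors.items():
        try:
            snap[name] = fn()
        except Exception as e:  # one bad probe must not lose the whole snapshot
            snap[name] = f"collector failed: {e}"
    path = Path(out_dir) / f"fault-snapshot-{int(snap['ts'])}.json"
    path.write_text(json.dumps(snap, indent=2))
    return path
```

In a JVM shop, the collectors would wrap tools like thread and heap dumps plus recent metric windows, as the article describes.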
Network Detection Model
A hierarchical network‑latency topology is built by pinging nodes and iteratively adding leaf nodes based on latency thresholds, automatically rebuilding when network conditions change.
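The iterative leaf insertion can be sketched as: attach each new node under whichever existing node it is closest to by measured latency, falling back to the root when everything is too far away. `latency(a, b)` stands in for a ping measurement; the greedy single‑pass strategy and threshold are simplifying assumptions.

```python
def build_latency_tree(root, nodes, latency, threshold=50.0):
    """Build a latency-based topology by iterative leaf insertion.

    ``latency(a, b)`` returns the measured round-trip time between two
    nodes (abstracting a ping). Each node becomes a child of the closest
    node already in the tree if that latency is within ``threshold``;
    otherwise it hangs directly off the root.
    """
    tree = {root: []}
    for node in nodes:
        parent, best = root, latency(root, node)
        for existing in list(tree):
            d = latency(existing, node)
            if d < best:
                parent, best = existing, d
        if best <= threshold:
            tree.setdefault(parent, []).append(node)
        else:
            tree[root].append(node)  # too far from everything: park at root
        tree.setdefault(node, [])
    return tree
```

Rebuilding when conditions change, as the article notes, amounts to re‑running this construction when observed latencies drift from the recorded tree.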
Trend Prediction
LSTM models (implemented with TensorFlow, learning rate 0.0001, MSE loss, Adam optimizer, 85% training split) predict metric trends; deviations beyond predicted bounds generate alerts and guide capacity planning.
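The alerting side of trend prediction is independent of the model choice: compare each new point against a predicted band and alert on deviation. The sketch below substitutes a moving‑average predictor with a standard‑deviation band for the LSTM, purely to keep the example self‑contained; window size and band width are assumptions.

```python
def predict_band(history, window=12, k=3.0):
    """Return (lower, upper) bounds for the next point.

    Stand-in predictor: moving average of the last ``window`` points,
    banded at +/- ``k`` standard deviations. The article's system uses
    an LSTM here; only the band-check logic is the point of this sketch.
    """
    recent = history[-window:]
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    std = var ** 0.5
    return mean - k * std, mean + k * std

def check_point(history, value, window=12, k=3.0):
    """Return None if the value is within the predicted band, else an alert."""
    lo, hi = predict_band(history, window, k)
    if lo <= value <= hi:
        return None
    return f"alert: {value} outside [{lo:.2f}, {hi:.2f}]"
```

The same predicted trajectory doubles as a capacity‑planning input: extrapolating the band forward shows when a metric will cross a resource limit.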
3. Planning & Outlook
The roadmap emphasizes deeper AI‑algorithm integration, product focus on business value, and organizational enhancements by adding dedicated AI engineers to empower intelligent operations.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career as we grow together.