How JD Logistics Built a 3‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime
This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.
1. Challenges
Geographically dispersed warehouses – Over 700 warehouses nationwide and additional international sites require stringent real‑time monitoring.
Massive number of machines and applications – More than 3 million monitoring items across data centers and warehouse systems.
Complex network environment – Diverse regional and network conditions with intricate inter‑service calls.
Varied monitoring objects – Includes system resources, call chains, logs, and business metrics.
Frequent asset changes – Cloud‑native elastic scaling leads to rapid changes in assets and relationships.
Inconsistent deployment environments – Different tech stacks and languages cause heterogeneous configurations.
Platformization, Dataization, Intelligence
Platformization builds CMDB, monitoring, and DevOps foundations to shift from manual to automated operations. Dataization stores asset, monitoring, alarm, and incident data in a big‑data platform for mining and machine learning. Intelligence applies AI algorithms to extract value from the digitized data.
2. Large‑Scale Monitoring System Solution
Understanding Monitoring
Monitoring provides complete controllability of data centers and business systems, covering performance, capacity, delivery, change, efficiency, production, security, and fault control. A robust monitoring platform enables early fault detection, automated mitigation, and even predictive prevention.
Monitoring Operations Planning
The planning is divided into four layers: Resource layer (physical, virtual, applications, middleware, databases), Platform layer (CMDB, ITSM, event/issue management), Data layer (big‑data processing for trend analysis and reporting), and AIOps layer (problem discovery, resolution, and avoidance).
Reliable CMDB Construction
To cope with frequent asset changes, five mechanisms are employed: automatic discovery, business‑interface notifications, scheduled synchronization, workflow‑driven change processes, and manual adjustments for exceptional cases.
Technical Architecture
Monitoring data flows through four stages: collection (active protocol‑based and passive agent‑push), analysis, decision, and handling. Agents push data to a Kafka queue; a consumer writes to an internal distributed in‑memory queue (jimdb) to avoid Kafka partition I/O bottlenecks. A heartbeat service detects lost agents.
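The heartbeat service mentioned above can be sketched in a few lines: each agent periodically reports in, and any agent silent for longer than a timeout is flagged as lost. This is a minimal illustration, not JD's actual implementation; the class name and 30‑second timeout are assumptions.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat from each agent and flag those gone silent.

    Illustrative sketch: in the real system, lost-agent events would feed
    the alarm pipeline rather than be polled like this.
    """

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_seen = {}  # agent_id -> timestamp of last heartbeat

    def beat(self, agent_id, now=None):
        """Record a heartbeat from an agent (``now`` injectable for testing)."""
        self.last_seen[agent_id] = time.time() if now is None else now

    def lost_agents(self, now=None):
        """Return agents whose last heartbeat is older than the timeout."""
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]
```

A push model like this is the mirror image of the passive agent‑push data path: data silence itself becomes a signal.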
Compatibility Design
All existing monitoring systems (log, Docker, MDC, DBS, UMP, tracing) are unified into a single data hub, enabling comprehensive analysis and root‑cause tracing.
Anomaly Detection
Five detection methods are discussed: comparison with the previous data point, day‑over‑day comparison, month‑over‑month comparison, baseline‑based detection, and prediction‑based detection using algorithms such as linear regression, LSTM, decision trees, random forests, neural networks, and Holt‑Winters triple exponential smoothing.
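Of the prediction algorithms listed, Holt‑Winters is compact enough to sketch in pure Python. The additive variant below maintains level, trend, and seasonal components; the smoothing parameters are illustrative defaults, not values from the article.

```python
def holt_winters_forecast(series, season_len, alpha=0.5, beta=0.3,
                          gamma=0.4, horizon=1):
    """Additive Holt-Winters triple exponential smoothing.

    Requires at least two full seasons of data for initialization.
    Returns ``horizon`` forecast values past the end of ``series``.
    """
    # Initialize level from the first season, trend from the average
    # per-step change between the first two seasons.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len])
             - sum(series[:season_len])) / season_len ** 2
    seasonals = [series[i] - level for i in range(season_len)]

    for i, x in enumerate(series):
        s = seasonals[i % season_len]
        last_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[i % season_len] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [level + (h + 1) * trend + seasonals[(n + h) % season_len]
            for h in range(horizon)]
```

In a monitoring context, an alert fires when the observed metric deviates from the forecast by more than a tolerance band.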
Call‑Chain Monitoring
Based on Pinpoint, the Jtrace distributed tracing system offers seven capabilities: distributed transaction tracing, automatic topology discovery, horizontal scalability, code‑level visibility, bytecode instrumentation, SQL advisory, and adaptive sampling.
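The article does not publish Jtrace's adaptive sampling policy, so the following is only a plausible sketch of the idea: sample every trace while traffic is low, then throttle toward a target rate as volume grows. The class name and target value are assumptions.

```python
import random
import time

class AdaptiveSampler:
    """Sample all traces under a per-second budget, then a shrinking fraction.

    Hypothetical sketch of adaptive sampling; Jtrace's real policy may
    differ (e.g. weighting by endpoint or error status).
    """

    def __init__(self, target_per_second=100):
        self.target = target_per_second
        self.window_start = 0.0
        self.count = 0

    def should_sample(self, now=None):
        now = time.time() if now is None else now
        if now - self.window_start >= 1.0:  # start a fresh one-second window
            self.window_start, self.count = now, 0
        self.count += 1
        if self.count <= self.target:
            return True
        # Over budget: keep a random fraction that shrinks as volume grows,
        # so sampled throughput stays near the target.
        return random.random() < self.target / self.count
```

The payoff is bounded storage and collector load even during traffic spikes, at the cost of losing some traces under heavy load.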
Event Processing Engine
Alarms are routed to a rule engine that matches learned or preset rules, then to an execution engine that handles events either manually (for high‑risk cases) or automatically, with permission checks and audit logging.
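The routing logic above can be sketched as a first‑match rule table with a manual/automatic split for high‑risk actions. The rule shape, action names, and risk list below are hypothetical, chosen only to illustrate the flow.

```python
# Actions considered too risky to execute without a human in the loop
# (illustrative list, not from the article).
HIGH_RISK_ACTIONS = {"failover", "restart-database"}

def route_alarm(alarm, rules):
    """Match an alarm against preset/learned rules and pick an execution mode.

    ``alarm`` is a dict of attributes; each rule is
    ``{"match": {attr: value, ...}, "action": name}``. First match wins.
    """
    for rule in rules:
        if all(alarm.get(k) == v for k, v in rule["match"].items()):
            action = rule["action"]
            mode = "manual" if action in HIGH_RISK_ACTIONS else "auto"
            return {"action": action, "mode": mode}
    # No rule matched: fall back to paging a human.
    return {"action": "notify-oncall", "mode": "manual"}
```

A production engine would add the permission checks and audit logging the article mentions around each execution.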
Intelligent Knowledge Base
Historical alarm data is continuously mined to build a self‑learning knowledge base that supports fault localization, automated handling, and serves as a knowledge source for operation bots.
Best Practices for AIOps‑Enabled Monitoring
Fault Snapshot
When an anomaly triggers an alarm, a snapshot captures CPU‑intensive threads, JVM heap stacks, network, and disk I/O information, plus metric trends before and after the event, stored locally to avoid network overload.
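The snapshot pattern is straightforward to sketch: run a set of diagnostic collectors, tolerate individual failures, and persist the result to local disk so the snapshot itself does not add network load. The collector names and file layout here are illustrative.

```python
import json
import time
from pathlib import Path

def take_snapshot(collectors, out_dir="."):
    """Run each diagnostic collector and write the results to a local file.

    ``collectors`` maps a name (e.g. "cpu_threads", "heap") to a zero-arg
    callable returning JSON-serializable data. A failing collector is
    recorded rather than allowed to abort the snapshot.
    """
    snap = {"ts": time.time()}
    for name, fn in collectors.items():
        try:
            snap[name] = fn()
        except Exception as e:  # one bad probe must not lose the whole snapshot
            snap[name] = f"collector failed: {e}"
    path = Path(out_dir) / f"fault-snapshot-{int(snap['ts'])}.json"
    path.write_text(json.dumps(snap, indent=2))
    return path
```

In a JVM shop, the collectors would wrap tools like thread and heap dumps plus recent metric windows, as the article describes.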
Network Detection Model
A hierarchical network‑latency topology is built by pinging nodes and iteratively adding leaf nodes based on latency thresholds, automatically rebuilding when network conditions change.
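The iterative leaf insertion can be sketched as: attach each new node under whichever existing node it is closest to by measured latency, falling back to the root when everything is too far away. `latency(a, b)` stands in for a ping measurement; the greedy single‑pass strategy and threshold are simplifying assumptions.

```python
def build_latency_tree(root, nodes, latency, threshold=50.0):
    """Build a latency-based topology by iterative leaf insertion.

    ``latency(a, b)`` returns the measured round-trip time between two
    nodes (abstracting a ping). Each node becomes a child of the closest
    node already in the tree if that latency is within ``threshold``;
    otherwise it hangs directly off the root.
    """
    tree = {root: []}
    for node in nodes:
        parent, best = root, latency(root, node)
        for existing in list(tree):
            d = latency(existing, node)
            if d < best:
                parent, best = existing, d
        if best <= threshold:
            tree.setdefault(parent, []).append(node)
        else:
            tree[root].append(node)  # too far from everything: park at root
        tree.setdefault(node, [])
    return tree
```

Rebuilding when conditions change, as the article notes, amounts to re‑running this construction when observed latencies drift from the recorded tree.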
Trend Prediction
LSTM models (implemented with TensorFlow, learning rate 0.0001, MSE loss, Adam optimizer, 85% training split) predict metric trends; deviations beyond predicted bounds generate alerts and guide capacity planning.
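The alerting side of trend prediction is independent of the model choice: compare each new point against a predicted band and alert on deviation. The sketch below substitutes a moving‑average predictor with a standard‑deviation band for the LSTM, purely to keep the example self‑contained; window size and band width are assumptions.

```python
def predict_band(history, window=12, k=3.0):
    """Return (lower, upper) bounds for the next point.

    Stand-in predictor: moving average of the last ``window`` points,
    banded at +/- ``k`` standard deviations. The article's system uses
    an LSTM here; only the band-check logic is the point of this sketch.
    """
    recent = history[-window:]
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    std = var ** 0.5
    return mean - k * std, mean + k * std

def check_point(history, value, window=12, k=3.0):
    """Return None if the value is within the predicted band, else an alert."""
    lo, hi = predict_band(history, window, k)
    if lo <= value <= hi:
        return None
    return f"alert: {value} outside [{lo:.2f}, {hi:.2f}]"
```

The same predicted trajectory doubles as a capacity‑planning input: extrapolating the band forward shows when a metric will cross a resource limit.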
3. Planning & Outlook
The roadmap emphasizes deeper AI‑algorithm integration, product focus on business value, and organizational enhancements by adding dedicated AI engineers to empower intelligent operations.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career as we grow together.