Lingjing System: Alibaba's Integrated Hardware‑Software Performance Diagnosis Platform
The Lingjing system, built by Alibaba Infrastructure, provides an end‑to‑end hardware‑software performance diagnosis platform that collects fine‑grained metrics, visualizes data, automatically detects anomalies, and helps optimize resource utilization across complex data‑center stacks.
As internet technologies evolve, data centers are moving from the IT (information technology) era to the DT (data technology) era. This shift places unprecedented demands on large-scale data computation and storage, and it is driving rapid development of both hardware (AEP, AI, FPGA, GPU, CPU, RDMA, etc.) and software architectures (virtualization, containers, etc.).
The deepening stack makes systems increasingly complex. It becomes harder to reach peak performance and harder to tell whether a bottleneck originates in software or in hardware, especially when performance jitter occurs.
The article identifies four major performance issues in modern servers:
Unawareness – Existing business‑level monitoring focuses only on the application layer and cannot perceive performance fluctuations across the deep hardware‑software stack, leaving hardware behavior a black box.
Low hardware resource utilization – Although Alibaba improves utilization by co‑locating multiple workloads on a single physical machine, resource interference among different services hampers further gains.
Problem complexity – The diversity of server hardware and the growing complexity of system software make it difficult to diagnose performance problems without deep expertise and substantial manpower.
Hardware‑software separation – Specialization leads to a gap: software engineers lack hardware knowledge, while hardware engineers are unfamiliar with business requirements, making it hard to design or select optimal hardware for specific workloads.
Lingjing System
To address these challenges, Alibaba Infrastructure has created Lingjing, an integrated performance diagnosis and optimization platform that captures precise low‑level hardware and software metrics, applies powerful data analysis, and provides a full‑stack view of performance.
Key Characteristics
Dataization – Turns previously opaque performance behavior into quantifiable data, exposing the actual source of interference (e.g., cache, memory bandwidth, CPU lock) so that scheduling decisions can be made proactively. Ongoing refinement adds fine‑grained metrics such as AliIPF and AliNVME for storage I/O jitter.
Automatic perception – The platform automatically detects and analyzes performance anomalies, especially jitter, without manual testing.
Platformization – Encapsulates performance‑optimization expertise into an intelligent platform, lowering the barrier for diagnosing and resolving issues.
Hardware‑software coupling – By profiling workloads and applying multiple models, Lingjing creates business portraits that guide custom hardware design, maximizing hardware utilization for specific services.
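As an illustration of the automatic perception idea, jitter detection can be sketched as a comparison of each new sample against a robust baseline built from recent history. This is a minimal sketch, not Lingjing's actual algorithm, which the source does not describe; the function name `detect_jitter`, the window size, and the threshold are all assumptions chosen for the example.

```python
from statistics import median

def detect_jitter(samples, window=20, threshold=3.5):
    """Return indices of samples that deviate strongly from a trailing baseline.

    Illustrative only: uses the median and the median absolute deviation (MAD)
    of the previous `window` samples as a robust baseline.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        med = median(baseline)
        mad = median(abs(x - med) for x in baseline)
        # 1.4826 rescales MAD so it is comparable to a standard deviation
        # for normally distributed data; fall back to a tiny value if MAD is 0.
        scale = 1.4826 * mad or 1e-9
        if abs(samples[i] - med) / scale > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical latency series (ms): a steady pattern followed by one spike.
latencies = [10.0, 10.2, 9.9, 10.1, 10.0] * 4 + [10.1, 25.0, 10.0]
print(detect_jitter(latencies))
```

The median and MAD are used here instead of the mean and standard deviation so that a single spike inside the window does not inflate the baseline and hide the next anomaly.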
Core Components
Data collection – Uses xperf, a user‑space tool developed in‑house at Alibaba, which gathers CPU core PMU data, uncore metrics, memory bandwidth, RDT (Resource Director Technology) data, network traffic, storage I/O, GPU performance, etc., with low overhead and a small memory footprint.
Data middle platform – Leverages Alibaba’s large‑scale data storage and computing platform to unify and process collected metrics, breaking data silos.
Data analysis – Applies various models such as online hardware performance scoring, baseline analysis, offline feature extraction, performance tiering, and business profiling to provide automated alerts, diagnostics, and recommendations for workload scheduling, resource allocation, and hardware customization.
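To make the pipeline from raw counters to a tiering decision more concrete, the sketch below derives two common metrics from PMU‑style counters and compares one of them against a baseline. Everything here is an assumption for illustration: the source does not describe xperf's output schema or Lingjing's scoring models, so the `CounterSample` fields, the `derive_metrics` and `tier` helpers, and the 10% tolerance are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CounterSample:
    # Hypothetical field names; xperf's real output format is not given in the source.
    cycles: int
    instructions: int
    llc_misses: int

def derive_metrics(sample):
    """Turn raw counters into IPC and last-level-cache misses per 1000 instructions."""
    ipc = sample.instructions / sample.cycles
    llc_mpki = 1000 * sample.llc_misses / sample.instructions
    return {"ipc": ipc, "llc_mpki": llc_mpki}

def tier(ipc, baseline_ipc, tolerance=0.10):
    """Classify current IPC relative to a baseline (illustrative thresholds)."""
    ratio = ipc / baseline_ipc
    if ratio >= 1 - tolerance:
        return "normal"
    if ratio >= 1 - 2 * tolerance:
        return "degraded"
    return "severe"

sample = CounterSample(cycles=2_000_000, instructions=3_000_000, llc_misses=6_000)
metrics = derive_metrics(sample)
print(metrics, tier(metrics["ipc"], baseline_ipc=1.6))
```

In a setup like the one described, the baseline would presumably come from historical data for the same workload and hardware model rather than from a fixed constant.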
Lingjing is already deployed across multiple Alibaba businesses, delivering concrete value in problem localization, scheduling optimization, and custom hardware design, while continuously evolving to offer more intelligent performance‑related services.
Alibaba Cloud Infrastructure
For uninterrupted computing services