
Lingjing System: Alibaba's Integrated Hardware‑Software Performance Diagnosis Platform

The Lingjing system, built by Alibaba Infrastructure, provides an end‑to‑end hardware‑software performance diagnosis platform that collects fine‑grained metrics, visualizes data, automatically detects anomalies, and helps optimize resource utilization across complex data‑center stacks.

Alibaba Cloud Infrastructure

With the rapid evolution of internet technologies, data centers are moving from the IT (information technology) era to the DT (data technology) era. This shift drives unprecedented demand for large‑scale data computation and storage, and it spurs rapid development of both hardware (AEP, AI, FPGA, GPU, CPU, RDMA, etc.) and software architectures (virtualization, containers, etc.).

As the stack deepens, systems become increasingly complex. It gets harder to reach peak performance and to pinpoint whether a bottleneck stems from software or hardware, especially when performance jitter occurs.

Four major performance issues stand out in modern servers:

Lack of visibility – Existing business‑level monitoring covers only the application layer and cannot perceive performance fluctuations across the deep hardware‑software stack, so hardware behavior remains a black box.

Low hardware resource utilization – Although Alibaba improves utilization by co‑locating multiple workloads on a single physical machine, resource interference among different services hampers further gains.

Problem complexity – The diversity of server hardware and the growing complexity of system software make it difficult to diagnose performance problems without deep expertise and substantial manpower.

Hardware‑software separation – Specialization leads to a gap: software engineers lack hardware knowledge, while hardware engineers are unfamiliar with business requirements, making it hard to design or select optimal hardware for specific workloads.

Lingjing System

To address these challenges, Alibaba Infrastructure has created Lingjing, an integrated performance diagnosis and optimization platform that captures precise low‑level hardware and software metrics, applies powerful data analysis, and provides a full‑stack view of performance.

Key Characteristics

Datafication – Turns previously opaque performance behavior into quantifiable data, revealing the true source of interference (e.g., cache, memory bandwidth, CPU lock) and enabling proactive scheduling decisions. The metric set is refined continuously, with fine‑grained additions such as AliIPF and AliNVME for storage I/O jitter.

Automatic perception – The platform automatically detects and analyzes performance anomalies, especially jitter, without manual testing.

Platformization – Encapsulates performance‑optimization expertise into an intelligent platform, lowering the barrier for diagnosing and resolving issues.

Hardware‑software coupling – By profiling workloads and applying multiple models, Lingjing builds workload profiles ("business portraits") that guide custom hardware design, so that hardware can be tailored to, and fully used by, specific services.
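To make the "automatic perception" idea concrete, here is a minimal sketch of jitter detection on a series of cycles‑per‑instruction (CPI) samples. It is not Lingjing's algorithm, which the article does not describe. The function names, the choice of CPI as the metric, and the median/MAD outlier rule with a threshold of 3.5 are all assumptions made for illustration.

```python
from statistics import median


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction, computed from two hardware counter deltas."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def detect_jitter(samples, threshold=3.5):
    """Return the indices of samples that deviate strongly from the median.

    Uses a robust z-score based on the median absolute deviation (MAD),
    which is less distorted by the outliers it is trying to find than a
    mean/standard-deviation rule would be. The threshold is an assumed value.
    """
    samples = list(samples)
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # No measurable spread, so there is nothing to rank samples against.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data is roughly normal.
    return [
        i for i, x in enumerate(samples)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical CPI samples for one service; index 5 is a jitter spike.
cpi_series = [1.00, 1.02, 0.98, 1.01, 0.99, 2.40, 1.00, 1.03]
spikes = detect_jitter(cpi_series)
```

In this sketch only the sample at index 5 is flagged. A production system would presumably work on sliding windows and correlate several metrics before raising an alert.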

Core Components

Data collection – Uses Alibaba’s self‑developed user‑space tool xperf, which gathers CPU core PMU (performance monitoring unit) data, uncore metrics, memory bandwidth, RDT, network traffic, storage I/O, GPU performance, etc., with minimal CPU overhead and memory footprint.

Data middle platform – Leverages Alibaba’s large‑scale data storage and computing platform to unify and process collected metrics, breaking data silos.

Data analysis – Applies various models such as online hardware performance scoring, baseline analysis, offline feature extraction, performance tiering, and business profiling to provide automated alerts, diagnostics, and recommendations for workload scheduling, resource allocation, and hardware customization.
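Baseline analysis and performance tiering can be sketched as follows. This is an illustrative example only: the article does not publish Lingjing's models, so the fleet‑median baseline, the relative‑deviation score, the tier names, the 10% and 25% thresholds, and the host names and latency values are all assumptions.

```python
from statistics import median


def relative_score(value: float, baseline: float) -> float:
    """Fractional deviation from the baseline; positive means above it."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline


def tier(score: float, warn: float = 0.10, critical: float = 0.25) -> str:
    """Map a score to a tier for a metric where higher is worse."""
    if score >= critical:
        return "critical"
    if score >= warn:
        return "degraded"
    return "normal"


def classify_hosts(metric_by_host: dict) -> dict:
    """Tier each host against the median of its peer group."""
    baseline = median(metric_by_host.values())
    return {
        host: tier(relative_score(value, baseline))
        for host, value in metric_by_host.items()
    }


# Hypothetical memory latency (ns) for hosts of the same hardware model.
latency_ns = {
    "host-a": 90, "host-b": 92, "host-c": 88, "host-d": 130, "host-e": 104,
}
tiers = classify_hosts(latency_ns)
```

With these numbers the baseline is 92 ns, so host-d (about 41% above baseline) lands in "critical" and host-e (about 13% above) in "degraded". Using the median rather than the mean keeps a single bad host from shifting the baseline it is judged against.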

Lingjing is already deployed across multiple Alibaba businesses, delivering concrete value in problem localization, scheduling optimization, and custom hardware design, while continuously evolving to offer more intelligent performance‑related services.
