Building a Production‑Grade Observability System for Alibaba Cloud ACK Container Service

The presentation outlines the observability framework of Alibaba Cloud's ACK container service. It covers the architecture and key capabilities, including eBPF‑based tracing, GPU profiling, network diagnostics, storage monitoring, and FinOps integration, and shows how these features support AI workloads, large‑scale production stability, and automated incident response.

Alibaba Cloud Infrastructure

In this talk, Feng Shichun, the observability lead of Alibaba Cloud Container Service, introduces the production‑grade observability system built for ACK (Alibaba Cloud Container Service for Kubernetes) and shares practical experience from recent high‑profile deployments such as the 2024 Paris Olympics.

He starts with recent CNCF statistics showing that Kubernetes adoption in production rose from 76% in 2022 to 89% in 2023, establishing Kubernetes as the de facto standard for cloud‑native workloads.

The core message is that observability is a critical pillar of a mature infrastructure team, because it is what makes complex containerized systems reliable to operate. He also cites analyst recognition: Gartner's 2024 Magic Quadrant named Alibaba Cloud the sole Asian leader, and Forrester rated it on par with Google in the public‑cloud container market.

The presentation is organized into three parts: (1) an overview of the ACK observability architecture, (2) recent advances in key scenarios such as AI workloads, container networking, and storage, and (3) how observability data drives FinOps and AIOps initiatives.

Key use cases include:

Ensuring business‑critical services run smoothly on ACK clusters, illustrated by the “bullet‑time” performance boost for Olympic online systems.

Performance tuning for large‑scale clusters by exposing transparent container‑layer metrics and control‑plane health.

End‑to‑end fault diagnosis using alerts, Prometheus metrics, SLS logs, distributed tracing, and code‑level profiling to achieve low MTTR.

For AI scenarios, ACK provides GPU monitoring dashboards, automatic bad‑GPU detection, cost analysis, and an eBPF‑based GPU profiling tool that pinpoints bottlenecks in PyTorch jobs.
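The idea behind automatic bad‑GPU detection can be illustrated with a minimal sketch. The talk does not describe ACK's actual detection logic, so the heuristic here is an assumption: a GPU is flagged when it reports any driver errors, any uncorrectable ECC errors, or a temperature above a threshold. The `GpuSample` fields, the `find_bad_gpus` function, and the 90 °C limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One monitoring sample for a single GPU (hypothetical schema)."""
    node: str
    gpu_index: int
    driver_errors: int        # driver-reported errors in the sampling window
    uncorrectable_ecc: int    # uncorrectable memory errors in the window
    temperature_c: float


def find_bad_gpus(samples: list[GpuSample], max_temp_c: float = 90.0):
    """Return (node, gpu_index, reasons) for every GPU that trips a rule.

    The three rules are illustrative assumptions, not ACK's real criteria.
    """
    bad = []
    for s in samples:
        reasons = []
        if s.driver_errors > 0:
            reasons.append("driver_errors")
        if s.uncorrectable_ecc > 0:
            reasons.append("uncorrectable_ecc")
        if s.temperature_c > max_temp_c:
            reasons.append("overheating")
        if reasons:
            bad.append((s.node, s.gpu_index, reasons))
    return bad


if __name__ == "__main__":
    demo = [
        GpuSample("node-a", 0, 0, 0, 65.0),
        GpuSample("node-a", 1, 3, 0, 70.0),
        GpuSample("node-b", 0, 0, 1, 95.0),
    ]
    for node, idx, reasons in find_bad_gpus(demo):
        print(f"{node} gpu{idx}: {', '.join(reasons)}")
```

In practice a scheduler could use such a list to cordon the affected node or keep new training jobs off the flagged device; that step is outside this sketch.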

Network observability is enhanced by the KubeSkoop toolkit, which leverages eBPF to collect kernel‑level network data, offering one‑click diagnostics, historical traffic replay, and full‑mesh topology visualization.

Storage observability is delivered via a CSI‑based solution that integrates Prometheus metrics, K8s events, and logs to monitor disk health, I/O throughput, and capacity, especially for high‑throughput AI training workloads.
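Capacity monitoring of this kind usually ends in a question such as "when will this volume fill up?". The following is a minimal sketch of one way to answer it, assuming a simple linear projection between the oldest and newest usage samples. The function name and the sample format are hypothetical and are not taken from the ACK storage tooling.

```python
from typing import Optional


def days_until_full(capacity_gib: float,
                    samples: list[tuple[float, float]]) -> Optional[float]:
    """Estimate days until a volume is full.

    `samples` is a list of (day, used_gib) pairs sorted by day.
    Uses a straight line through the first and last sample, which is an
    assumption; real usage is rarely perfectly linear.
    Returns None when usage is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    if d1 <= d0:
        raise ValueError("samples must span a positive time range")
    growth_per_day = (u1 - u0) / (d1 - d0)
    if growth_per_day <= 0:
        return None
    remaining = max(capacity_gib - u1, 0.0)
    return remaining / growth_per_day


if __name__ == "__main__":
    # A 100 GiB volume that grew from 40 GiB to 60 GiB over 10 days.
    print(days_until_full(100.0, [(0, 40.0), (10, 60.0)]))
```

An alert rule could fire when the estimate drops below a chosen number of days, which gives more warning than a fixed percentage threshold on fast‑growing training datasets.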

FinOps capabilities aggregate multi‑dimensional cost data down to the pod level, providing waste analysis and optimization recommendations, helping customers reduce resource consumption by up to 25%.
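Pod‑level cost aggregation and waste analysis can be illustrated with a minimal sketch. The allocation model is an assumption, since the talk does not specify one: a node's hourly price is split across pods in proportion to their CPU requests, and "waste" is the cost of requested CPU that a pod does not actually use. The `Pod` type, the `allocate_cost` function, and all prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request: float  # cores requested
    cpu_used: float     # average cores actually used


def allocate_cost(node_hourly_price: float, node_cpu: float,
                  pods: list[Pod]) -> dict[str, dict[str, float]]:
    """Split one node's hourly price across its pods by CPU request.

    Hypothetical model: memory, GPU, storage, and network are ignored.
    """
    price_per_core = node_hourly_price / node_cpu
    report = {}
    for pod in pods:
        unused = max(pod.cpu_request - pod.cpu_used, 0.0)
        report[pod.name] = {
            "cost": pod.cpu_request * price_per_core,
            "waste": unused * price_per_core,
        }
    # Capacity that no pod requested is reported as its own line item.
    requested = sum(p.cpu_request for p in pods)
    idle_cost = max(node_cpu - requested, 0.0) * price_per_core
    report["<unallocated>"] = {"cost": idle_cost, "waste": idle_cost}
    return report


if __name__ == "__main__":
    pods = [Pod("web", 2.0, 1.0), Pod("train", 4.0, 4.0)]
    for name, row in allocate_cost(0.8, 8.0, pods).items():
        print(f"{name}: cost={row['cost']:.2f} waste={row['waste']:.2f}")
```

Summing the `waste` column per namespace or team is one way such a report could turn into the optimization recommendations mentioned above, for example by lowering requests on pods that consistently use far less than they reserve.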

Looking ahead, Alibaba Cloud will launch ACK AI Assistant 2.0, a ChatOps‑enabled assistant powered by Tongyi Qianwen that combines observability data with expert diagnostics to accelerate incident response and provide proactive health checks.

The talk concludes with a recruitment call for developers, SREs, and product managers to join the Alibaba Cloud Container team in Hangzhou, Beijing, and Shenzhen.
