
How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Efficient Ops

Architecture

Sunfire processes TB‑level logs per minute using a three‑role model: Brain, Reduce and Map. ConfigDB stores the monitoring items. Brain generates a topology from them and installs it on Reduce, which splits the work into Map tasks that pull logs from the agents.
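The division of labour between the three roles can be sketched as follows. This is a minimal illustration, not Sunfire's implementation: the class names, `build_topology`, `agents_per_map` and the `pull` callback are all hypothetical, and the "computation" is reduced to counting log lines.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MapTask:
    item_id: str
    agents: List[str]  # agents this Map task pulls from


@dataclass
class ReduceTask:
    item_id: str
    map_tasks: List[MapTask] = field(default_factory=list)


def build_topology(item_id: str, agents: List[str], agents_per_map: int = 2) -> ReduceTask:
    """Brain step (assumed): one Reduce task per item, fanned out into Map tasks."""
    reduce_task = ReduceTask(item_id)
    for i in range(0, len(agents), agents_per_map):
        reduce_task.map_tasks.append(MapTask(item_id, agents[i:i + agents_per_map]))
    return reduce_task


def run_map(task: MapTask, pull: Callable[[str], List[str]]) -> int:
    """Map step: pull log lines from each assigned agent and count them."""
    return sum(len(pull(agent)) for agent in task.agents)


def run_reduce(task: ReduceTask, pull: Callable[[str], List[str]]) -> int:
    """Reduce step: merge the partial results of all Map tasks."""
    return sum(run_map(m, pull) for m in task.map_tasks)
```

With three agents and `agents_per_map=2`, the topology holds two Map tasks, and `run_reduce` returns the total number of lines pulled across all of them.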

Traditional log monitoring

Typical pipelines use agents to push log increments to Kafka, then stream engines such as Flink/JStorm consume the data, perform multi‑step processing and finally store results in a database, which introduces latency in alert generation.

Key innovations

Preload: tasks are registered in advance, allowing the system to detect and mask faulty agents before real execution, which reduces delay.

Pull model: the server controls data collection, can decide whether to retry or abandon a request, and ensures all data is processed within a bounded time.

Zero‑copy log transfer: the agent sends log data without extra buffer copies, which keeps its CPU overhead low.

Dynamic binary search: the agent finds the log lines for a given timestamp without user‑specified positions, keeping CPU usage under 8% even in extreme cases.
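One way to read the binary-search idea is a search over byte offsets in a log file whose lines are ordered by timestamp. The sketch below is an assumption about the technique, not Sunfire's agent code: `find_offset`, `line_start_at_or_after` and the `parse_ts` callback are hypothetical names, and each line is assumed to begin with an integer timestamp.

```python
import io
from typing import BinaryIO, Callable


def line_start_at_or_after(f: BinaryIO, pos: int) -> int:
    """Offset of the first line that starts at or after byte `pos`."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume the rest of the line that contains byte pos - 1
    return f.tell()


def find_offset(f: BinaryIO, size: int, target: int,
                parse_ts: Callable[[bytes], int]) -> int:
    """Offset of the first line with timestamp >= target, or `size` if none."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start_at_or_after(f, mid)
        if start >= size:
            hi = mid  # no line starts at or after mid
            continue
        f.seek(start)
        if parse_ts(f.readline()) >= target:
            hi = mid
        else:
            lo = mid + 1
    return line_start_at_or_after(f, lo)


def first_field_as_int(line: bytes) -> int:
    """Assumed log format: '<integer timestamp> <message>'."""
    return int(line.split(b" ", 1)[0])
```

Each probe reads a single line, so the number of reads grows with the logarithm of the file size rather than with the number of lines, which is consistent with the low CPU usage the article reports.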

Scale and challenges

More than 80 tenants share over 6,000 machines and generate more than 3 TB of logs per minute. The challenges are achieving second‑level monitoring, minimizing monitoring overhead, handling machine failures, and guaranteeing accuracy.

Reliability mechanisms

Brain continuously monitors Reduce and Map tasks, retries failed nodes, and employs self‑protection logic that caps resource consumption per monitoring item. The system also tracks completeness metrics to indicate how many agents successfully delivered logs.
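The completeness metric and the per-item cap can be illustrated with two small helpers. Both are hypothetical: the article does not give the exact formula or the resource being capped, so the sketch assumes completeness is the fraction of expected agents that delivered logs, and that the cap is a line budget per monitoring item.

```python
from typing import Iterable


def completeness(expected_agents: Iterable[str], delivered_agents: Iterable[str]) -> float:
    """Fraction of expected agents that delivered logs (assumed definition)."""
    expected = set(expected_agents)
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    return len(expected & set(delivered_agents)) / len(expected)


def within_budget(lines_processed: int, max_lines_per_item: int = 1_000_000) -> bool:
    """Self-protection check: True while an item stays under its (assumed) line budget."""
    return lines_processed <= max_lines_per_item
```

A result computed from three of four expected agents would carry a completeness of 0.75, letting a reader of the chart or alert judge how far to trust it.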

Technical choices

The pull‑based architecture keeps all computation on the server side, uses a lightweight custom framework inspired by Akka, stores results in HBase (with HiTSDB under evaluation), and relies on self‑operated components to avoid external dependencies.
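Because the server drives collection, it can enforce a time budget and decide per agent whether to retry or give up. The sketch below shows that control flow under stated assumptions: `pull_all`, `fetch`, `budget_seconds` and `max_attempts` are hypothetical names, failures are modelled as `OSError`, and agents are polled sequentially for clarity, whereas a production system would pull concurrently.

```python
import time
from typing import Callable, Dict, List, Tuple


def pull_all(agents: List[str],
             fetch: Callable[[str], List[str]],
             budget_seconds: float = 1.0,
             max_attempts: int = 2,
             clock: Callable[[], float] = time.monotonic
             ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Pull logs from each agent, retrying on failure until the time budget runs out."""
    deadline = clock() + budget_seconds
    results: Dict[str, List[str]] = {}
    abandoned: List[str] = []
    for agent in agents:
        for _ in range(max_attempts):
            if clock() >= deadline:
                break  # budget exhausted: stop retrying
            try:
                results[agent] = fetch(agent)
                break
            except OSError:
                continue  # transient failure: try again if attempts remain
        if agent not in results:
            abandoned.append(agent)
    return results, abandoned
```

The list of abandoned agents is what would feed a completeness figure: the round always finishes within the budget, and the result records which agents are missing instead of waiting for them.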

Future direction

Four pillars guide development: standardization through a unified query language (MQL), integration of change detection with host and network monitoring, service orientation with one‑stop alert handling in DingTalk, and intelligence, such as smart baselines that auto‑generate alert thresholds. MQL queries take forms such as the following:

<code>select avg(cpu.util),max(load.load1) from system where app='AppTest' since 30m
select * from sunfire.1005_SM_13 since 30m
select * from spring filter class='classA' and method='methodB' where ip='192.168.1.1' since 1h</code>
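The article does not say how smart baselines derive their thresholds. As a purely illustrative assumption, one simple scheme sets the threshold a fixed number of standard deviations above the historical mean; `baseline_threshold` and `k` are hypothetical names.

```python
import statistics
from typing import Sequence


def baseline_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Alert threshold = mean + k * population standard deviation (assumed scheme)."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    return mean + k * spread


def should_alert(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """True when the new value exceeds the auto-generated threshold."""
    return value > baseline_threshold(history, k)
```

A real baseline would also have to account for daily and weekly seasonality, which this sketch ignores.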
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
