How Ant Group Solves Client Observability Challenges with CeresDB and AI
This article describes Ant Group's client observability system and the technical difficulties of tracing, logging, and metrics on mobile clients. It then presents the open‑source solutions Ant built to handle massive data volumes and multi‑dimensional analysis, including a custom time‑series database, dimension‑join services, and intelligent alerting.
1. Introduction to Ant Client Observability System
Ant's observability platform covers a vast number of systems. Client observability receives dedicated attention for two main reasons: the increasing complexity of client architectures (including mobile apps and mini‑programs) and the critical impact of client‑side issues on user experience.
The scope includes client availability, app storage performance, white‑screen and crash problems, and power‑consumption monitoring across different app versions.
2. Technical Challenges of Client Observability
Key challenges mirror server‑side observability (Logging, Metrics, Tracing) but are amplified on the client side.
Tracing: Client‑side services and middleware often cannot be linked to backend traces, which reduces the value of the traces.
Logging: Massive log volume from hundreds of millions of installations creates storage, processing, and version‑matching difficulties.
Metrics: Aggregated metrics must handle version differences and the common “dimension explosion” problem.
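To make the "dimension explosion" concrete, the following minimal sketch multiplies out per-tag cardinalities to get the worst-case number of distinct series for a single metric. The tag names and counts are hypothetical and chosen only for illustration; they are not figures from Ant.

```python
from math import prod

# Hypothetical per-tag cardinalities; none of these numbers come from the article.
tag_cardinalities = {
    "app_version": 50,
    "os_version": 40,
    "device_model": 2000,
    "province": 34,
    "network_type": 5,
}


def worst_case_series(cardinalities):
    """Upper bound on distinct series: the product of all tag cardinalities."""
    return prod(cardinalities.values())


total = worst_case_series(tag_cardinalities)
print(f"worst-case series for one metric: {total:,}")  # 680,000,000
```

Real data rarely hits the full product, but even a small fraction of it is far more than a store tuned for a handful of series per metric is designed for.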
3. Core Technology Sharing
Unlike server logs, client logs are collected through a gateway and sent to SLS, after which they are processed much like server data. The data flow includes point‑in‑time collection, enrichment with version information, distributed caching, and real‑time computation using Spark.
Large‑scale dimension tables and unified services require extensive data alignment, enrichment, and storage in a real‑time database.
Problem 1: Tag Explosion and Multi‑Dimensional Analysis
Solution Part 1 – Dimension Service and Join
Standard time‑series tools focus on single‑metric trends and lack multi‑dimensional join capabilities. Ant introduced a join layer that combines external data, producing composite dimensions such as app.province.area.
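The join step can be sketched as a lookup against a dimension table followed by construction of the composite key. This is a minimal illustration only: the `DIMENSIONS` table, the `city` join key, and the `enrich` helper are hypothetical names, not part of Ant's dimension service.

```python
# Hypothetical dimension table: maps a city code to province and area.
DIMENSIONS = {
    "hz": {"province": "Zhejiang", "area": "East"},
    "cd": {"province": "Sichuan", "area": "Southwest"},
}


def enrich(point, dimensions, key="city"):
    """Return a copy of the point with dimension attributes joined in."""
    extra = dimensions.get(point.get(key), {})
    enriched = {**point, **extra}
    # Build the composite dimension; missing attributes fall back to "unknown".
    parts = [enriched.get(name, "unknown") for name in ("app", "province", "area")]
    enriched["app.province.area"] = ".".join(parts)
    return enriched


points = [
    {"app": "wallet", "city": "hz", "crashes": 3},
    {"app": "wallet", "city": "xx", "crashes": 1},  # no matching dimension row
]
enriched_points = [enrich(p, DIMENSIONS) for p in points]
for p in enriched_points:
    print(p["app.province.area"], p["crashes"])
```

Keeping unmatched rows with an "unknown" value, as in a left join, avoids silently dropping data when the dimension table lags behind the clients.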
Solution Part 2 – Analytic Time‑Series Database CeresDB
To handle tag explosion, Ant built CeresDB with columnar storage and partition pruning. Data is stored in day or hour segments, so a query can skip every segment that falls outside its time range.
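Time-based pruning can be illustrated with a short sketch that maps a query range to hourly segment keys and reads only the stored segments that match. The `YYYYMMDDHH` key format and the function names are assumptions made for this example, not CeresDB's actual storage layout.

```python
from datetime import datetime, timedelta


def hour_segments(start, end):
    """Hourly segment keys overlapping the half-open range [start, end)."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cursor < end:
        keys.append(cursor.strftime("%Y%m%d%H"))
        cursor += timedelta(hours=1)
    return keys


def prune(stored_segments, start, end):
    """Keep only the stored segments a query over [start, end) must read."""
    wanted = set(hour_segments(start, end))
    return sorted(key for key in stored_segments if key in wanted)


stored = ["2023010100", "2023010101", "2023010102", "2023010103"]
selected = prune(stored, datetime(2023, 1, 1, 1, 30), datetime(2023, 1, 1, 3, 0))
print(selected)  # ['2023010101', '2023010102']
```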
Solution Part 3 – Compute‑Storage Separation Architecture
CeresDB separates compute from storage, allowing data to be off‑loaded to Kafka or OSS, with WAL for reliability.
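The role of the WAL in a compute-storage separated design can be sketched as follows: a write is appended to a durable log outside the compute node before it is applied in memory, so a replacement node can rebuild state by replaying the log. This is a generic write-ahead-log illustration; `RemoteLog` and `ComputeNode` are hypothetical classes, and the in-memory list merely stands in for a remote durable store.

```python
class RemoteLog:
    """Stand-in for a durable remote log; here it is only an in-memory list."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)


class ComputeNode:
    """Stateless apart from its memtable, which can be rebuilt from the log."""

    def __init__(self, wal):
        self.wal = wal
        self.memtable = {}

    def write(self, key, value):
        self.wal.append((key, value))  # log first, so the write survives a crash
        self.memtable[key] = value     # then apply it in memory

    def recover(self):
        """Rebuild the memtable by replaying the log in order."""
        self.memtable = {}
        for key, value in self.wal.entries:
            self.memtable[key] = value


wal = RemoteLog()
node = ComputeNode(wal)
node.write("crash_count", 3)
node.write("crash_count", 4)

# Simulate losing the node: a fresh one replays the shared log.
replacement = ComputeNode(wal)
replacement.recover()
print(replacement.memtable)  # {'crash_count': 4}
```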
Solution Part 4 – Query Performance Optimization
Performance is improved through horizontal partition tables, compute‑storage separation, and caching strategies that reduce first‑query latency and ensure consistent query speed.
Solution Part 5 – Additional Optimizations
Further tweaks include parallel remote I/O, multi‑level caching, and background task tuning to avoid performance spikes.
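A minimal sketch of how multi-level caching and parallel remote I/O fit together: lookups try memory first, then a local tier, and only the remaining keys are fetched from remote storage, concurrently. The `TieredReader` class is hypothetical, and plain dictionaries stand in for the local disk cache and the remote object store.

```python
from concurrent.futures import ThreadPoolExecutor


class TieredReader:
    """Memory -> local tier -> remote store, with parallel remote fetches."""

    def __init__(self, remote_store, max_workers=4):
        self.memory = {}
        self.local_disk = {}  # stand-in for an on-disk cache
        self.remote = remote_store
        self.remote_reads = 0
        self.max_workers = max_workers

    def _fetch_remote(self, key):
        return key, self.remote[key]

    def read_many(self, keys):
        result, missing = {}, []
        for key in keys:
            if key in self.memory:
                result[key] = self.memory[key]
            elif key in self.local_disk:
                # Promote to the faster tier on a local hit.
                self.memory[key] = result[key] = self.local_disk[key]
            else:
                missing.append(key)
        if missing:
            # Issue all remote reads concurrently instead of one by one.
            with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
                for key, value in pool.map(self._fetch_remote, missing):
                    self.local_disk[key] = self.memory[key] = result[key] = value
            self.remote_reads += len(missing)
        return result


remote = {"seg-1": b"a", "seg-2": b"b", "seg-3": b"c"}
reader = TieredReader(remote)
first = reader.read_many(["seg-1", "seg-2"])            # two remote reads
second = reader.read_many(["seg-1", "seg-2", "seg-3"])  # one more remote read
print(reader.remote_reads)  # 3
```

The second call touches remote storage only for the one segment that is not yet cached, which is the effect the caching strategies aim for on repeated queries.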
Problem 2: Massive Diverse Observed‑Entity Alerts
Solution 1 – Intelligent Alerting
Ant adopts a three‑layer architecture: algorithm routing, detection, and noise reduction. Routing selects appropriate algorithms based on data characteristics, avoiding unnecessary computation.
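The routing layer can be pictured as a cheap feature check that decides which detector, if any, should run on a series. The sketch below is an assumption-laden illustration: the features (history length, share of zeros, coefficient of variation), the cutoffs, and the detector names are all hypothetical, not Ant's routing rules.

```python
from statistics import mean, pstdev


def route(series):
    """Pick a detector name from cheap features of the series."""
    if len(series) < 10:
        return "skip"  # too little history to judge; avoid wasted computation
    zero_ratio = sum(1 for v in series if v == 0) / len(series)
    if zero_ratio > 0.8:
        return "sparse_count"  # mostly-zero series need a count-based check
    avg = mean(series)
    # Coefficient of variation: spread relative to the average level.
    cv = pstdev(series) / abs(avg) if avg else 0.0
    return "static_threshold" if cv < 0.1 else "dynamic_threshold"


print(route([1, 2, 3]))        # skip
print(route([0] * 9 + [5]))    # sparse_count
print(route([100] * 20))       # static_threshold
print(route([10, 200] * 10))   # dynamic_threshold
```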
Solution 2 – Dynamic Threshold Generation
Rules are automatically generated from historical curves, with dynamic thresholds adapting to time‑segment patterns. Feature transformation (e.g., first‑order differencing) helps normalize irregular traffic patterns before rule creation.
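One way to picture rule generation is shown below: take the first-order difference of the historical curve, group the differences by time segment, and derive a band of mean plus or minus k standard deviations per segment. This is a sketch under stated assumptions; the segment labels, the value of `k`, and the band formula are illustrative choices, not the algorithm described by Ant.

```python
from statistics import mean, pstdev


def build_rules(samples, k=3.0):
    """samples: time-ordered (segment, value) pairs.

    Returns {segment: (low, high)} bounds on the first-order difference.
    Differences that span a segment boundary are skipped.
    """
    by_segment = {}
    for (prev_seg, prev), (seg, cur) in zip(samples, samples[1:]):
        if prev_seg == seg:
            by_segment.setdefault(seg, []).append(cur - prev)
    rules = {}
    for seg, diffs in by_segment.items():
        center, spread = mean(diffs), pstdev(diffs)
        rules[seg] = (center - k * spread, center + k * spread)
    return rules


def is_anomalous(rules, segment, prev, cur):
    """True when the latest change falls outside the segment's band.

    Raises KeyError for a segment that has no generated rule.
    """
    low, high = rules[segment]
    return not (low <= cur - prev <= high)


history = [
    ("night", 10), ("night", 11), ("night", 10), ("night", 12),
    ("day", 100), ("day", 104), ("day", 98), ("day", 103),
]
rules = build_rules(history)
print(is_anomalous(rules, "day", 100, 110))   # False: +10 is normal by day
print(is_anomalous(rules, "night", 10, 20))   # True: +10 is unusual at night
```

The same change of +10 is accepted in the busy segment and flagged in the quiet one, which is the point of letting thresholds follow time-segment patterns.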
4. Open‑Source Projects and Evolution
Key technologies have been open‑sourced: CeresDB (released in June last year) and HoloInsight (released in February this year). The internal Ant version of CeresDB matches the open‑source release, enabling straightforward internal deployment.