Automated Service Fault Localization System Architecture

The automated service fault localization system ingests large volumes of real-time instrumentation data, builds call-chain graphs, and pinpoints the component that is causing timeouts or other errors. It aims for developer-level accuracy within seconds rather than minutes, while remaining simple and fully automated.

Xianyu Technology

Investigating service issues is routine work for developers, but it is time-consuming, and resolving faults quickly is critical.

The main obstacles are:

Massive alert information.

Complex call chains.

Complicated investigation process.

Reliance on experience.

These challenges can be addressed by building an experience model, that is, a model that captures how experienced developers investigate faults.

Example: an order list service depends on seller, product, and shop services; a timeout on a single host (127.123.12.12) causes the order list to time out.

Key questions include accurately defining timeouts/exceptions, generating upstream/downstream call chains, pinpointing the responsible component, and distinguishing timeout, thread‑pool‑full, or unknown errors.
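To make these questions concrete, here is a minimal Python sketch of the order list example. It is an illustration under stated assumptions, not the system's actual implementation: the `Span` fields, the error strings, the `10.0.0.x` hosts, and the choice of the seller service as the one running on 127.123.12.12 are all hypothetical. The idea is that the responsible component is a failing call with no failing call beneath it, and that its error text decides between timeout, thread-pool-full, and unknown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One call in a call chain (hypothetical shape, for illustration only)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    host: str
    error: Optional[str] = None  # None means the call succeeded


def classify(error: str) -> str:
    """Map a raw error string to one of the three categories named above."""
    text = error.lower()
    if "timeout" in text or "timed out" in text:
        return "timeout"
    if "thread pool" in text and "full" in text:
        return "thread_pool_full"
    return "unknown"


def locate_fault(spans):
    """Return a failing span that has no failing child, or None.

    If several such spans exist, the first one in input order is returned.
    """
    failing = [s for s in spans if s.error is not None]
    parents_of_failures = {s.parent_id for s in failing}
    leaves = [s for s in failing if s.span_id not in parents_of_failures]
    return leaves[0] if leaves else None


spans = [
    Span("1", None, "order-list", "10.0.0.1", "downstream call timed out"),
    Span("2", "1", "seller", "127.123.12.12", "read timeout after 3000 ms"),
    Span("3", "1", "product", "10.0.0.3"),
    Span("4", "1", "shop", "10.0.0.4"),
]

culprit = locate_fault(spans)
print(culprit.service, culprit.host, classify(culprit.error))
# prints: seller 127.123.12.12 timeout
```

The order list span also failed, but it is skipped because one of its children failed too; that is the upstream/downstream distinction the key questions refer to.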

The underlying instrumentation data provided by Alibaba makes these questions answerable; with that data, a fully automated fault localization system becomes feasible.

System Goals

The system must satisfy four goals, which are also its main challenges:

Accuracy (locating as precisely as a developer).

Speed (locating the fault before monitoring alerts fire).

Simplicity (shortest path from detection to result).

Automation.

Four Modules

Data Collection

Collects and reports massive instrumentation data (up to 80 GB/min) with low latency and extensible metrics, using Alibaba Cloud SLS and custom plugins.
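To put the 80 GB/min figure in perspective, the short calculation below converts it to a per-second rate and a daily volume. It assumes decimal units (1 TB = 1000 GB) and that the peak rate is sustained all day, so the daily number is an upper bound rather than a measured value.

```python
GB_PER_MIN = 80  # peak ingest rate quoted in the text

gb_per_second = GB_PER_MIN / 60
# Assumption: the peak holds for a full day, so this is an upper bound.
tb_per_day = GB_PER_MIN * 60 * 24 / 1000  # decimal units: 1 TB = 1000 GB

print(f"{gb_per_second:.2f} GB/s, up to {tb_per_day:.1f} TB/day")
# prints: 1.33 GB/s, up to 115.2 TB/day
```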

Real‑time Computing

Preprocesses the data: links requests by their unique IDs, cleanses the records, and emits events. Challenges: compute latency, multi-source coordination, data cleaning, and storage cost.
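The three preprocessing steps can be sketched in a few lines of Python. This is a batch-style illustration, not the streaming job itself: the record fields (`trace_id`, `elapsed_ms`, `exception`), the 3000 ms timeout threshold, and the `10.0.0.x` hosts are assumptions made for the example.

```python
from collections import defaultdict

REQUIRED = ("trace_id", "service", "host", "elapsed_ms")
TIMEOUT_MS = 3000  # hypothetical threshold, not taken from the article


def cleanse(records):
    """Drop records that lack a required field or carry a negative latency."""
    for r in records:
        if all(r.get(k) is not None for k in REQUIRED) and r["elapsed_ms"] >= 0:
            yield r


def link_by_trace(records):
    """Group records that share the same unique request ID."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["trace_id"]].append(r)
    return dict(grouped)


def emit_events(traces):
    """Emit one event per record that raised an exception or exceeded the threshold."""
    events = []
    for trace_id, recs in traces.items():
        for r in recs:
            if r.get("exception"):
                kind = "exception"
            elif r["elapsed_ms"] > TIMEOUT_MS:
                kind = "timeout"
            else:
                continue  # healthy call, nothing to report
            events.append({"trace_id": trace_id, "service": r["service"],
                           "host": r["host"], "kind": kind})
    return events


raw = [
    {"trace_id": "t1", "service": "order-list", "host": "10.0.0.1", "elapsed_ms": 3200},
    {"trace_id": "t1", "service": "seller", "host": "127.123.12.12", "elapsed_ms": 3100},
    {"trace_id": "t1", "service": "product", "host": "10.0.0.3", "elapsed_ms": 40},
    # Malformed record without a request ID; removed by cleanse().
    {"trace_id": None, "service": "shop", "host": "10.0.0.4", "elapsed_ms": 15},
    {"trace_id": "t2", "service": "shop", "host": "10.0.0.4", "elapsed_ms": 20,
     "exception": "NullPointerException"},
]

events = emit_events(link_by_trace(cleanse(raw)))
for e in events:
    print(e["trace_id"], e["service"], e["kind"])
```

Emitting only the abnormal records as events is one way to keep storage cost down, since healthy calls never leave this stage.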

Real‑time Analysis

Generates problem path graphs from events. Challenges: real‑time vs offline topology, data loss, analysis accuracy.
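One possible way to turn a trace's fault events into a problem path is sketched below. It assumes each event carries the name of its `caller` (a hypothetical field) and that a service appears at most once per trace. The function also reports whether the path is complete, because lost events leave gaps in the chain; a real system might fill such gaps from an offline topology, which this sketch does not attempt.

```python
def build_problem_path(events):
    """Order one trace's fault events from the entry point to the deepest faulty service.

    Returns (path, complete). complete is False when the events do not form
    a single unbroken chain, for example because an event was lost.
    """
    if not events:
        return [], False
    faulty = {e["service"] for e in events}
    # caller -> callee, kept only where the caller is itself faulty
    next_hop = {e["caller"]: e["service"] for e in events if e["caller"] in faulty}
    # A head is a faulty service whose caller reported no fault event.
    heads = [e["service"] for e in events if e["caller"] not in faulty]
    if not heads:
        return [], False
    path = [heads[0]]
    while path[-1] in next_hop and next_hop[path[-1]] not in path:
        path.append(next_hop[path[-1]])
    complete = len(heads) == 1 and len(path) == len(events)
    return path, complete


full = [
    {"service": "order-list", "caller": None},
    {"service": "seller", "caller": "order-list"},
]
print(build_problem_path(full))   # (['order-list', 'seller'], True)

lossy = [
    {"service": "order-list", "caller": None},
    # Hypothetical downstream store; the seller event in between was lost.
    {"service": "seller-db", "caller": "seller"},
]
print(build_problem_path(lossy))  # (['order-list'], False)
```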

Aggregation Display

Aggregates problem paths in real time to reconstruct the incident scene, balancing query performance, concurrency, and storage cost.
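As a rough illustration of the aggregation step, the sketch below counts the problem paths seen in a recent time window and groups them by the last hop, host, and error kind, so that the most frequent cause surfaces first. The record shape, the 60-second window, and the `homepage` and `10.0.0.x` names are assumptions for the example; an in-memory list stands in for whatever store the real system queries.

```python
from collections import Counter


def aggregate(problem_paths, now, window_s=60):
    """Count recent problem paths by (faulty service, host, kind), most frequent first."""
    recent = [p for p in problem_paths if 0 <= now - p["ts"] <= window_s]
    counts = Counter((p["path"][-1], p["host"], p["kind"]) for p in recent)
    return counts.most_common()


paths = [
    {"ts": 100, "path": ["order-list", "seller"], "host": "127.123.12.12", "kind": "timeout"},
    {"ts": 110, "path": ["order-list", "seller"], "host": "127.123.12.12", "kind": "timeout"},
    {"ts": 115, "path": ["homepage", "seller"], "host": "127.123.12.12", "kind": "timeout"},
    {"ts": 118, "path": ["order-list", "shop"], "host": "10.0.0.4", "kind": "exception"},
    # Older than the window, so it is ignored.
    {"ts": 10, "path": ["order-list", "product"], "host": "10.0.0.3", "kind": "timeout"},
]

result = aggregate(paths, now=120)
for cause, count in result:
    print(cause, count)
# prints:
# ('seller', '127.123.12.12', 'timeout') 3
# ('shop', '10.0.0.4', 'exception') 1
```

Grouping by the last hop rather than the full path lets different entry points that fail for the same reason collapse into one line of the incident view.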

Results

Since deployment, fault localization time has dropped from 10 minutes to under 5 seconds. Example cases: (1) a Xianyu product publish alert was resolved in under 5 s; (2) a homepage slowdown caused by GC on a single machine was identified immediately.

Conclusion

The system focuses on service stability; future work includes richer data sources, comprehensive event abstraction, and building a knowledge graph for end‑to‑end incident handling.

Tags: System Architecture, Big Data, fault localization, Operations, real-time analytics
Written by Xianyu Technology, the official account of the Xianyu technology team.
