
AIOps Implementation at Xiaohongshu: Fault Localization and Intelligent Operations

Xiaohongshu's AIOps initiative builds a four-layer capability framework that combines machine-learning-driven anomaly detection, causal analysis, and trace-based fault localization to automatically identify root-cause services in microservice environments. The system runs more than 1,000 diagnoses per day with over 80% accuracy, and planned work targets change correlation and automated remediation.

Xiaohongshu Tech REDtech

This article gives an overview of Xiaohongshu's AIOps (Artificial Intelligence for IT Operations) implementation, focusing on fault localization in microservice architectures. It begins by defining AIOps as the application of machine learning algorithms to IT operations data to solve problems that traditional tools cannot address, enabling the transition from tool-based to intelligent operations.

The article analyzes the current state of AIOps both in the industry and at Xiaohongshu, noting that while the company has established DevOps foundations with various monitoring and operational tools, intelligent capabilities remain in early stages. The proposed AIOps capability framework consists of four layers: foundational data capabilities (metrics, traces, logs, topology, events), algorithm capabilities (anomaly detection, time series prediction, classification, feature extraction, causal analysis), operational scenario support (fault management, change management, resource management), and collaboration with IaaS/PaaS layers.
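As one illustration of the algorithm layer, the sketch below shows a generic k-sigma detector over a metric series. It is a common baseline and is not claimed to be the detector used at Xiaohongshu; the function name `k_sigma_anomalies`, the window size, the threshold `k`, and the sample latency values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def k_sigma_anomalies(series, window=5, k=3.0):
    """Return indices whose value deviates more than k standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute latency samples (ms) with one spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(k_sigma_anomalies(latency_ms))
```

A production detector would typically also handle seasonality and the inflated variance that follows a spike, which this baseline ignores.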

The evolution strategy focuses on three core directions: stability assurance (anomaly detection, multi-dimensional analysis, similarity analysis, time series prediction, correlation mining, causal analysis), cost management (time series prediction, service profiling, performance optimization), and efficiency improvement (anomaly detection for change control, intelligent scheduling, auto-repair).

The fault localization system is designed for microservice architectures where complex service interactions make root cause identification challenging. The system defines the problem as finding root cause service nodes when business scenarios encounter issues, based on call topology analysis. The solution involves generating call topologies from trace or RPC data, collecting metrics and events, performing anomaly detection using various algorithms (including SR-CNN for change point detection), extracting abnormal topologies through pruning, and applying root cause analysis using RCSF (anomaly frequent itemset mining) combined with expert rules.
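The topology steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the article's RCSF method: it takes the set of anomalous services as given, prunes the call graph to the anomalous part reachable from the entry service, and treats anomalous services with no anomalous downstream callee as root-cause candidates. All function names and service names are hypothetical.

```python
def build_call_graph(spans):
    """Build caller -> set(callees) from (caller, callee) pairs,
    e.g. pairs extracted from trace or RPC data."""
    graph = {}
    for caller, callee in spans:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())
    return graph


def prune_to_abnormal(graph, entry, anomalous):
    """Keep only anomalous services reachable from `entry`
    through a chain of anomalous services."""
    kept = {}
    stack = [entry] if entry in anomalous else []
    while stack:
        node = stack.pop()
        if node in kept:
            continue
        kept[node] = {c for c in graph.get(node, ()) if c in anomalous}
        stack.extend(kept[node])
    return kept


def root_cause_candidates(abnormal_graph):
    """Anomalous services with no anomalous downstream callee."""
    return sorted(n for n, callees in abnormal_graph.items() if not callees)


# Hypothetical call pairs and anomaly flags for one business scenario.
spans = [
    ("gateway", "feed"),
    ("feed", "user"),
    ("feed", "recommend"),
    ("recommend", "feature-store"),
    ("gateway", "search"),
]
anomalous = {"gateway", "feed", "recommend", "feature-store"}

graph = build_call_graph(spans)
abnormal = prune_to_abnormal(graph, "gateway", anomalous)
print(root_cause_candidates(abnormal))
```

The leaf heuristic returns nothing when the anomalous services form a cycle, which is one reason a real system would layer frequent-pattern mining and expert rules on top of the pruned topology, as the article describes.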

The implementation has been deployed across all business lines, covering nearly 100 core scenarios with over 1,000 diagnoses per day. For availability issues, trace-based fault localization achieves above 80% accuracy. Future improvements include optimizing detection capabilities, strengthening change associations, enhancing root cause analysis, expanding diagnostic scope to include the client, access layer, and storage, enabling business-customized diagnostics, and developing automated remediation capabilities.

The article concludes with references to relevant academic research and industry practices.

Tags: microservices, fault localization, Anomaly Detection, DevOps, machine learning, AIOps, root cause analysis, Xiaohongshu, Intelligent Operations, trace analysis
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
