
AIOps Implementation at Xiaohongshu: Fault Localization and Intelligent Operations

Xiaohongshu's AIOps initiative builds a four-layer framework that uses machine-learning-driven anomaly detection, causal analysis, and trace-based fault localization to automatically identify root-cause services in microservice environments. It reports over 80% accuracy across more than 1,000 daily diagnoses and outlines future work on change correlation and automated remediation.

Xiaohongshu Tech REDtech

Oct 9, 2024

This article gives an overview of Xiaohongshu's AIOps (Artificial Intelligence for IT Operations) implementation, with a focus on fault localization in microservice architectures. It begins by defining AIOps as the application of machine learning algorithms to IT operations data to solve problems that traditional tools cannot address, enabling the transition from tool-based to intelligent operations.

The article analyzes the current state of AIOps both in the industry and at Xiaohongshu, noting that while the company has established DevOps foundations with various monitoring and operational tools, intelligent capabilities remain in early stages. The proposed AIOps capability framework consists of four layers: foundational data capabilities (metrics, traces, logs, topology, events), algorithm capabilities (anomaly detection, time series prediction, classification, feature extraction, causal analysis), operational scenario support (fault management, change management, resource management), and collaboration with IaaS/PaaS layers.
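To make the anomaly detection capability in the algorithm layer concrete, here is a minimal sketch of a robust detector based on the median and the median absolute deviation (MAD). It is an illustration only, not the algorithm Xiaohongshu uses; the function name `mad_anomalies`, the threshold of 3.5, and the sample latency values are all assumptions.

```python
from statistics import median


def mad_anomalies(series, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`.

    Hypothetical helper: the modified z-score is 0.6745 * |x - median| / MAD.
    """
    med = median(series)
    mad = median(abs(x - med) for x in series)
    if mad == 0:
        # Degenerate case (assumption): more than half the points are identical,
        # so flag every point that differs from the median.
        return [i for i, x in enumerate(series) if x != med]
    return [
        i for i, x in enumerate(series)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Made-up latency samples in milliseconds, with one obvious spike at index 7.
latency_ms = [102, 98, 101, 99, 103, 100, 97, 480, 101, 99]
print(mad_anomalies(latency_ms))  # [7]
```

The median and MAD are used instead of the mean and standard deviation because a single large spike barely moves them, so the spike itself does not hide the anomaly.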

The evolution strategy focuses on three core directions: stability assurance (anomaly detection, multi-dimensional analysis, similarity analysis, time series prediction, correlation mining, causal analysis), cost management (time series prediction, service profiling, performance optimization), and efficiency improvement (anomaly detection for change control, intelligent scheduling, auto-repair).

The fault localization system is designed for microservice architectures, where complex service interactions make root cause identification difficult. It frames the problem as follows: when a business scenario encounters an issue, find the root cause service nodes by analyzing the call topology. The solution generates call topologies from trace or RPC data, collects metrics and events, performs anomaly detection using various algorithms (including SR-CNN for change point detection), prunes the topology to extract the abnormal sub-topology, and then applies root cause analysis using RCSF (frequent itemset mining over anomalies) combined with expert rules.
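The pruning and localization steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the RCSF algorithm or the expert rules the article mentions: it keeps only the abnormal services reachable from the entry service through abnormal callers, then treats abnormal services with no abnormal downstream dependency as root cause candidates. The function names, service names, and topology are all hypothetical.

```python
from collections import deque


def abnormal_subgraph(calls, abnormal, entry):
    """Prune the call topology to abnormal services reachable from `entry`.

    calls:    dict mapping a service to the services it calls (caller -> callees)
    abnormal: set of services flagged by anomaly detection
    entry:    the service where the business scenario enters the topology
    """
    if entry not in abnormal:
        return {}
    pruned = {}
    queue = deque([entry])
    seen = {entry}
    while queue:
        svc = queue.popleft()
        # Follow only edges that lead to another abnormal service.
        children = [c for c in calls.get(svc, []) if c in abnormal]
        pruned[svc] = children
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return pruned


def root_cause_candidates(pruned):
    """Abnormal services with no abnormal downstream dependency."""
    return sorted(svc for svc, children in pruned.items() if not children)


# Hypothetical topology: gateway -> feed -> recommend -> feature-store.
calls = {
    "gateway": ["feed", "user"],
    "feed": ["recommend", "cache"],
    "recommend": ["feature-store"],
}
abnormal = {"gateway", "feed", "recommend", "feature-store"}

pruned = abnormal_subgraph(calls, abnormal, "gateway")
print(root_cause_candidates(pruned))  # ['feature-store']
```

The leaf heuristic assumes that failures propagate from callee to caller. It returns no candidates if the abnormal services form a call cycle, which is one reason a production system would need further analysis and expert rules on top of it.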

The system has been deployed across all business lines, covering nearly 100 core scenarios and running over 1,000 diagnoses per day. It achieves accuracy above 80% in trace-based fault localization for availability issues. Planned improvements include optimizing detection capabilities, strengthening the association between faults and changes, enhancing root cause analysis, expanding the diagnostic scope to the client, access layer, and storage, supporting business-customized diagnostics, and developing automated remediation.

The article concludes with references to relevant academic research and industry practices.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

