
Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Bilibili Tech

1. Background

With the rapid growth of Bilibili's services, the number of servers has exploded, bringing challenges in fault management due to the sheer scale of hardware. Manual troubleshooting is inefficient and toolchains are fragmented, prompting the need for an efficient, automated approach to maintain platform stability and user experience.

2. Server Faults

2.1 Fault Classification

Faults are divided into software (soft) faults such as file‑system errors and service anomalies, and hardware (hard) faults including disk, NIC, and GPU failures. Additionally, faults can be categorized by repair mode: online (non‑impacting) and offline (requiring downtime).

2.2 Limitations of Traditional Fault Management

Late fault discovery due to reliance on manual checks or user reports.

Low investigation efficiency because of time‑consuming manual root‑cause analysis.

High communication cost between operations and business teams.

Insufficient automation in the repair process, lacking systematic audit and traceability.

These issues motivate a full automation of fault detection and repair.

2.3 Objectives

Eliminate delayed fault discovery and low investigation efficiency through an automated detection pipeline (information collection → rule matching → alert).

Reduce communication overhead and automate the repair workflow, enabling efficient collaboration and end‑to‑end management.
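The detection pipeline from the first objective (information collection → rule matching → alert) can be sketched as a three-stage flow. The names below (Event, match_rules, detect) and the sample rule table are illustrative assumptions, not Bilibili's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Event:
    host: str
    source: str      # "agent", "dmesg", "snmp", "redfish"
    payload: str

@dataclass
class Alert:
    host: str
    fault_code: str
    severity: str

# Illustrative rule table: log keyword -> (fault code, severity).
RULES = {
    "I/O error": ("DISK_IO_ERR", "P1"),
    "Uncorrectable ECC": ("GPU_ECC_ERR", "P0"),
}

def match_rules(event: Event) -> list:
    """Rule matching: map log keywords to standardized fault alerts."""
    return [Alert(event.host, code, sev)
            for kw, (code, sev) in RULES.items()
            if kw in event.payload]

def detect(events: list) -> list:
    """Collection -> rule matching -> alert, end to end."""
    return [alert for event in events for alert in match_rules(event)]
```

In the real system the rule table lives in the rule database and events arrive continuously from the Agent and log platform; this sketch only shows the shape of the flow.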

3. Automated Fault Detection Solution

The detection architecture consists of five core components: an in‑band Agent, a log platform, a detection service, a rule database, and a fault‑management platform.

Agent: lightweight component on each server that gathers hardware status (disk, NIC, GPU) and reports to the detection service.

Log Platform: collects system logs via rsyslog for analysis.

Detection Service: processes both in‑band data (Agent reports, dmesg) and out‑of‑band data (SNMP traps, Redfish API).

Rule Database: stores fault‑detection rules and the alerts generated from them.

Fault‑Management Platform: visualizes alerts for operators.

We then detail in‑band and out‑of‑band information collection.

3.1 Information Collection Methods

Two primary methods are identified:

In‑band collection: uses OS tools to obtain detailed system data but fails when the server crashes.

Out‑of‑band collection: leverages BMC (Redfish API, SNMP traps) to gather hardware status even when the OS is down, though with coarser granularity.

Combining both provides comprehensive monitoring.

3.1.1 In‑band Collection

We developed a custom Agent to supplement kernel logs, gathering disk health (remaining life, bad blocks) and GPU metrics (utilization, ECC errors, power). The Agent invokes tools such as dmidecode, lspci, and vendor‑specific utilities, structures the data, and forwards it to the detection and asset services.
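As a rough illustration of the structuring step, the sketch below parses a `smartctl -A`-style attribute table into events the detection service could consume. It assumes the classic ATA attribute layout and the two attribute names shown; real agents must also handle NVMe output and vendor-specific formats, and the tool invocation itself (subprocess call to smartctl) is elided:

```python
def parse_smart_attributes(output: str) -> dict:
    """Parse a 'smartctl -A'-style ATA attribute table into {name: raw_value}.

    Assumes the attribute name is the 2nd column and the raw value the
    last column of each numeric-ID row.
    """
    attrs = {}
    for line in output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric attribute ID.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = int(parts[-1])
    return attrs

def disk_health_events(attrs: dict) -> list:
    """Turn raw SMART attributes into fault events (names are illustrative)."""
    events = []
    if attrs.get("Reallocated_Sector_Ct", 0) > 0:
        events.append("DISK_BAD_BLOCKS")
    if attrs.get("Media_Wearout_Indicator", 100) < 10:
        events.append("DISK_LIFE_LOW")
    return events
```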

3.1.2 Out‑of‑band Collection

When servers are down, out‑of‑band collection via Redfish API or SNMP traps captures critical fault information. SNMP traps provide high accuracy through vendor‑specific OIDs, while Redfish offers a complementary health‑check mechanism.
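Redfish exposes a standardized Status object (State, Health, HealthRollup) on each resource, which makes the health-check side easy to interpret. The sketch below maps such a payload to the P0–P2 severities used here; the severity mapping is an assumption, and the authenticated HTTPS GET against the BMC (e.g. under /redfish/v1/Systems) is omitted:

```python
from typing import Optional

# Assumed mapping from Redfish Health values to this article's severities.
SEVERITY_BY_HEALTH = {"OK": None, "Warning": "P2", "Critical": "P0"}

def redfish_health_alert(host: str, resource: dict) -> Optional[str]:
    """Map a Redfish Status block to an alert line (None = healthy)."""
    status = resource.get("Status", {})
    # HealthRollup aggregates subcomponent health; fall back to Health.
    health = status.get("HealthRollup", status.get("Health", "OK"))
    severity = SEVERITY_BY_HEALTH.get(health)
    if severity is None:
        return None
    return f"{host}: Redfish health={health} severity={severity}"
```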

3.2 Fault Rule Management

A unified rule database defines fault codes, descriptions, component types, severity levels (P0‑P2), and trigger expressions (log keywords, metric thresholds). This standardization enables rapid fault identification and guided remediation.
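A rule record along these lines could look like the sketch below. The fields mirror those listed above (code, description, component, severity, trigger); the concrete expression formats (a regex over log lines, a threshold over a named metric) are assumptions about how triggers might be encoded:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaultRule:
    code: str                              # e.g. "DISK_IO_ERR"
    description: str
    component: str                         # "disk", "nic", "gpu", ...
    severity: str                          # "P0".."P2"
    log_pattern: Optional[str] = None      # regex over log lines
    metric: Optional[str] = None           # metric name for threshold rules
    threshold: Optional[float] = None      # fire when metric exceeds this

    def matches_log(self, line: str) -> bool:
        return bool(self.log_pattern and re.search(self.log_pattern, line))

    def matches_metric(self, name: str, value: float) -> bool:
        return (self.metric == name and self.threshold is not None
                and value > self.threshold)
```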

4. Automated Repair Solution

4.1 Business Online/Offline Automation

Previously, fault notifications relied on manual WeChat messages, causing delays. The new automated workflow triggers repair tasks upon detection, interacts with business systems via callbacks, and supports both online (no downtime) and offline (requires downtime) repair modes.
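The mode decision and the business callback can be sketched as follows. Which components count as hot-swappable (and therefore online-repairable) is an assumption that in practice depends on chassis and business tolerance, and the callback here is just an injected function standing in for the real business-system API:

```python
# Components assumed hot-swappable (online repair, no downtime).
ONLINE_REPAIRABLE = {"fan", "psu", "hot_swap_disk"}

def choose_repair_mode(component: str) -> str:
    """Pick the repair mode; offline implies draining business traffic first."""
    return "online" if component in ONLINE_REPAIRABLE else "offline"

def create_repair_task(host: str, component: str, notify_business) -> dict:
    """Create a repair task; for offline repairs, ask the business system
    (via callback) to take the host out of service before hardware work."""
    mode = choose_repair_mode(component)
    task = {"host": host, "component": component, "mode": mode,
            "status": "pending_drain" if mode == "offline" else "repairing"}
    if mode == "offline":
        notify_business(host)   # callback: business system drains the host
    return task
```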

4.2 Repair Process Automation

Email/API notifications to relevant parties after task creation.

Automatic asset updates when hardware components are replaced.

Server status auto‑transition (e.g., “under repair”, “delivered”).

Post‑repair health checks to verify restoration.
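The status auto-transitions above can be sketched as a guarded state machine, which is what makes the workflow auditable: every move is validated, and an illegal jump fails loudly. The state names below extend the two quoted in the text ("under repair", "delivered") with assumed neighbors:

```python
# Allowed server-status transitions during the repair lifecycle (assumed).
TRANSITIONS = {
    "in_service":   {"under_repair"},
    "under_repair": {"health_check"},
    "health_check": {"delivered", "under_repair"},  # failed check -> re-repair
    "delivered":    {"in_service"},
}

class ServerLifecycle:
    def __init__(self, status: str = "in_service"):
        self.status = status

    def transition(self, new_status: str) -> None:
        """Validated auto-transition, so the audit trail stays consistent."""
        if new_status not in TRANSITIONS.get(self.status, set()):
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```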

5. Summary and Outlook

5.1 Summary

The presented architecture integrates hardware monitoring, log analysis, fault detection, and repair management, achieving 99% coverage, 99% accuracy, and 95% recall in large‑scale data centers.

5.2 Outlook

Intelligent monitoring using machine learning for proactive fault prediction.

More precise fault localization and faster remediation.

Enhanced security and reliability for hardware and software components.

Tags: Monitoring, Automation, Operations, Infrastructure, Server Fault Management
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.