Server Downtime Diagnosis System: Architecture, Implementation, and Results

The article explains why a downtime diagnosis system is needed, outlines its architecture and implementation methods—including log sources, feature extraction, and API integration—and presents early results showing high automation coverage and significant operational cost savings.

Alibaba Cloud Infrastructure

Oct 22, 2018

As a business grows, both the number of servers and the number of failures increase, so diagnosing the causes of server downtime becomes essential to improving stability.

Why a downtime diagnosis system?

Manual analysis is time‑consuming and limited in scope, it does not accumulate knowledge systematically, and it becomes harder to sustain as server counts rise.

Alibaba's Server System Innovation Team offers a dedicated downtime diagnosis product. It provides API‑based fault analysis and real‑time log monitoring, so known issues are identified automatically and risks can be detected proactively.

Implementation methods

Diagnosis has two prerequisites: logs and log features. Log sources include CONMAN (out‑of‑band serial logs collected via the BMC) and SEL (BMC event logs). Features are extracted from a large volume of historical downtime data and categorized by component, priority, frequency, and time range. Together they cover about 80% of cases.
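
One way to picture a feature library entry is as a small record carrying the categories listed above. The following is a minimal sketch, not the product's actual schema: the `LogFeature` class, its field names, and the sample patterns are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFeature:
    """One entry in a hypothetical feature library (illustrative schema)."""
    pattern: str         # text to look for in a log line
    source: str          # which log it applies to: "CONMAN" or "SEL"
    component: str       # suspected component, e.g. "cpu", "memory"
    priority: int        # lower number = more decisive evidence
    min_count: int       # how often the pattern must appear to count
    window_seconds: int  # how far before the downtime a hit is still relevant


# Sample entries; the patterns are made up for the example.
FEATURE_LIBRARY = [
    LogFeature("Machine check exception", "CONMAN", "cpu", 1, 1, 600),
    LogFeature("Uncorrectable ECC", "SEL", "memory", 1, 1, 3600),
    LogFeature("Kernel panic", "CONMAN", "kernel", 2, 1, 600),
]


def features_for(source):
    """Return the features for one log source, most decisive first."""
    selected = [f for f in FEATURE_LIBRARY if f.source == source]
    return sorted(selected, key=lambda f: f.priority)


if __name__ == "__main__":
    for feature in features_for("CONMAN"):
        print(feature.priority, feature.component, feature.pattern)
```

Keeping priority and time range on the record itself means a diagnosis step can rank competing matches without consulting a separate table.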

The diagnostic workflow relies on the feature library. While the rule set is small, matching can be done by plain string scans over the logs; as the rule set grows, logs are tokenized and matched through an inverted index instead.
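
The two matching strategies can be sketched side by side. This is an illustrative example under stated assumptions, not the system's implementation: the `RULES` table, the rule ids, and all function names are hypothetical. The index variant looks up only the rules that share a token with the log line, then confirms each candidate with a substring check.

```python
import re
from collections import defaultdict

# Hypothetical rule set: rule id -> lowercase pattern text.
RULES = {
    "cpu_mce": "machine check exception",
    "mem_ecc": "uncorrectable ecc error",
    "kernel_panic": "kernel panic",
}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def scan_match(line, rules=RULES):
    """Plain string scan: test every rule against the line."""
    lowered = line.lower()
    return {rule_id for rule_id, pattern in rules.items() if pattern in lowered}


def build_index(rules=RULES):
    """Inverted index: token -> ids of the rules whose pattern contains it."""
    index = defaultdict(set)
    for rule_id, pattern in rules.items():
        for token in tokenize(pattern):
            index[token].add(rule_id)
    return index


def index_match(line, index, rules=RULES):
    """Collect candidate rules via the index, then confirm by substring.

    Candidates are found on whole tokens, so a pattern embedded inside a
    longer word is not picked up here even though scan_match would find it.
    """
    lowered = line.lower()
    candidates = set()
    for token in set(tokenize(line)):
        candidates |= index.get(token, set())
    return {rule_id for rule_id in candidates if rules[rule_id] in lowered}


if __name__ == "__main__":
    index = build_index()
    sample = "Kernel panic - not syncing: Fatal Machine check exception"
    print(sorted(scan_match(sample)))
    print(sorted(index_match(sample, index)))
```

The scan costs one substring test per rule for every line, while the index variant only touches rules that share at least one token with the line, which is why it scales better as the library grows.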

Preliminary results

Automated analysis now covers 95% of scenarios, which saves millions of dollars annually and reduces manual effort. Downtime detection coverage reached 90% within months, allowing quality experts to focus on critical issues.
