
How Alibaba Cloud’s One‑Click I/O Diagnosis Tackles Cloud‑Native I/O Bottlenecks

This article explains how Alibaba Cloud CloudMonitor 2.0 integrates SysOM intelligent diagnosis to automatically detect, analyze, and remediate I/O anomalies in multi‑tenant cloud environments, detailing the architecture, dynamic threshold algorithm, anomaly‑trigger logic, and real‑world case studies.

Alibaba Cloud Native

Background

Rapid growth of AI training data, logs and media in cloud environments leads to a sharp increase in I/O request rates. In multi‑tenant, hybrid‑cloud or multi‑cloud deployments, concurrent access to shared storage creates I/O contention and performance bottlenecks, while the diversity of storage stacks makes fault localization difficult.

Key Technical Challenges

Ambiguous I/O anomaly types – Users cannot easily differentiate latency spikes, throughput saturation or other I/O issues, requiring expert intervention.

Insufficient real‑time evidence – Traditional monitoring captures only generic OS metrics, and by the time an alarm fires the transient condition that caused it may already be gone, leaving nothing to inspect.

Disconnected metric‑to‑diagnosis mapping – Isolated metrics must be manually correlated with diagnostic tools, increasing effort and error.

Solution Overview

Alibaba Cloud CloudMonitor 2.0 together with the SysOM intelligent diagnosis module implements a “detect‑analyze‑remediate” workflow for common I/O abnormal scenarios. The system follows a “monitor‑first, on‑demand capture” model: during a user‑specified time window it periodically reads I/O metrics, triggers a sub‑diagnostic tool when a metric exceeds a dynamic threshold, and generates a structured diagnostic report.

Architecture

The workflow consists of four stages:

Metric collection – At a configurable cycle (in milliseconds) the system reads key I/O metrics such as await, util, tps, iops, qu‑size and iowait.

Dynamic‑threshold anomaly detection – Collected values are compared against a three‑layer threshold (base, compensation, static minimum). An anomaly is flagged when the value exceeds the larger of the dynamic (base + compensation) and static thresholds.

Automatic diagnostic trigger – The system selects the appropriate sub‑diagnostic tool based on the metric type, applies frequency‑control parameters, and executes the analysis.

Result aggregation – Diagnostic output is summarized, visualized and presented with root‑cause insights and remediation suggestions.
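The metric collection stage can be illustrated with a small sketch. The article does not say where the metrics come from, so this example assumes they are derived iostat-style from two samples of Linux's /proc/diskstats. The names DiskSample, parse_diskstats_line and derive_metrics are hypothetical and are not part of CloudMonitor or SysOM.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Cumulative counters taken from one /proc/diskstats line."""
    reads: int        # reads completed
    writes: int       # writes completed
    ms_reading: int   # milliseconds spent reading
    ms_writing: int   # milliseconds spent writing
    ms_io: int        # milliseconds the device was busy doing I/O


def parse_diskstats_line(line):
    """Return (device_name, DiskSample) for a single /proc/diskstats line."""
    f = line.split()
    sample = DiskSample(
        reads=int(f[3]),
        writes=int(f[7]),
        ms_reading=int(f[6]),
        ms_writing=int(f[10]),
        ms_io=int(f[12]),
    )
    return f[2], sample


def derive_metrics(prev, cur, interval_ms):
    """Turn two cumulative samples into per-interval tps, await and util."""
    ios = (cur.reads - prev.reads) + (cur.writes - prev.writes)
    wait_ms = (cur.ms_reading - prev.ms_reading) + (cur.ms_writing - prev.ms_writing)
    busy_ms = cur.ms_io - prev.ms_io
    return {
        "tps": ios * 1000 / interval_ms,                   # I/Os per second
        "await": wait_ms / ios if ios else 0.0,            # average ms per I/O
        "util": min(100.0, busy_ms * 100 / interval_ms),   # percent of time busy
    }
```

Because the counters are cumulative, every metric is computed from the difference between two samples, which is why the collection cycle length has to be known.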

Dynamic Threshold Mechanism

The threshold consists of three components:

Base threshold – Within each sliding window, the algorithm takes the maximum deviation of any data point from the window’s average as that window’s instantaneous fluctuation. Averaging these fluctuations over consecutive windows yields an adaptive baseline.

Compensation threshold – An extra margin added to the base threshold. It keeps the threshold from dropping too quickly when the metric enters a quiet period, which prevents false alarms caused by normal noise.

Minimum static threshold – A business‑defined lower bound. The final alarm threshold is the greater of (base + compensation) and this static value.

This three‑layer design enables detection of short‑lived spikes while keeping false‑positive rates low.
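The three layers can be sketched as follows. The article gives no formula for the compensation term, so this example assumes one: when the base falls, a fraction (decay) of the gap to the previous dynamic threshold is carried over. The class name DynamicThreshold and all of its parameters are hypothetical, and the sketch only computes the threshold; it takes no position on how SysOM implements the comparison internally.

```python
from collections import deque
from statistics import fmean


class DynamicThreshold:
    """Three-layer threshold: base + compensation, floored by a static minimum."""

    def __init__(self, window=5, history=3, static_min=10.0, decay=0.5):
        self.window = deque(maxlen=window)          # most recent data points
        self.fluctuations = deque(maxlen=history)   # fluctuation of recent windows
        self.static_min = static_min                # business-defined lower bound
        self.decay = decay                          # assumed smoothing factor
        self.prev_dynamic = 0.0

    def update(self, value):
        """Add one data point and return the current alarm threshold."""
        self.window.append(value)
        avg = fmean(self.window)
        # Instantaneous fluctuation: largest deviation from the window average.
        self.fluctuations.append(max(abs(v - avg) for v in self.window))
        # Base threshold: average fluctuation over consecutive windows.
        base = fmean(self.fluctuations)
        # Assumed compensation: slow the decline when the base drops.
        compensation = max(0.0, self.prev_dynamic - base) * self.decay
        dynamic = base + compensation
        self.prev_dynamic = dynamic
        return max(dynamic, self.static_min)
```

With this assumed compensation rule, the threshold climbs immediately after a spike but steps down gradually once the metric is quiet, until the static minimum takes over.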

Implementation Details

During each cycle the system performs:

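# Sketch of the monitoring loop; the four callables are placeholders supplied by the caller
import time

def monitor(window_s, cycle_ms, read_io_metrics, compute_threshold,
            can_trigger_diagnosis, run_subdiagnostic):
    deadline = time.monotonic() + window_s  # end of the user-specified diagnosis window
    while time.monotonic() < deadline:
        for m in read_io_metrics():
            # compute_threshold should return max(base + compensation, static minimum)
            if m.value > compute_threshold(m) and can_trigger_diagnosis(m):
                run_subdiagnostic(m)
        time.sleep(cycle_ms / 1000)  # the cycle is in milliseconds; sleep() takes seconds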

Frequency Control

triggerInterval – Minimum interval (seconds) between two diagnoses of the same type to avoid repeated scans.

reportInterval – Number of anomaly occurrences required after the cool‑down period before a diagnosis is launched. When set to 0, any anomaly after the cool‑down triggers immediate diagnosis.
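The interaction of the two parameters can be sketched as a small gate. This is one reading of the descriptions above, not SysOM's actual code: anomalies during the cool‑down are ignored, and after it a diagnosis starts once reportInterval anomalies have been counted (a value of 0 is treated as "trigger on the first one"). The class FrequencyControl and the method should_trigger are hypothetical names, and the timestamp is passed in explicitly to keep the example deterministic.

```python
class FrequencyControl:
    """Gate for one diagnosis type, driven by triggerInterval and reportInterval."""

    def __init__(self, trigger_interval, report_interval):
        self.trigger_interval = trigger_interval  # cool-down in seconds
        self.report_interval = report_interval    # anomalies needed after cool-down
        self.last_trigger = None                  # time of the last diagnosis
        self.pending = 0                          # anomalies counted since cool-down

    def should_trigger(self, now):
        """Call once per detected anomaly; return True if a diagnosis should run."""
        in_cooldown = (
            self.last_trigger is not None
            and now - self.last_trigger < self.trigger_interval
        )
        if in_cooldown:
            return False  # assumed: anomalies during cool-down are not counted
        self.pending += 1
        if self.pending >= max(self.report_interval, 1):
            self.last_trigger = now
            self.pending = 0
            return True
        return False
```

Keeping one gate per diagnosis type matches the description of triggerInterval as applying to "two diagnoses of the same type".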

Root‑Cause Analysis

After data capture, the system automatically extracts structured insights:

Identify processes that contribute the most I/O (IO burst contributors).

Highlight paths or devices with the highest latency.

Pinpoint processes and reasons for elevated iowait (e.g., disk saturation, slow dirty‑page flushing).
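The first item, finding the processes that contribute the most I/O, can be approximated outside the product with per‑process counters. The sketch below assumes the data comes from Linux's /proc/<pid>/io (its read_bytes and write_bytes fields) sampled twice. The article does not describe how SysOM attributes I/O to processes, and the function names here are hypothetical.

```python
def parse_proc_io(text):
    """Extract (read_bytes, write_bytes) from the contents of /proc/<pid>/io."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return int(fields["read_bytes"]), int(fields["write_bytes"])


def top_io_processes(before, after, n=3):
    """Rank processes by bytes read plus written between two samples.

    before and after map pid -> (read_bytes, write_bytes). A pid missing from
    `before` is treated as having started during the interval, so its
    cumulative counters are used as the delta.
    """
    deltas = []
    for pid, (r1, w1) in after.items():
        r0, w0 = before.get(pid, (0, 0))
        deltas.append((pid, (r1 - r0) + (w1 - w0)))
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:n]
```

Exact key matching matters in parse_proc_io because the file also contains a cancelled_write_bytes field whose name ends in write_bytes.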

Case Studies

High iowait

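A customer observed an overall slowdown in response times. The diagnostic report identified the task_server process as waiting on disk I/O and recommended lowering dirty_ratio and dirty_bytes to reduce write‑back pressure. Note that the Linux kernel treats vm.dirty_ratio and vm.dirty_bytes as alternatives: writing one resets the other to 0, so only one of them is in effect at a time.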

High I/O latency

Another case showed sustained write‑latency spikes. The analysis pinpointed DiskBlockWrite as the dominant load and suggested adjusting dirty_ratio and dirty_background_ratio to control dirty‑page flushing, thereby reducing latency.
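Before changing any of the write‑back parameters mentioned in these cases, it can help to record their current values. The sketch below reads them from /proc/sys/vm, the standard Linux location for the vm.* sysctls. The function name read_writeback_settings and the base parameter are hypothetical, and the sketch only reads values; it does not apply the report's recommendations.

```python
from pathlib import Path

# Write-back sysctls referenced in the case studies, plus their *_bytes counterparts.
WRITEBACK_KEYS = (
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_background_bytes",
)


def read_writeback_settings(base="/proc/sys/vm"):
    """Return the current write-back settings; unreadable keys map to None."""
    settings = {}
    for key in WRITEBACK_KEYS:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings
```

A value of 0 for dirty_bytes or dirty_background_bytes means the corresponding ratio setting is the one currently in effect.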

References

IO One‑Click Diagnosis: https://help.aliyun.com/zh/cms/cloudmonitor-2-0/io-key-diagnosis

SysOM System Diagnosis: https://cmsnext.console.aliyun.com/next/region/cn-shanghai/workspace/default-cms-1808078950770264-cn-shanghai/app/host/host-sysom
