
System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Amap Tech

With the rapid development of internet engineering, distributed, microservice, and containerized architectures have ushered the industry into a cloud-native age. Systems have evolved from monolithic applications to distributed services in which a single server instance may live for only a few hours or minutes, making it increasingly difficult to visualize system state.

Gaode's ride-hailing business has followed the same path, transitioning from a monolithic architecture to a service-oriented one. To keep such a massive, complex system highly performant, highly available, and highly controllable, a 360° multi-dimensional observability capability is essential.

1. What is System Observability?

Observability is a concept popularized in recent years within the monitoring community, originating from Google's SRE practices and Cindy Sridharan's blog post "Monitoring and Observability". It is not a specific tool or technology but a philosophy, and it has become a key component of managing complex distributed systems. Observability refers to the ability to understand, query, explore, and orchestrate a running system, enabling engineers to discover, locate, and resolve issues.

The three pillars of observability widely adopted in the industry are:

Logging: detailed events that explain system state.

Tracing: distributed request tracing that reconstructs the full request flow.

Metrics: aggregated monitoring data visualized for quick insight.

2. Observability vs. Monitoring

Observability != Monitoring

Monitoring focuses on machine‑driven observation of system behavior and output, primarily for detecting and alerting on issues. Observability, on the other hand, emphasizes a self‑examination perspective, allowing engineers to understand why a problem occurs and to ask new questions. Key differences include:

Focus: Monitoring targets specific metric changes and alerts (point‑centric), while observability aims at holistic system understanding (point‑line‑plane).

Timeframe: Monitoring covers incident detection and the post-mortem window (typically a day or two), whereas observability spans the entire development and operation lifecycle.

Goal: Monitoring tells you what happened; observability tells you why it happened and enables deeper investigation.

Monitoring is a subset of observability; the two complement each other.

3. What We Did

Gaode Ride‑Hailing implemented several concrete practices to build an observability system:

3.1 Unified Logging

Logs are categorized into three types (monitoring logs, business logs, and error logs), and a dedicated SDK enforces consistent log formats. Monitoring logs are isolated for alerting, use a fixed "|" delimiter, and carry standardized status labels (success, fail, error). Business logs capture key identifiers (order ID, user ID), descriptive information, and optional auxiliary data. Error logs follow a uniform schema, with placeholders for missing fields.

// Start a monitoring log; the key set and the start timestamp are fixed here
MonitorLog mlog = MonitorLog.start("access", "url", "httpcode", "bizcode");
try {
    // doSomething1...
    // Mark the time scope once step 1 completes
    mlog.addTimeScope("time1");
    // doSomething2...
    if (succeeded) {
        mlog.success(url, httpStatus, responseCode);
    } else {
        mlog.fail(url, httpStatus, responseCode);
    }
} catch (Exception e) {
    mlog.error(url, httpStatus, responseCode);
}
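To make the fixed "|" delimiter concrete, here is a minimal sketch of how such a monitoring log line could be assembled. The field order, field names, and the "-" placeholder convention are illustrative assumptions, not the actual SDK schema.

```java
// Sketch of a pipe-delimited monitoring log line.
// Field order and names are hypothetical, not the real SDK schema.
public class MonitorLogLine {
    public static String format(String point, String status,
                                long costMs, String... fields) {
        StringBuilder sb = new StringBuilder();
        sb.append(point).append('|')
          .append(status).append('|')   // standardized labels: success / fail / error
          .append(costMs);
        for (String f : fields) {
            // a placeholder keeps the column count stable when a field is missing
            sb.append('|').append(f == null || f.isEmpty() ? "-" : f);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("access", "success", 42, "/v1/order", "200", null));
        // prints: access|success|42|/v1/order|200|-
    }
}
```

Keeping the delimiter and column order fixed is what allows the alerting pipeline to parse these lines without per-service configuration.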

3.2 Distributed Tracing

TraceId is generated using Alibaba’s EagleEye solution, ensuring uniqueness across the entire request chain. The TraceId is propagated through monitoring, service, and error logs, enabling full‑link reconstruction via Alibaba Cloud SLS.
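The propagation pattern can be sketched as follows. EagleEye's actual id format and transport (it rides on RPC headers) are Alibaba-internal, so this sketch substitutes a random UUID and a thread-local holder; the class and method names are hypothetical.

```java
import java.util.UUID;

// Simplified sketch of TraceId propagation. EagleEye's real id format
// and header-based transport are Alibaba-internal; a UUID stands in here.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Adopt the id carried by an upstream request, so all logs share it
    public static void inherit(String upstreamId) {
        TRACE_ID.set(upstreamId);
    }

    // Generate a new id at the entry point if none was passed in upstream
    public static String getOrCreate() {
        String id = TRACE_ID.get();
        if (id == null) {
            id = UUID.randomUUID().toString().replace("-", "");
            TRACE_ID.set(id);
        }
        return id;
    }

    // Always clear at request end to avoid leaking ids across pooled threads
    public static void clear() {
        TRACE_ID.remove();
    }
}
```

Because every monitoring, service, and error log line stamps this id, a single query in the log store can reassemble the full request chain.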

3.3 Monitoring Governance (Metrics)

The monitoring system is divided into five domains:

Infrastructure monitoring (CPU, memory, load, I/O, disk).

Middleware monitoring (using each middleware’s native metrics).

Application & Business monitoring (request volume, latency, success rate – the three golden metrics).

Financial loss monitoring (data consistency, fund safety, especially during promotions).

Monitoring dashboards (layout design, multi‑metric views).

Principles include avoiding "checkbox" monitoring, ensuring each alert is actionable, and using clear naming conventions.
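The three golden metrics for application and business monitoring can be derived from a single stream of request records. A minimal sketch of such an aggregator (class and method names are illustrative, not part of the actual platform):

```java
// Sketch of the three golden metrics: request volume, latency, success rate.
// Names are illustrative; a real system would aggregate per time window.
public class GoldenMetrics {
    private long count;
    private long successCount;
    private long totalLatencyMs;

    public synchronized void record(long latencyMs, boolean success) {
        count++;
        totalLatencyMs += latencyMs;
        if (success) successCount++;
    }

    public synchronized long requestVolume() {
        return count;
    }

    public synchronized double avgLatencyMs() {
        return count == 0 ? 0 : (double) totalLatencyMs / count;
    }

    public synchronized double successRate() {
        return count == 0 ? 1.0 : (double) successCount / count;
    }
}
```

In practice a percentile latency (p99) is more alert-worthy than the average shown here, since averages hide tail degradation.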

4. Metric Correlation, Topology, and Visualization

After establishing metric hierarchies (primary, secondary, tertiary), visualizations such as dashboards, charts, and topology graphs are built. Correlated metrics are color‑coded; when an alert fires, the full call chain can be traced via the associated TraceId.

5. Observability‑Driven Incident Workflow

Detect the problem via primary metric alerts (e.g., order latency spikes).

Drill down through secondary and tertiary metrics to pinpoint the root cause (e.g., downstream data service latency, host overload).

Mitigate the issue (e.g., replace faulty machines).

Document the resolution as a runbook.
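The drill-down step above can be sketched as a walk over the metric hierarchy: starting from the alerted primary metric, follow whichever children are also breached until reaching the deepest breached metric, the candidate root cause. The metric names and the boolean breach map below are illustrative assumptions.

```java
import java.util.*;

// Sketch of drilling from an alerted primary metric down through
// breached secondary/tertiary metrics. All names are illustrative.
public class DrillDown {
    public static List<String> drill(String metric,
                                     Map<String, List<String>> children,
                                     Map<String, Boolean> breached) {
        List<String> causes = new ArrayList<>();
        for (String child : children.getOrDefault(metric, Collections.emptyList())) {
            if (breached.getOrDefault(child, false)) {
                causes.addAll(drill(child, children, breached));
            }
        }
        // no breached child: this metric is the deepest signal we have
        if (causes.isEmpty()) causes.add(metric);
        return causes;
    }
}
```

For example, an order-latency alert whose downstream-latency child and, below it, database-latency child are both breached would resolve to the database metric as the candidate cause.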

6. Fault Defense Capability

With fine‑grained observability, automated and intelligent decision‑making becomes possible:

Change defense orchestration (different monitoring for business vs. ops changes).

Automated change tracking and notification.

Real‑time inspection of infrastructure metrics.

Active defense with automatic root‑cause recommendation and AI‑driven incident handling.

Full‑domain high‑precision observability enabling unattended fault self‑healing.

Conclusion

In the cloud‑native era, observability is the foundation of system stability. A well‑designed observability stack simplifies complex system states, enhances fault detection and resolution, and builds confidence in large‑scale distributed services.

Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.