
Comprehensive Log Governance and Mining Solution for Distributed Systems

This article presents a comprehensive log governance and mining solution for distributed systems, covering background challenges, usage scenarios, and detailed strategies such as distributed log integration, front‑end/back‑end traceability, standardized log management, large‑payload handling, efficient cleaning, and future plans for componentization and sampling.

Ctrip Technology

Author Introduction

Seren, senior R&D manager at Zhixing, is responsible for business system architecture upgrades and optimization, and for tracking industry trends and technology directions.

Phoenix, senior backend development engineer at Zhixing, focuses on system performance optimization and business data governance and mining, continuously driving business development.

1. Background

Logs, as faithful records of system operation, are not only powerful tools for problem tracing but also a compass for performance tuning. By deeply analyzing logs we can uncover every detail of system behavior, quickly locate issues, and optimize performance. Logs also serve as important data for analysis and decision‑making, yet several difficulties remain in the R&D process.

To troubleshoot problems we often need to record massive amounts of logs; correlating all logs of a client request while ensuring completeness and continuity is challenging.

Logging objects that are too large can cause frequent garbage collection, leading to server instability.

Loss of core logs makes problem diagnosis extremely difficult.

Although log transmission is already asynchronous, excessive sending still consumes CPU, memory and storage resources, and redundant logs waste valuable storage.

To address these issues, this article proposes a log governance and mining solution that standardizes, normalizes, and unifies log handling to further extract the latent value of system logs.

2. Considerations

We have organized log usage scenarios into four directions: metric monitoring, trace‑based troubleshooting, scenario‑based problem localization, and data analysis & reporting (real‑time/offline).

Metric Monitoring: Real‑time tracking and analysis of key indicators such as core business logic, third‑party interface responses, and data validity, ensuring the system runs in an optimal state.

Trace‑Based Troubleshooting: In complex business scenarios, when request processing exceptions, system errors, or business logic mismatches occur, precise log‑based tracing becomes essential for quickly locating and resolving faults.

Scenario‑Based Problem Localization: By deeply analyzing logs, monitoring data, and user feedback for specific scenarios, we can narrow down the root cause of issues.

Data Analysis & Reporting: Leveraging log‑derived data for real‑time insight enables accurate problem detection, rapid resolution, and clear communication of analysis results to teams, decision‑makers, and BI stakeholders.

3. Solution

3.1 Integration and Correlation of Distributed System Logs

Linking logs across components in a distributed environment requires several key steps:

1) Unique Identifier Generation: Generate a globally unique identifier (traceId) at the start of each business request, which persists throughout the request lifecycle and tags all related logs.

2) Identifier Propagation: Ensure the traceId is carried across all services and components as the request flows.

We generate the traceId at the request entry point, store it in a thread‑safe context, and each component retrieves the traceId from this context when logging, achieving intra‑service log correlation.

When invoking downstream services, the current request’s traceId is added to a custom header, allowing downstream components to extract and store it in their own context, thus achieving inter‑service log correlation.
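The scheme above can be sketched with a `ThreadLocal`‑backed context. The class and header names here (`TraceContext`, `X-Trace-Id`) are illustrative assumptions, not the production implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of the traceId scheme: generate at the entry point,
// store in a thread-safe context, inject into downstream request headers.
public class TraceContext {
    // Hypothetical header name carrying the traceId between services.
    public static final String TRACE_HEADER = "X-Trace-Id";

    // Thread-safe, per-request storage for the current traceId.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Called at the request entry point: reuse an incoming traceId if an
    // upstream service sent one, otherwise generate a globally unique id.
    public static String begin(Map<String, String> incomingHeaders) {
        String traceId = incomingHeaders.get(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString();
        }
        CURRENT.set(traceId);
        return traceId;
    }

    // Any component on this thread tags its logs with the same id.
    public static String current() {
        return CURRENT.get();
    }

    // Before calling a downstream service, copy the traceId into the
    // outgoing request headers so the callee can pick it up.
    public static Map<String, String> inject(Map<String, String> outgoingHeaders) {
        outgoingHeaders.put(TRACE_HEADER, current());
        return outgoingHeaders;
    }

    // Clear the context when the request finishes, so ids do not leak
    // across pooled worker threads.
    public static void end() {
        CURRENT.remove();
    }
}
```

In a real service the `begin`/`end` pair would live in a request filter or interceptor, so business code only ever calls `current()`.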

3.2 Front‑Back End Log Integration Solution

The goal is to link front‑end and back‑end logs so that a user’s operation lifecycle can be traced end‑to‑end. Key steps:

1) Generate a unique identifier (traceId) at the request entry point (e.g., API gateway or load balancer).

2) Pass traceId to the front‑end: Include the generated traceId in the response header; the front‑end extracts and stores it.

3) Front‑end returns traceId: On the next request, the front‑end sends the stored traceId back to the server, enabling correlation with the previous request.
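The three steps above amount to an echo handshake, sketched here with plain maps standing in for HTTP headers; the header names (`X-Trace-Id`, `X-Prev-Trace-Id`) are assumptions for the example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch of the front-end/back-end traceId handshake.
public class FrontBackTrace {
    public static final String TRACE_HEADER = "X-Trace-Id";
    public static final String PREV_TRACE_HEADER = "X-Prev-Trace-Id";

    // Steps 1 and 2: the entry point generates a traceId and returns it
    // to the front end in a response header. If the front end echoed the
    // previous request's id, the server can link the two requests.
    public static Map<String, String> handleRequest(Map<String, String> requestHeaders) {
        String traceId = UUID.randomUUID().toString();
        Map<String, String> responseHeaders = new HashMap<>();
        responseHeaders.put(TRACE_HEADER, traceId);
        String prev = requestHeaders.get(PREV_TRACE_HEADER);
        if (prev != null) {
            responseHeaders.put(PREV_TRACE_HEADER, prev);
        }
        return responseHeaders;
    }

    // Step 3: the front end stores the id and sends it on its next request.
    public static Map<String, String> nextRequest(Map<String, String> lastResponse) {
        Map<String, String> headers = new HashMap<>();
        headers.put(PREV_TRACE_HEADER, lastResponse.get(TRACE_HEADER));
        return headers;
    }
}
```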

3.3 Unified Standard Log Management and Modularization

This solution builds a unified, maintainable, and extensible logging, querying, and analysis system to improve observability and troubleshooting.

1) Layered Log Standards: Classify logs into layers (application, business logic, data access, external interface, etc.) and record key information per layer.

2) Unified Log Format: Define standard fields such as timestamp, traceId, level, source, request info (method, URL, parameters), response info (status code, data), and exception stack.

3) Unified Ingestion Method: Standardize collection, transmission, and storage processes, specifying data formats (JSON, XML) and transport mechanisms.

4) Log Analysis Tools: Develop tools for processing, statistical analysis, visualization, and anomaly detection.

5) Security & Performance Optimization: Encrypt sensitive data, compress logs to reduce storage and network overhead.
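The unified format in point 2) can be sketched as a small serializer; the field names follow the list above, but the hand-rolled JSON is only illustrative (a production system would use a real JSON library):

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a unified log entry with the standard fields from the text.
public class UnifiedLog {
    public static String format(String traceId, String level, String source,
                                String message) {
        // LinkedHashMap keeps field order stable across all services.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", Instant.now().toString());
        fields.put("traceId", traceId);
        fields.put("level", level);
        fields.put("source", source);
        fields.put("message", message);
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }
}
```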

3.4 Fine‑Grained Management of Large‑Payload Logs

Handling large log messages requires preserving completeness and readability while minimizing performance impact.

1) Identify Large Logs: Set memory usage thresholds based on system limits and business needs; treat logs exceeding the threshold as large objects.

2) Asynchronous Compression & Sending: Use a thread pool to offload compression and transmission to background tasks, keeping the main service responsive.

3) Choose Appropriate Compression Algorithm: Use efficient algorithms such as Gzip or ZSTD that balance compression ratio and speed, ensuring they can handle large payloads without causing memory overflow.

These steps ensure integrity and readability of large logs while reducing performance impact.
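The three steps can be sketched with a size threshold, a background thread pool, and gzip from the JDK; the 64 KB threshold and the `submit` API are illustrative assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.GZIPOutputStream;

// Sketch of large-payload handling: logs above a size threshold are
// compressed on a background pool so the request thread stays responsive.
public class LargeLogHandler {
    static final int THRESHOLD_BYTES = 64 * 1024;          // step 1: threshold
    static final ExecutorService POOL = Executors.newFixedThreadPool(2);

    // Step 3: gzip balances compression ratio and speed, and streams the
    // payload instead of holding an expanded copy in memory.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Step 2: large payloads are compressed asynchronously; small logs
    // pass through unchanged.
    public static CompletableFuture<byte[]> submit(String message) {
        byte[] raw = message.getBytes(StandardCharsets.UTF_8);
        if (raw.length < THRESHOLD_BYTES) {
            return CompletableFuture.completedFuture(raw);
        }
        return CompletableFuture.supplyAsync(() -> {
            try {
                return gzip(raw);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, POOL);
    }
}
```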

3.5 Efficient Log Cleaning and Multi‑Dimensional Analysis

The cleaning and analysis solution extracts useful information, transforms logs into structured data, and supports downstream monitoring, troubleshooting, and reporting.

1) Log Processing: Collect raw logs, decrypt/compress as needed, filter out irrelevant or duplicate entries.

2) Key Field Extraction: Use the Aviator engine to run extraction scripts, pulling fields such as interface name and success flag, then standardize them.

3) Dimensional Aggregation: Define dimensions (time, event type, etc.) and aggregate statistics (counts, averages, max) stored in ClickHouse for query.

4) Visualization: Design dashboards and reports that present key metrics and trends, with interactive features for deep analysis.
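Steps 2) and 3) can be sketched as extraction followed by dimensional counting. The `name=...|success=...` line format is an assumption for the example; the article's pipeline uses Aviator scripts for extraction and ClickHouse for storage:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: extract key fields from cleaned log lines, then aggregate
// counts per (interface, success) dimension.
public class LogAggregator {
    // Step 2: key field extraction from one cleaned log line.
    static String[] extract(String line) {
        String name = null, success = null;
        for (String part : line.split("\\|")) {
            String[] kv = part.split("=", 2);
            if (kv.length != 2) continue;
            if (kv[0].equals("name")) name = kv[1];
            if (kv[0].equals("success")) success = kv[1];
        }
        return new String[] {name, success};
    }

    // Step 3: aggregate counts by dimension; in production these rows
    // would be written to ClickHouse for querying.
    public static Map<String, Long> aggregate(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String[] fields = extract(line);
            if (fields[0] == null || fields[1] == null) continue; // filter junk
            String dimension = fields[0] + "/" + fields[1];
            counts.merge(dimension, 1L, Long::sum);
        }
        return counts;
    }
}
```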

4. Future Planning

Componentization: Provide a zero‑intrusion integration approach where logging components can be added without modifying existing systems, preserving stability and simplifying onboarding.

Configuration: Extend key process interfaces to allow flexible, customizable component configuration for diverse business scenarios.

Sampling: Apply sampling to core logs to retain representative data while dramatically reducing storage costs.

Resource Consumption Reduction: Optimize serialization formats, batch log sending, and other techniques to lower transmission volume and network overhead.
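One common way to realize the sampling idea, shown here as an assumption rather than the planned design, is deterministic hash‑based sampling on the traceId, so all logs of a kept trace are retained together and dropped traces leave no fragments:

```java
// Sketch: trace-consistent sampling. The same traceId always yields the
// same keep/drop decision, so sampled traces remain complete.
public class TraceSampler {
    private final int percent; // keep rate, 0..100 (10% is illustrative)

    public TraceSampler(int percent) {
        this.percent = percent;
    }

    public boolean shouldKeep(String traceId) {
        // floorMod keeps the bucket non-negative even for negative hashes.
        int bucket = Math.floorMod(traceId.hashCode(), 100);
        return bucket < percent;
    }
}
```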
