
Design and Evolution of Ctrip Ticket Frontend Trace System for Efficient Debugging

The article describes how Ctrip built and continuously improved a Trace system for its ticket‑front‑end microservices, detailing the challenges of distributed logs, the architecture of the solution, and the functional features such as friendly search, multi‑platform aggregation, page replay, and one‑click mock that together boost debugging efficiency for both developers and non‑technical operators.

Ctrip Technology

Author Introduction

Devin and Hank are senior backend engineers at Ctrip, focusing on Java development and automation research.

1. Introduction

With the widespread adoption of micro‑service architecture, the resulting complex distributed network supports massive queries but also introduces troubleshooting difficulties, especially when users encounter "system exceptions" during ticket booking.

Developers and testers often receive only the exception notice, making it hard to locate the root cause in the intricate call chain. Relying solely on point‑to‑point logs is inefficient, particularly for UI‑layer issues.

The solution is Ctrip's Ticket Front‑end Trace system, which aims to improve investigation efficiency and lower the usage barrier for non‑developers.

2. Trace System Development History

2.1 Original Log‑Based DevOps

Multiple micro‑services generate logs across many topics and dashboards.

Manual aggregation of logs is required to reconstruct user behavior.

Log compression formats differ between teams.

Service names are developer‑centric and hard to read.

Internal systems are not integrated, requiring manual operations.

2.2 Foundation Construction

Ctrip’s ticket front‑end already has a mature logging and automation infrastructure.

Three log types support daily development and operations:

UBT logs – user behavior logs, used to investigate user-reported issues.

Metrics logs – business metrics that inform future requirement planning.

Trace logs – informational logs such as request payloads and error messages.
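As a rough sketch, the three log types can share a common record envelope; the field names below are illustrative, not Ctrip's actual schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class LogRecord:
    """Common envelope shared by UBT, metrics, and trace logs (illustrative)."""
    log_type: str   # "ubt" | "metric" | "trace"
    user_id: str
    timestamp: float = field(default_factory=time.time)
    payload: dict = field(default_factory=dict)

# A trace log carrying a request payload and an error message:
rec = LogRecord(log_type="trace", user_id="u123",
                payload={"request": {"flight": "CA123"}, "error": "TIMEOUT"})
```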

Automation facilities (Mock platform and interface automation platform) greatly reduce operational costs.

2.3 Chrome‑Extension Based Trace Tool

The Chrome extension automates repetitive tasks like log decompression, formatting, and quick copying.
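The article does not specify the wire format, but assuming logs arrive as base64-encoded, gzip-compressed JSON (a common convention), the extension's decompress-and-format step might look like:

```python
import base64
import gzip
import json

def decode_log(compressed: str) -> str:
    """Decompress a base64+gzip log payload and pretty-print it as JSON."""
    raw = gzip.decompress(base64.b64decode(compressed))
    return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)

# Round-trip demo: compress a payload the way a client might, then decode it.
payload = {"orderId": "123", "status": "SYSTEM_ERROR"}
blob = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()
formatted = decode_log(blob)
```

Automating this one step removes the copy-paste-into-a-decoder loop that otherwise dominates log reading.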

2.4 Problems Encountered

Even with this foundation and the Chrome extension in place, several problems remained:

How to visualize call relationships clearly from raw logs.

How to query expired logs beyond the Elasticsearch retention window.

How to retrieve the target micro-service quickly from search criteria.

How to reconstruct user booking scenarios with high fidelity.

How to integrate with other systems (Mock platform, user-behavior platform, etc.).

How to make technically obscure information readable.

3. System Design

The architecture consists of six layers:

Business layer – front‑end main flow, hotel‑flight business, value‑added services, IBU, etc.

Web layer – search conditions, data display, session replay, one‑click Mock.

Search engine layer – builds ClickHouse SQL from web‑provided conditions.

Data processing layer – assembles ClickHouse data, handles one‑click Mock requests, and returns results to the web front‑end.

Configuration layer – ClickHouse search configs and business field configs.

Log recording – logs generated by each business application.
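The search engine layer's job, translating web-provided conditions into ClickHouse SQL, can be sketched as follows; the table and column names (`trace_logs`, `uid`, `order_id`, `ts`) are hypothetical:

```python
def build_trace_sql(conditions: dict, table: str = "trace_logs"):
    """Map friendly search conditions to a parameterized ClickHouse query.

    Returns (sql, params) in pyformat style, as accepted by common
    ClickHouse Python drivers. Column names are illustrative.
    """
    column_map = {"userId": "uid", "orderId": "order_id"}
    where, params = [], {}
    for key, value in conditions.items():
        col = column_map.get(key)
        if col:
            where.append(f"{col} = %({col})s")
            params[col] = value
    if "from" in conditions and "to" in conditions:
        where.append("ts BETWEEN %(ts_from)s AND %(ts_to)s")
        params.update(ts_from=conditions["from"], ts_to=conditions["to"])
    sql = f"SELECT * FROM {table} WHERE " + " AND ".join(where) + " ORDER BY ts"
    return sql, params

sql, params = build_trace_sql(
    {"userId": "u123", "from": "2023-01-01", "to": "2023-01-02"})
```

Keeping the friendly-name-to-column mapping in configuration (the configuration layer above) lets each business line add fields without code changes.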

4. Functional Overview

4.1 Friendly Search Conditions

Search criteria are simplified and business‑wrapped to match user habits, reducing noise compared with raw Kibana queries.

4.2 Log Viewing

Logs are stored in Elasticsearch and ClickHouse; the Elasticsearch data is viewed through Kibana, but Kibana exposes raw queries and leaves aggregation to the user, which hinders usability. The Trace system adds noise reduction and hierarchical log processing on top, with automatic decompression and formatting.
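Hierarchical log processing of the kind described can be sketched by grouping flat span records into a call tree; the `span_id`/`parent_id` field names are assumptions:

```python
from collections import defaultdict

def build_call_tree(logs):
    """Render flat span logs as an indented call hierarchy (sketch)."""
    children = defaultdict(list)
    roots = []
    for log in logs:
        if log.get("parent_id") is None:
            roots.append(log)
        else:
            children[log["parent_id"]].append(log)

    def render(node, depth=0):
        lines = ["  " * depth + node["service"]]
        for child in children[node["span_id"]]:
            lines.extend(render(child, depth + 1))
        return lines

    return [line for root in roots for line in render(root)]

tree = build_call_tree([
    {"span_id": "1", "parent_id": None, "service": "booking-gateway"},
    {"span_id": "2", "parent_id": "1", "service": "fare-search"},
    {"span_id": "3", "parent_id": "1", "service": "seat-map"},
])
```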

4.3 Multi‑Platform Aggregation

Different log types reside on separate platforms; the system uses external links to correlate search data (time, user ID) across platforms, avoiding tight coupling that would increase system complexity.
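Correlating platforms through external links rather than embedding them keeps coupling loose; a minimal link builder, with a hypothetical URL and parameter names, might look like:

```python
from urllib.parse import urlencode

def external_link(base_url: str, user_id: str, start_ms: int, end_ms: int) -> str:
    """Build a deep link that carries the current search context (user + time
    window) to another log platform. URL and parameter names are hypothetical."""
    query = urlencode({"uid": user_id, "from": start_ms, "to": end_ms})
    return f"{base_url}?{query}"

link = external_link("https://ubt.example.com/search",
                     "u123", 1700000000000, 1700000600000)
```

The receiving platform only has to honor the query parameters; neither side needs to know the other's internals.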

4.4 Cross‑Business Scenario Aggregation & Expired Log Compensation

One search can aggregate multiple business lines (main flow, low‑price subscription, value‑added products) without manual channel selection. Expired ClickHouse logs are compensated via Hive queries.
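The compensation logic amounts to routing a query by log age; a sketch assuming a 7-day ClickHouse retention window (the real window is not stated in the article):

```python
from datetime import datetime, timedelta
from typing import Optional

CLICKHOUSE_RETENTION = timedelta(days=7)  # assumed retention window

def pick_log_store(query_start: datetime, now: Optional[datetime] = None) -> str:
    """Route to ClickHouse while logs are hot; fall back to Hive once the
    requested range is older than the retention window."""
    now = now or datetime.now()
    return "clickhouse" if now - query_start <= CLICKHOUSE_RETENTION else "hive"

now = datetime(2024, 1, 10)
```

The caller sees one search interface; the slower Hive path is only taken when the hot store can no longer answer.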

4.5 Page Replay Based on CRN_Web

The system collects all service requests for a user session, mocks the responses, and replays the page with CRN_Web, giving operations staff a high-fidelity view of what the user originally saw.
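The record-then-replay mechanism can be sketched as a map from request signatures to captured responses; the keying scheme below is an assumption:

```python
import json

class ReplaySession:
    """Capture service responses during a user session, then serve them back
    during replay so the page renders exactly as the user saw it (sketch)."""

    def __init__(self):
        self._responses = {}

    @staticmethod
    def _key(service: str, request: dict) -> str:
        # Key on service name + canonicalized request body.
        return service + "|" + json.dumps(request, sort_keys=True)

    def record(self, service: str, request: dict, response: dict) -> None:
        self._responses[self._key(service, request)] = response

    def replay(self, service: str, request: dict) -> dict:
        return self._responses[self._key(service, request)]

session = ReplaySession()
session.record("fare-search", {"flight": "CA123"}, {"price": 980})
```

Because every response is pinned, the replayed page is deterministic even after live prices and inventory have changed.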

4.6 One‑Click Mock

Instead of manually configuring each interface in the Mock system after locating logs in Kibana, the Trace system automatically extracts all involved calls and configures them in seconds, reducing a ten‑minute task to under five seconds.
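One-click mock boils down to iterating over the calls captured in a trace and registering each with the mock platform; the `register` API below is a stand-in, not the real platform's interface:

```python
def one_click_mock(trace_logs, mock_client):
    """Extract every service call captured in a trace and register its
    recorded response with the mock platform in one pass (sketch)."""
    configured = []
    for log in trace_logs:
        mock_client.register(service=log["service"],
                             request=log["request"],
                             response=log["response"])
        configured.append(log["service"])
    return configured

class FakeMockClient:  # stand-in for the real mock-platform client
    def __init__(self):
        self.rules = []

    def register(self, **rule):
        self.rules.append(rule)

client = FakeMockClient()
services = one_click_mock(
    [{"service": "fare-search", "request": {"q": 1}, "response": {"ok": True}}],
    client,
)
```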

4.7 Key Information Exposure

Error codes are translated into user‑friendly messages.
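A sketch of the translation-table approach, with invented codes and messages:

```python
# Illustrative mapping; the real codes and wording differ.
ERROR_MESSAGES = {
    "B1001": "Fare no longer available; please search again.",
    "P2003": "Payment gateway timed out; no charge was made.",
}

def translate(code: str) -> str:
    """Turn an internal error code into a message operators can act on."""
    return ERROR_MESSAGES.get(code, f"Unknown error ({code}); escalate to dev team.")
```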

4.8 Integrated Reporting System

The Trace system links to the front‑end reporting system, passing exception context (user ID, scenario) to retrieve related logs.

4.9 External Links to Other Platforms

External linking enables data stitching across multiple tools without embedding them directly, preserving the Trace system’s focus on call‑chain tracing.

5. Summary

The Trace system addresses daily development and operations pain points by lowering the cost of debugging, improving self‑service rates for non‑developers, and delivering high‑fidelity page replay, one‑click mock, and multi‑platform aggregation. Since its launch, it has achieved significant productivity gains and reduced both human and communication costs.
