Improving Log Replay Efficiency with Flink and Elasticsearch at Ctrip Ticket Frontend
The article describes how Ctrip's ticket front‑end team replaced a slow, manual log‑pulling process with a Flink‑based real‑time pipeline that streams Kafka data, indexes it in Elasticsearch, and enables second‑level log retrieval for automated scenario replay, dramatically reducing CI cycle time.
Background
As Ctrip's ticket business grew, manual regression testing could no longer keep up with the growing number of test cases. The existing log‑pulling solution, which cached logs in Redis, took half a day per release and had become a bottleneck for continuous integration.
Introduction
The team introduced a CI pipeline that includes unit tests, traffic replay, and case verification. Traffic replay requires realistic online request results, which are achieved by mocking third‑party services using large volumes of online logs.
Scenario Replay
To cover online business scenarios, user reservation flows are instrumented with trace points that record logs. These logs are then used to mock SOA interfaces and A/B test results, allowing the system to replay and verify responses against real online data.
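The mocking step described above can be sketched roughly as follows. This is an illustrative assumption, not the team's actual interface: the class, the `(interface, request_key)` lookup, and the field names are all hypothetical, standing in for however the recorded logs are actually keyed.

```python
class RecordedSoaMock:
    """Serve recorded online responses in place of live SOA calls
    during scenario replay. Illustrative sketch only."""

    def __init__(self, recorded_logs):
        # Index recorded responses by (interface name, request key),
        # both hypothetical field names for this sketch.
        self._responses = {
            (log["interface"], log["request_key"]): log["response"]
            for log in recorded_logs
        }

    def call(self, interface, request_key):
        # During replay, return the recorded response instead of
        # calling the real third-party service.
        return self._responses[(interface, request_key)]


# Usage: replay a recorded FlightSearch response without a live call.
mock = RecordedSoaMock([
    {"interface": "FlightSearch", "request_key": "req-1",
     "response": {"price": 100}},
])
print(mock.call("FlightSearch", "req-1"))
```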
Refactoring Plan
The new solution uses Flink to consume Kafka streams in real time. Each request is assigned a unique ID that links the main service logs with the corresponding SOA logs. The combined logs are transformed into searchable keywords and stored in Elasticsearch, whose Lucene‑based inverted index enables fast retrieval. A backup indexing strategy is also prepared in case Elasticsearch does not meet expectations.
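The linking and keyword-transformation steps might look roughly like the sketch below. The field names (`request_id`, `soa_calls`, `CaseTag`) and the splitting logic are assumptions for illustration, not the team's actual schema; the article's example output also shows several bracketed values accumulated per tag position, presumably across multiple logs, whereas this sketch handles a single log.

```python
from collections import defaultdict


def merge_by_request_id(main_logs, soa_logs):
    """Group SOA logs under the main-service log that shares the same
    unique request ID, yielding one combined document per request."""
    soa_by_id = defaultdict(list)
    for log in soa_logs:
        soa_by_id[log["request_id"]].append(log)
    return [
        {**main, "soa_calls": soa_by_id.get(main["request_id"], [])}
        for main in main_logs
    ]


def explode_case_tag(doc, prefix="c_cus_ct_"):
    """Split a pipe-delimited CaseTag into per-position keyword fields
    so each position can be matched independently in Elasticsearch."""
    for pos, value in enumerate(doc.get("CaseTag", "").split("|")):
        doc[f"{prefix}{pos}"] = f"[{value}];"
    return doc


# Usage: combine one main log with its SOA calls, then index its tags.
docs = merge_by_request_id(
    [{"request_id": "r1", "CaseTag": "11|0|0|0|1|3|1"}],
    [{"request_id": "r1", "interface": "FlightSearch"}],
)
print(explode_case_tag(docs[0])["c_cus_ct_0"])
```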
Example of log tag transformation:
{"CaseTag": "11|0|0|0|1|3|1"}After processing, the tags become:
{"c_cus_ct_0": "[1];[2];[8];[11];",
"c_cus_ct_1": "[0];",
"c_cus_ct_2": "[0];",
"c_cus_ct_3": "[0];",
"c_cus_ct_4": "[1];[1];",
"c_cus_ct_5": "[1];[2];[3];",
"c_cus_ct_6": "[1];[1];"}Effect of the New Scheme
Previously, preparing logs for traffic replay took over four hours. With the new indexing approach, log retrieval for each scenario is completed in seconds, enabling near‑real‑time replay and significantly improving release efficiency.
Considerations When Using Flink and Elasticsearch
Flink’s Stream API can run out of memory when processing 1–2 TB of logs per day; the team therefore sets the TaskManager JVM heap to around 7 GB and recommends running on YARN for cluster reliability. Elasticsearch auto‑creates index mappings by default, but field names containing dots must be handled carefully, because dots are interpreted as object‑field paths and can cause mapping conflicts. Since Elasticsearch runs on the JVM, keep the heap below about 32 GB so the JVM can still use compressed object pointers (compressed oops); above that threshold, pointers become uncompressed and much of the extra heap is wasted.
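One common way to sidestep the dotted-field-name problem, sketched here as an assumption rather than the team's actual fix, is to rewrite field names before documents reach Elasticsearch:

```python
def sanitize_field_names(doc, sep="_"):
    """Recursively replace dots in field names so that Elasticsearch's
    auto-created mapping does not reinterpret 'a.b' as a nested object
    path and collide with an existing field type."""
    if isinstance(doc, dict):
        return {key.replace(".", sep): sanitize_field_names(value, sep)
                for key, value in doc.items()}
    if isinstance(doc, list):
        return [sanitize_field_names(item, sep) for item in doc]
    return doc


# Usage: "soa.latency" becomes "soa_latency" at every nesting level.
print(sanitize_field_names({"soa.latency": {"p99.ms": 120}}))
```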
Ctrip Technology
The official Ctrip Technology account, sharing technical practice and discussion.