
Real-Time Data Processing Frameworks and Kafka Practices at Ctrip Ticketing

This article examines Ctrip Ticketing's real-time data processing ecosystem, comparing batch and streaming frameworks such as Hadoop, Spark, Storm, Flink, and Spark Streaming, detailing Kafka deployment and configuration, and describing how these technologies are applied in production for log analysis, seat-occupancy detection, and anti-crawling.

Ctrip Technology

Author Introduction: Zhang Zhenhua is a senior software engineer in Ctrip's ticketing R&D department who focuses on building and operating the big-data platform for Ctrip ticketing and on developing real-time and batch applications.

Real‑time Data Landscape Ctrip ticketing generates massive real‑time data including user behavior logs, service request/response logs, and external flight data from GDS, which are crucial for troubleshooting, anomaly detection, and user behavior analysis.

Bounded vs. Unbounded Data Data is classified as bounded (e.g., files in HDFS) or unbounded (e.g., active Kafka topics). Correspondingly, bounded data is handled with batch processing and unbounded data with stream processing.

Batch and Stream Frameworks Mature batch frameworks include Hadoop and Spark, while popular stream frameworks are Storm, Spark Streaming, and Flink.

Historical Perspective Batch processing emerged earlier to meet historical data needs; the rise of high‑throughput internet services later drove rapid development of real‑time stream processing frameworks.

1. Stream Processing Frameworks Stream frameworks can be implemented as per‑record processing (Storm, Flink) or micro‑batch processing (Spark Streaming). Storm and Flink process each record immediately, whereas Spark Streaming groups records into small batches.
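The difference between the two models can be sketched with plain Python callables standing in for a stream and a handler; this is an illustrative toy, not any framework's API:

```python
def process_per_record(stream, handle):
    """Per-record model (Storm/Flink style): each event is handled
    as soon as it arrives, minimizing latency."""
    for record in stream:
        handle([record])          # one record per invocation

def process_micro_batch(stream, handle, batch_size=3):
    """Micro-batch model (Spark Streaming style): records are
    buffered into small batches, trading latency for throughput."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)         # one small batch per invocation
            batch = []
    if batch:
        handle(batch)             # flush the final partial batch

# Compare how many handler invocations each model makes.
events = list(range(7))
calls_per_record, calls_micro_batch = [], []
process_per_record(events, calls_per_record.append)
process_micro_batch(events, calls_micro_batch.append, batch_size=3)
print(len(calls_per_record))   # 7 invocations, one per record
print(len(calls_micro_batch))  # 3 invocations: batches of 3, 3, 1
```

In a real deployment the batch boundary is a time interval (Spark Streaming's batch duration) rather than a fixed count, which is why micro-batching cannot go below roughly second-level latency.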

Framework Comparison Flink offers the lowest latency and highest throughput with native exactly‑once guarantees via state snapshots. Spark Streaming provides exactly‑once via WAL and RDD semantics but is limited to second‑level latency due to micro‑batching. Storm achieves sub‑second latency but only at‑least‑once unless using Trident for exactly‑once.
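As a rough intuition for why checkpointing enables exactly-once state, here is a toy Python sketch; an in-memory list stands in for a Kafka partition, and the `checkpoint` dict stands in for durable storage. Real mechanisms (Flink's distributed snapshots, Spark Streaming's WAL) are far more involved:

```python
# Persist (offset, state) atomically after each record; on restart,
# resume from the last checkpoint so replayed records do not
# double-count. This is the recovery idea only, not a real protocol.

log = [5, 3, 7, 2, 8]          # the unbounded source, as a list
checkpoint = {"offset": 0, "total": 0}

def run(crash_after=None):
    """Consume from the last checkpoint; optionally crash mid-run."""
    offset, total = checkpoint["offset"], checkpoint["total"]
    for i in range(offset, len(log)):
        total += log[i]
        if crash_after is not None and i == crash_after:
            return total        # crash BEFORE checkpointing this record
        checkpoint["offset"], checkpoint["total"] = i + 1, total
    return total

run(crash_after=2)              # fails while processing index 2
result = run()                  # restart replays index 2 onward
print(result)                   # 25: every record counted exactly once
```

The replay after the crash is at-least-once delivery; it is the atomic (offset, state) checkpoint that upgrades the observable result to exactly-once, which is the same reason Storm alone stops at at-least-once while Trident, Flink, and Spark Streaming can do better.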

Choosing a Framework For sub‑second latency requirements, Storm or Flink (especially Flink) are preferred; for less stringent latency, Spark Streaming is suitable because of its ecosystem maturity and SQL support.

2. Kafka Kafka, a distributed publish/subscribe messaging system originally developed at LinkedIn, stores data in partitioned logs on broker nodes coordinated by ZooKeeper. Its high throughput stems from sequential disk writes and zero-copy sendfile optimizations. A recommended production configuration includes CentOS 7.1, 12-core CPUs, 48 GB RAM, 4 × 4 TB disks, the G1 garbage collector, and appropriate ZooKeeper settings.
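Kafka's storage model (partitioned append-only logs, with each record addressed by a per-partition offset) can be sketched in a few lines. The `Topic` class and key-hash partitioning here are illustrative stand-ins, not Kafka's actual implementation:

```python
# Toy model of a Kafka topic: a set of append-only partition logs.
# Each record gets a monotonically increasing offset within its
# partition; consumers read by (partition, offset) and track their
# own position. Key-hash partition choice is an assumption for
# illustration (Kafka's default partitioner behaves similarly for
# keyed records).

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append to the partition chosen by key hash; return (partition, offset)."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # offset = position in the log

    def consume(self, partition, offset):
        """Reads are random access by offset within a sequential log."""
        return self.partitions[partition][offset]

t = Topic("flight-search-logs", num_partitions=3)
p, off = t.produce("user-42", "search CTU->PEK")
p2, off2 = t.produce("user-42", "search CTU->SHA")
assert p2 == p and off2 == off + 1   # same key: same partition, next offset
```

Because writes only ever append to the tail of a partition log, the disk access pattern is sequential, which together with sendfile-based zero-copy reads is the source of the throughput mentioned above.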

Operational Tips for Kafka Key practices include matching partition count to broker count, careful topic size reduction, increasing max.message.bytes for large payloads, enabling compression, expanding partitions when scaling, isolating a dedicated SOA write service, exposing JMX for monitoring, cleaning dead consumers, enabling auto.leader.rebalance, and tuning num.io.threads.
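Several of these tips map directly onto broker settings. A hedged sketch of the relevant `server.properties` entries might look like the following; the values are illustrative, not Ctrip's actual production configuration, and JMX is typically exposed by exporting `JMX_PORT` in the broker's environment rather than in this file:

```properties
# Illustrative broker settings reflecting the tips above.

# Tune I/O threads to the workload and disk count.
num.io.threads=8

# Broker-level cap for large payloads; the per-topic analogue is
# max.message.bytes, mentioned in the article.
message.max.bytes=10485760

# Keep whatever compression the producer applied.
compression.type=producer

# Periodically move partition leadership back to preferred replicas.
auto.leader.rebalance.enable=true

# Spread partition logs across the machine's data disks.
log.dirs=/data1/kafka,/data2/kafka,/data3/kafka,/data4/kafka
```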

3. Ctrip Ticketing Real-Time Data Processing Architecture The architecture uses Storm and Spark Streaming (Flink is not used) to satisfy different latency needs. Spark Streaming parses ticket search logs from a Kafka Direct Stream, expands compressed payloads, stores results in Hive (≈60 billion records daily), and supports A/B-testing metrics. Storm processes order logs to detect seat-occupancy fraud, writing results to Redis for the ordering service.

Additional Components Redis + Presto enables dynamic SQL‑based monitoring by storing per‑minute Kafka data as JSON lists keyed by timestamps. Logstash forwards service logs to Elasticsearch, with a Redis‑based secondary index for fast cross‑log correlation. Front‑end telemetry and backend access logs are synced to TimescaleDB for anti‑crawling rules and machine‑learning‑driven IP detection.
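The per-minute Redis storage scheme can be sketched with a dict standing in for Redis lists; the key format and event fields here are hypothetical, and in production the appends would be Redis `RPUSH` calls queried through Presto's Redis connector:

```python
import json
from collections import defaultdict

# Events drained from Kafka are appended as JSON strings to a list
# keyed by their minute timestamp, so a SQL engine can later scan a
# time range key by key. A plain dict stands in for Redis here.

store = defaultdict(list)            # key -> list, like Redis RPUSH

def record_event(minute_key, event):
    """Append one event (as JSON) under its per-minute key."""
    store[minute_key].append(json.dumps(event))

def query_minute(minute_key):
    """Fetch and decode all events for one minute."""
    return [json.loads(e) for e in store[minute_key]]

record_event("2019-06-01T10:05", {"route": "CTU-PEK", "latency_ms": 120})
record_event("2019-06-01T10:05", {"route": "CTU-SHA", "latency_ms": 95})
events = query_minute("2019-06-01T10:05")
print(len(events))                                   # 2
print(sum(e["latency_ms"] for e in events) / 2)      # 107.5 average latency
```

Bucketing by minute keeps each key small and lets monitoring SQL touch only the keys inside the queried time window instead of scanning everything.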

Conclusion With the rapid evolution of open‑source big‑data frameworks, selecting the most suitable technology for specific use cases is essential, and continuous monitoring of community developments remains important.

Tags: Big Data, real-time processing, Flink, Kafka, Spark Streaming, Storm, streaming frameworks
Written by Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.