
Practical Experience of Using Apache Doris for Real‑Time Data Warehouse at Tongcheng Data Science

This article details how Tongcheng Data Science built a real‑time analytical data warehouse using Apache Doris, covering business scenarios, the evolution from a legacy 1.0 architecture to a Doris‑based 2.0 design, deployment topology, development workflow, operational benefits, and future roadmap.

DataFunSummit

The presentation introduces the business background of Tongcheng Data Science, a tourism‑industry financial service platform that serves over ten million users across 76 cities, and outlines the four major data‑driven requirements: real‑time dashboards, alerting, analytical queries, and financial reconciliation.

It then describes the original 1.0 architecture, which relied on a mix of StreamSets, Kudu, Impala, Kafka, and Flink, highlighting its advantages (quick CDH integration, visual configuration) and shortcomings (complex component maintenance, long development cycles, poor join performance, resource contention, limited auto‑recovery).

The 2.0 architecture replaces the legacy stack with Apache Doris, using Canal for CDC, Kafka for transport, and Doris Routine Load for ingestion; Doris also simplifies data loading from Hive via Broker Load. Key reasons for choosing Doris include rich data‑source connectors, MySQL‑compatible protocol, MPP parallelism, comprehensive documentation, and strong SQL support.
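To make the ingestion path concrete, a Doris Routine Load job that continuously consumes a Kafka topic might look like the sketch below. The database, table, topic, and broker names are illustrative, not taken from the talk:

```sql
-- Continuously consume CSV messages from Kafka into a Doris table.
-- Names (example_db, order_detail, order_topic, brokers) are hypothetical.
CREATE ROUTINE LOAD example_db.order_detail_load ON order_detail
COLUMNS TERMINATED BY ",",
COLUMNS (order_id, user_id, amount, pay_time)
PROPERTIES
(
    "desired_concurrent_number" = "3",   -- parallel consumer tasks
    "max_batch_interval" = "20"          -- flush at most every 20 s
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "order_topic",
    "property.group.id" = "doris_order_detail",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```

Once created, the job runs continuously inside Doris; its state can be inspected with `SHOW ROUTINE LOAD`, which removes the need for a separate ingestion service between Kafka and the warehouse.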

Doris deployment consists of a two‑layer FE (frontend) and BE (backend) topology that can be installed independently of existing big‑data components, offering high availability, easy scaling, and straightforward operations. The migration to Doris was performed via a rolling upgrade of twelve nodes over three days, with careful handling of FE/BE scaling and decommissioning.
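Scaling and decommissioning of the kind described above are driven entirely by SQL statements against the FE, which is part of why operations are straightforward. A minimal sketch, with hostnames as placeholders (BE heartbeat port 9050 and FE edit-log port 9010 are the defaults):

```sql
-- Add a new BE node to the cluster.
ALTER SYSTEM ADD BACKEND "new_be_host:9050";

-- Gracefully retire a BE: data is migrated off before removal.
ALTER SYSTEM DECOMMISSION BACKEND "old_be_host:9050";

-- Add FE replicas for high availability.
ALTER SYSTEM ADD FOLLOWER "fe_host2:9010";
ALTER SYSTEM ADD OBSERVER "fe_host3:9010";

-- Check node status during a rolling upgrade.
SHOW PROC '/backends';
```

Because `DECOMMISSION` drains tablets before a node is removed, a rolling upgrade such as the twelve-node migration can proceed node by node without downtime.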

The real‑time system architecture built on Doris integrates data from multiple business lines through Canal and API ingestion, streams data into Kafka, and loads it into Doris via Routine Load. Within Doris, data is organized into three layers: DWD (detail), DWS (summary), and ADS (application), using unique and aggregate models to reduce ETL effort.
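The unique and aggregate models mentioned above are declared per table. A hedged sketch of what the DWD and DWS layers might look like (table and column names are invented for illustration):

```sql
-- DWD detail layer: unique model, so re-ingested rows with the same
-- key overwrite earlier versions (upsert semantics).
CREATE TABLE dwd_order_detail (
    order_id    BIGINT,
    user_id     BIGINT,
    amount      DECIMAL(18, 2),
    update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");

-- DWS summary layer: aggregate model, so Doris pre-aggregates on
-- ingest and no separate summarization ETL job is needed.
CREATE TABLE dws_order_summary (
    dt           DATE,
    city_id      INT,
    order_cnt    BIGINT SUM DEFAULT "0",
    total_amount DECIMAL(18, 2) SUM DEFAULT "0"
)
AGGREGATE KEY(dt, city_id)
DISTRIBUTED BY HASH(city_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");
```

With the aggregate model, rows sharing the same `AGGREGATE KEY` are merged on load using the declared functions (`SUM` here), which is how the layered design reduces ETL effort compared with hand-written aggregation jobs.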

Benefits observed after the migration include much faster data ingestion (3-5 minutes per table, down from 20-30 minutes), accelerated ETL development thanks to Doris's built-in table models, sub-second query latency via materialized views and rollup indexes, and lower operational cost than the Hadoop-based warehouse.
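The materialized views and rollups behind those sub-second queries are also plain SQL. A sketch, assuming a hypothetical summary table `dws_order_summary(dt, city_id, total_amount)`:

```sql
-- Synchronous materialized view: Doris maintains it on ingest and
-- the optimizer rewrites matching queries to use it automatically.
CREATE MATERIALIZED VIEW mv_daily_city_amount AS
SELECT dt, city_id, SUM(total_amount)
FROM dws_order_summary
GROUP BY dt, city_id;

-- Rollup index: a sorted prefix-index copy of selected columns,
-- served transparently for queries that only touch them.
ALTER TABLE dws_order_summary ADD ROLLUP r_city (city_id, total_amount);
```

Neither object has to be referenced explicitly in queries; the planner picks the view or rollup when the query shape matches, so dashboards gain the speedup without SQL changes.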

Future plans involve introducing Doris Manager for cluster management, adopting Flink‑CDC for data ingestion (the 3.0 roadmap), upgrading the Doris cluster with new features, and strengthening metric management and data‑quality monitoring.
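As a rough illustration of the planned Flink-CDC direction, a Flink SQL source table over MySQL binlogs might be declared as below; connection details are placeholders, and this is a sketch of the connector's documented options rather than Tongcheng's actual configuration:

```sql
-- Flink SQL source backed by the mysql-cdc connector: reads binlogs
-- directly, removing the Canal + Kafka hop from the ingestion path.
CREATE TABLE mysql_orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(18, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql_host',
    'port'          = '3306',
    'username'      = 'flink_user',
    'password'      = '******',
    'database-name' = 'trade',
    'table-name'    = 'orders'
);
```

A Flink job can then stream this changelog into Doris via the Flink Doris connector, which is the shape the 3.0 architecture points toward.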

Tags: Big Data · ETL · real-time data warehouse · data architecture · Apache Doris
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
