
Practical Experience of Using Apache Doris for Real‑Time Data Warehouse at Tongcheng Data Science

This article details how Tongcheng Data Science built a real‑time analytical data warehouse using Apache Doris, covering business scenarios, the evolution from a legacy 1.0 architecture to a Doris‑based 2.0 design, deployment topology, development workflow, operational benefits, and future roadmap.

DataFunSummit

The presentation introduces the business background of Tongcheng Data Science, a tourism‑industry financial service platform that serves over ten million users across 76 cities, and outlines the four major data‑driven requirements: real‑time dashboards, alerting, analytical queries, and financial reconciliation.

It then describes the original 1.0 architecture, which relied on a mix of StreamSets, Kudu, Impala, Kafka, and Flink, highlighting its advantages (quick CDH integration, visual configuration) and shortcomings (complex component maintenance, long development cycles, poor join performance, resource contention, limited auto‑recovery).

The 2.0 architecture replaces the legacy stack with Apache Doris, using Canal for CDC, Kafka for transport, and Doris Routine Load for ingestion; Doris also simplifies data loading from Hive via Broker Load. Key reasons for choosing Doris include rich data‑source connectors, MySQL‑compatible protocol, MPP parallelism, comprehensive documentation, and strong SQL support.
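To make the ingestion path concrete, a Doris Routine Load job that continuously consumes a Kafka topic might look like the sketch below. The database, table, topic, and broker names are illustrative, not taken from the talk:

```sql
-- Continuously consume CSV messages from Kafka into a Doris table.
-- Names (example_db, order_detail, order_topic, brokers) are hypothetical.
CREATE ROUTINE LOAD example_db.order_detail_load ON order_detail
COLUMNS TERMINATED BY ",",
COLUMNS (order_id, user_id, amount, pay_time)
PROPERTIES
(
    "desired_concurrent_number" = "3",   -- parallel consumer tasks
    "max_batch_interval" = "20"          -- flush at most every 20 s
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "order_topic",
    "property.group.id" = "doris_order_detail",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```

Once created, the job runs continuously inside Doris; its state can be inspected with `SHOW ROUTINE LOAD`, which removes the need for a separate ingestion service between Kafka and the warehouse.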

Doris deployment consists of a two‑layer FE (frontend) and BE (backend) topology that can be installed independently of existing big‑data components, offering high availability, easy scaling, and straightforward operations. The migration to Doris was performed via a rolling upgrade of twelve nodes over three days, with careful handling of FE/BE scaling and decommissioning.
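Scaling and decommissioning of the kind described above are driven entirely by SQL statements against the FE, which is part of why operations are straightforward. A minimal sketch, with hostnames as placeholders (BE heartbeat port 9050 and FE edit-log port 9010 are the defaults):

```sql
-- Add a new BE node to the cluster.
ALTER SYSTEM ADD BACKEND "new_be_host:9050";

-- Gracefully retire a BE: data is migrated off before removal.
ALTER SYSTEM DECOMMISSION BACKEND "old_be_host:9050";

-- Add FE replicas for high availability.
ALTER SYSTEM ADD FOLLOWER "fe_host2:9010";
ALTER SYSTEM ADD OBSERVER "fe_host3:9010";

-- Check node status during a rolling upgrade.
SHOW PROC '/backends';
```

Because `DECOMMISSION` drains tablets before a node is removed, a rolling upgrade such as the twelve-node migration can proceed node by node without downtime.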

The real‑time system architecture built on Doris integrates data from multiple business lines through Canal and API ingestion, streams data into Kafka, and loads it into Doris via Routine Load. Within Doris, data is organized into three layers: DWD (detail), DWS (summary), and ADS (application), using unique and aggregate models to reduce ETL effort.
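The unique and aggregate models mentioned above are declared per table. A hedged sketch of what the DWD and DWS layers might look like (table and column names are invented for illustration):

```sql
-- DWD detail layer: unique model, so re-ingested rows with the same
-- key overwrite earlier versions (upsert semantics).
CREATE TABLE dwd_order_detail (
    order_id    BIGINT,
    user_id     BIGINT,
    amount      DECIMAL(18, 2),
    update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");

-- DWS summary layer: aggregate model, so Doris pre-aggregates on
-- ingest and no separate summarization ETL job is needed.
CREATE TABLE dws_order_summary (
    dt           DATE,
    city_id      INT,
    order_cnt    BIGINT SUM DEFAULT "0",
    total_amount DECIMAL(18, 2) SUM DEFAULT "0"
)
AGGREGATE KEY(dt, city_id)
DISTRIBUTED BY HASH(city_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");
```

With the aggregate model, rows sharing the same `AGGREGATE KEY` are merged on load using the declared functions (`SUM` here), which is how the layered design reduces ETL effort compared with hand-written aggregation jobs.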

Benefits observed after the migration include much faster data ingestion (3-5 minutes per table, down from 20-30 minutes), accelerated ETL development thanks to Doris's built-in table models, sub-second query latency via materialized views and rollup indexes, and lower operational cost than the Hadoop-based warehouse.
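The materialized views and rollups behind those sub-second queries are also plain SQL. A sketch, assuming a hypothetical summary table `dws_order_summary(dt, city_id, total_amount)`:

```sql
-- Synchronous materialized view: Doris maintains it on ingest and
-- the optimizer rewrites matching queries to use it automatically.
CREATE MATERIALIZED VIEW mv_daily_city_amount AS
SELECT dt, city_id, SUM(total_amount)
FROM dws_order_summary
GROUP BY dt, city_id;

-- Rollup index: a sorted prefix-index copy of selected columns,
-- served transparently for queries that only touch them.
ALTER TABLE dws_order_summary ADD ROLLUP r_city (city_id, total_amount);
```

Neither object has to be referenced explicitly in queries; the planner picks the view or rollup when the query shape matches, so dashboards gain the speedup without SQL changes.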

Future plans involve introducing Doris Manager for cluster management, adopting Flink‑CDC for data ingestion (the 3.0 roadmap), upgrading the Doris cluster with new features, and strengthening metric management and data‑quality monitoring.
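As a rough illustration of the planned Flink-CDC direction, a Flink SQL source table over MySQL binlogs might be declared as below; connection details are placeholders, and this is a sketch of the connector's documented options rather than Tongcheng's actual configuration:

```sql
-- Flink SQL source backed by the mysql-cdc connector: reads binlogs
-- directly, removing the Canal + Kafka hop from the ingestion path.
CREATE TABLE mysql_orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(18, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql_host',
    'port'          = '3306',
    'username'      = 'flink_user',
    'password'      = '******',
    'database-name' = 'trade',
    'table-name'    = 'orders'
);
```

A Flink job can then stream this changelog into Doris via the Flink Doris connector, which is the shape the 3.0 architecture points toward.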

Tags: Big Data · ETL · real-time data warehouse · data architecture · Apache Doris
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
