
Building a Real‑Time Data Warehouse with Apache Doris: Architecture, Benefits, and Lessons Learned

This article details how a fast‑growing supply‑chain platform migrated from MySQL and Hive to Apache Doris for real‑time analytics, describing the architectural evolution, the advantages of the new design, practical implementation steps, encountered challenges, and the performance and cost benefits achieved.

Big Data Technology Architecture

Wu Fan, head of data at WuYi Cloud, introduces the business background: WuYi Cloud has become a leading domestic industry‑internet supply‑chain platform with annual transaction volume exceeding 200 billion CNY, and rapid growth created urgent real‑time data analysis requirements.

Initially the company used MySQL as a BI warehouse and later a CDH‑based stack (Canal → Kafka → HBase → Hive) to handle incremental loads, but the solution could not keep up with data volume, suffered from stale data, and could only provide T+1 offline results.

In 2021 the team evaluated several products and selected Apache Doris as a real‑time data warehouse. Doris enables direct stream loading from MySQL binlog via Flink CDC, supports real‑time OLAP, and can replace Hive offline jobs.

New Architecture Advantages

Simple data processing pipeline: Flink CDC reads both incremental and full data and writes to Doris via Stream Load.

One source, real‑time full data: eliminates duplicated MySQL‑based storage and provides up‑to‑date data for reporting.

Easy deployment and maintenance compared with a Hadoop ecosystem.

One‑click full‑database ingestion: the in‑house data‑easy platform generates Flink CDC jobs on YARN and automatically applies online schema changes.

Second‑level query latency: Doris queries finish in seconds where the equivalent Hive queries took minutes, reducing the overall batch processing window from hours to under two hours.
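To make the Stream Load step in the pipeline above more concrete, here is a minimal sketch of the HTTP request a writer would issue for one batch of rows. It only builds the request and does not send it. The host name, database, table, and credentials are hypothetical, and the function `build_stream_load_request` is illustrative rather than part of the platform described here. The endpoint path and the `label`, `format`, and `Expect` headers follow the documented Stream Load interface; rows are serialized as tab‑separated lines, on the assumption that values contain no tabs or newlines.

```python
import base64
import uuid


def build_stream_load_request(fe_host, db, table, rows, user, password,
                              http_port=8030, label=None):
    """Describe (but do not send) a Doris Stream Load request for one batch."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    # One line per row, tab-separated (the default CSV column separator).
    # Assumption: values contain no tabs or newlines.
    body = "\n".join("\t".join(str(v) for v in row) for row in rows).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "csv",
        # A unique label lets Doris reject an accidental replay of the same batch.
        "label": label or f"{table}_{uuid.uuid4().hex}",
    }
    return {"method": "PUT", "url": url, "headers": headers, "body": body}


request = build_stream_load_request(
    "fe.example.internal", "ods", "orders",       # hypothetical names
    [(1, "paid"), (2, "shipped")], "etl", "secret",
    label="orders_batch_1",
)
print(request["url"])
```

When the request is actually sent, the frontend normally redirects it to a backend node, so the HTTP client has to follow the redirect and resend the body. In practice the Flink Doris connector handles this rather than hand‑written code.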

Key System Functions

Data Ingestion: Users select MySQL instances, choose full‑database or table‑level ingestion, and the platform creates Flink CDC tasks that continuously sync binlog changes to Doris, handling DDL via online schema change.

Data Computation: Users write SQL scripts in the data‑easy platform; the system parses the source tables each script reads, generates the downstream tasks, and schedules them with DolphinScheduler. Daily ODS→DWD incremental jobs keep T+1 reports consistent.
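The article does not show how the platform derives task dependencies from SQL, so the following is only a rough sketch of the idea: extract each script's target and source tables, then order the scripts so that producers run before consumers. The regular expressions are a deliberate simplification (they ignore CTEs, comments, and quoted identifiers), the script names are hypothetical, and the sketch stops at computing an order rather than calling any scheduler API.

```python
import re
from graphlib import TopologicalSorter

_TARGET = re.compile(r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.I)
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def parse_script(sql):
    """Return (target_table, source_tables) for one INSERT ... SELECT script."""
    match = _TARGET.search(sql)
    if match is None:
        raise ValueError("script has no INSERT target")
    target = match.group(1).lower()
    sources = {name.lower() for name in _SOURCE.findall(sql)}
    sources.discard(target)  # a script reading its own target is not a dependency
    return target, sources


def schedule_order(scripts):
    """Order script names so each runs after the scripts that produce its sources."""
    parsed = {name: parse_script(sql) for name, sql in scripts.items()}
    producer = {target: name for name, (target, _) in parsed.items()}
    # Map each script to the set of scripts it must wait for.
    graph = {
        name: {producer[src] for src in sources if src in producer}
        for name, (_, sources) in parsed.items()
    }
    return list(TopologicalSorter(graph).static_order())


scripts = {
    "dws_sales": "INSERT INTO dws.sales SELECT * FROM dwd.orders o "
                 "JOIN dwd.items i ON o.id = i.order_id",
    "dwd_orders": "INSERT INTO dwd.orders SELECT * FROM ods.orders",
    "dwd_items": "INSERT INTO dwd.items SELECT * FROM ods.items",
}
print(schedule_order(scripts))
```

Tables that no script produces, such as the ODS tables fed by Flink CDC, simply have no upstream task in this model. A production implementation would more likely rely on a real SQL parser than on regular expressions.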

Problems and Experience

MySQL‑Doris type mismatches (e.g., Blob, Mediumint) required manual conversion.

DDL incompatibilities (e.g., unsigned bigint, AUTO_INCREMENT) needed adaptation.

Large‑table joins caused memory pressure on Doris BE nodes; newer Doris versions improved memory control.

Hive‑to‑Doris script migration revealed unsupported features such as LATERAL VIEW and certain UDFs.

Analysis‑function bugs (e.g., window functions) were resolved in later Doris releases.

Dynamic partitioning required pre‑defined date partitions.

Frequent Stream Load writes triggered error 235; batching and newer Doris versions mitigated the issue.
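The batching mitigation mentioned for error 235 can be sketched as a small buffer that accumulates rows and hands them to the loader in larger, less frequent batches. This illustrates the general technique, not the team's actual code: the `LoadBatcher` class and its thresholds are hypothetical, and `flush` stands for whatever function performs the Stream Load.

```python
import time


class LoadBatcher:
    """Buffer rows and flush them in larger, less frequent batches."""

    def __init__(self, flush, max_rows=50_000, max_wait_s=10.0, clock=time.monotonic):
        self._flush = flush            # callable that receives a list of rows
        self._max_rows = max_rows      # hypothetical threshold
        self._max_wait_s = max_wait_s  # hypothetical threshold
        self._clock = clock
        self._rows = []
        self._first_row_at = None

    def add(self, row):
        if not self._rows:
            self._first_row_at = self._clock()
        self._rows.append(row)
        # The age check only runs when a row arrives; a real writer would
        # also flush from a timer so that a quiet stream is not left waiting.
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_row_at >= self._max_wait_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)


batches = []
batcher = LoadBatcher(batches.append, max_rows=3, max_wait_s=3600)
for i in range(7):
    batcher.add(i)
batcher.flush()  # push out the remainder at shutdown or checkpoint time
print(batches)
```

Fewer, larger loads mean fewer small writes reaching the backend per unit of time, which is the effect the batching fix relies on.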

All of the issues above were reported to the Apache Doris community, which has addressed them; the fixes are slated for upcoming releases.
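For the type and DDL mismatches listed above, one common way to take the manual work out of table creation is an explicit mapping from MySQL column types to Doris types. The mapping below is an assumption for illustration, not the conversion table the team used: it widens unsigned integers to the next larger Doris integer type (BIGINT UNSIGNED to LARGEINT) and stores BLOB columns as STRING, which may require encoding binary content first. Unknown types raise an error so they can be reviewed by hand.

```python
import re

# Assumed mappings; only types that need special handling are listed.
_SIGNED = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "mediumint": "INT",
    "int": "INT", "integer": "INT", "bigint": "BIGINT",
    "tinyblob": "STRING", "blob": "STRING",
    "mediumblob": "STRING", "longblob": "STRING",
}
# Unsigned values can exceed the signed range, so widen by one step.
_UNSIGNED = {
    "tinyint": "SMALLINT", "smallint": "INT", "mediumint": "INT",
    "int": "BIGINT", "integer": "BIGINT", "bigint": "LARGEINT",
}


def to_doris_type(mysql_type):
    """Map a MySQL column type such as 'bigint(20) unsigned' to a Doris type."""
    tokens = mysql_type.strip().lower().split()
    base = re.match(r"[a-z]+", tokens[0]) if tokens else None
    if base is None:
        raise ValueError(f"cannot parse type {mysql_type!r}")
    table = _UNSIGNED if "unsigned" in tokens[1:] else _SIGNED
    name = base.group(0)
    if name not in table:
        raise ValueError(f"no automatic mapping for {mysql_type!r}")
    return table[name]


def strip_auto_increment(column_ddl):
    """Drop the AUTO_INCREMENT attribute from a MySQL column definition."""
    return re.sub(r"\s+auto_increment\b", "", column_ddl, flags=re.I)


print(to_doris_type("MEDIUMINT"))
print(to_doris_type("bigint(20) unsigned"))
print(strip_auto_increment("id bigint NOT NULL AUTO_INCREMENT"))
```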

Final Thoughts

Adopting Doris dramatically shortened report update times, cut query latency, saved tens of thousands of dollars in cluster expansion costs, and earned cross‑department recognition. The platform now supports rapid data‑source onboarding, zero‑code schema sync, and direct MySQL‑compatible data services for downstream visualization tools.

Images illustrating the architecture, ingestion UI, and workflow are included throughout the original article.
