Case Study: Building a Real‑Time Log Data Analysis Platform with Apache Doris at China Unicom
This article describes how China Unicom’s Western Innovation Research Institute designed and deployed a centralized, real‑time log analytics platform using Apache Doris, detailing the migration from Hive and ClickHouse, performance optimizations, storage cost reductions, and the resulting improvements in data ingestion, query speed, and operational efficiency.
China Unicom generates billions of log entries daily, which are critical for network security and system reliability. To manage and analyze this massive data, the Western Innovation Research Institute built a centralized log analysis platform that supports automated collection, storage, management, analysis, and visualization.
The first phase used an offline Hive data warehouse with Spark and DolphinScheduler, but suffered from high latency (up to 10 minutes) and limited concurrency. ClickHouse, used as the OLAP engine, also faced issues such as insufficient concurrency, poor multi‑table join performance, high update costs, and high operational overhead.
In the second phase, the team replaced Hive and ClickHouse with Apache Doris, creating a unified real‑time data warehouse. The architecture includes:
ODS layer: raw data ingested via Flume into Kafka and HDFS.
DWD layer: real‑time cleaning, standardization, and enrichment using Flink, stored in Doris with Duplicate Key model.
DWS layer: fine‑grained aggregation using dynamic rule engine.
ADS layer: business‑specific analysis using Doris Aggregate Key and Unique Key models.
Key capabilities and optimizations include:
High‑throughput ingestion using Doris Flink Connector (20‑30 k rows/second).
Flink checkpoint interval tuning to reduce version accumulation.
Doris BE parameter adjustments and increased CPU for compaction.
ZSTD compression, cold‑hot data tiering, and partition‑level replica settings, achieving ~50% storage cost savings.
Dynamic partitioning and bucketing to handle data growth.
Materialized views and Aggregate tables to accelerate queries across data sizes ranging from <100 GB to hundreds of TB, reducing query times to seconds or milliseconds.
These improvements enable near‑real‑time (minute‑level) and even millisecond‑level query responses, supporting over 30 business lines and hundreds of real‑time jobs, with daily log ingestion at the hundred‑billion level and petabyte‑scale storage.
The platform also benefits from new Doris 2.0 features such as inverted indexes and cold‑hot data layering, further enhancing query efficiency and reducing storage costs. Ongoing work includes testing inverted indexes and proposing enhancements to automatic bucket rules.
Overall, the migration to Apache Doris has delivered faster data ingestion, significant storage savings, and query performance improvements of more than tenfold, demonstrating the value of a modern, real‑time analytical database for large‑scale log analytics.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.