Evolution of 360 Commercial Real-Time Data Warehouse and Apache Doris Deployment
This article details the three‑stage evolution of 360's real‑time data warehouse—from Storm + Druid + MySQL to Flink + Druid + TiDB and finally to Flink + Apache Doris—explaining architectural pain points, the reasons for choosing Doris, and how the new system delivers sub‑second query latency, strong consistency, and simplified operations across advertising scenarios.
The 360 commercial data team needed a unified, high‑performance real‑time data warehouse for advertising, and evolved through three architectures: Storm + Druid + MySQL, Flink + Druid + TiDB, and Flink + Apache Doris.
First Generation Architecture
Storm acted as the stream processor, writing data to Druid for pre‑aggregation; aggregated results were then periodically synced to MySQL, which served as a materialized‑view layer for reporting. This design could not satisfy complex pagination or join queries, depended on those periodic MySQL syncs, and struggled to scale with growing data volume.
Second Generation Architecture
Storm was replaced by Flink, and MySQL by TiDB, allowing Flink‑processed data to be written to both Druid and TiDB. While timeliness improved and MySQL maintenance burdens decreased, new issues emerged: TiDB’s transaction overhead at large scale, Druid’s lack of standard SQL, and higher operational complexity due to maintaining two engines.
Third Generation Architecture
The team adopted Flink + Apache Doris as a unified OLAP engine. Doris provides a vectorized query engine, MySQL‑compatible SQL, materialized views, and low‑ops cluster management. Data flows from ODS → DWD → DWT → Doris via Stream Load; offline data is also accelerated by loading DWS layers into Doris via Broker Load.
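The real‑time path lands in Doris through Stream Load, which is an HTTP PUT against a frontend (FE) node with a `label` header that makes each batch idempotent. A minimal sketch of building such a request follows; the host, database, table, credentials, and default FE HTTP port 8030 are illustrative assumptions, not 360's actual configuration.

```python
import base64

def build_stream_load_request(fe_host, db, table, label, user, password,
                              column_separator=","):
    """Return the (url, headers) pair for a Doris Stream Load HTTP PUT."""
    url = f"http://{fe_host}:8030/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "label": label,                      # idempotency key for this batch
        "column_separator": column_separator,
        "Expect": "100-continue",            # lets the FE redirect to a BE
        "Authorization": f"Basic {token}",
    }
    return url, headers

# Hypothetical endpoint and table names for illustration only.
url, headers = build_stream_load_request(
    "doris-fe.example.com", "ads_dwt", "report_daily",
    label="flink-ckpt-42", user="etl", password="secret")
print(url)
```

The body of the PUT would carry the batch's rows; resubmitting the same label after a failure is rejected by Doris rather than double‑loaded, which is what the consistency section below relies on.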
Reasons for Choosing Doris
Materialized views align with advertising reporting dimensions, ensuring consistency and reducing maintenance costs.
SQL compatibility (MySQL protocol) allows developers, analysts, and product teams to query without learning new interfaces.
High‑performance vectorized engine, columnar storage, and MPP architecture deliver sub‑second query responses.
Automated cluster and replica management lower operational difficulty to near‑zero.
AB Experiment Platform Case Study
The AB experiment scenario generates billions of events per day at millions of operations per second. To avoid costly `LIKE` pattern matching across dozens of experiment tags packed into one field, the pipeline splits each event into per‑tag records, performs two‑level window aggregation in Flink, and then leverages Doris prefix indexes and materialized views, achieving sub‑second query latency for the majority of experiments.
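The tag‑splitting step can be sketched as a flat‑map: one incoming event carrying many experiment tags becomes one record per tag, so queries filter on an exact tag value (an equality predicate that a prefix index can serve) instead of scanning a packed tag string with `LIKE`. Field names below are illustrative assumptions.

```python
def split_by_tag(event):
    """Flat-map one event into per-experiment-tag records."""
    base = {k: v for k, v in event.items() if k != "exp_tags"}
    # Emit one copy of the event per tag, with a single exp_tag column
    # that the downstream Doris table can lead its sort key with.
    return [dict(base, exp_tag=tag) for tag in event["exp_tags"]]

event = {"uid": 1001, "ts": 1690000000, "clicks": 1,
         "exp_tags": ["exp_a", "exp_b", "exp_c"]}
records = split_by_tag(event)
print(len(records))  # one record per tag
```

The two‑level Flink window aggregation then rolls these per‑tag records up before they are loaded, so the fan‑out from splitting does not translate into a proportional write amplification on Doris.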
Data Consistency Guarantees
A custom Flink‑to‑Doris sink implements exactly‑once semantics by combining Doris's Stream Load label mechanism with Flink's two‑phase commit. When a checkpoint fails, the transaction is aborted; when a checkpoint succeeds, the transaction is committed; and batches that still fail to commit after retries are persisted to HDFS, from which they can be replayed manually.
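The core of the guarantee is that the Stream Load label is derived deterministically from the job and the Flink checkpoint id, so a batch replayed after recovery reuses the same label and Doris deduplicates it instead of double‑counting. A minimal sketch of that scheme, with illustrative names rather than 360's actual code:

```python
def stream_load_label(job_name, table, checkpoint_id):
    """Deterministic label: same checkpoint => same label => idempotent load."""
    return f"{job_name}-{table}-ckpt-{checkpoint_id}"

# After a failure, the restarted job replays checkpoint 57 and produces the
# identical label, so Doris rejects the duplicate commit rather than
# applying the batch twice.
first = stream_load_label("ab_report", "exp_metrics", 57)
retry = stream_load_label("ab_report", "exp_metrics", 57)
assert first == retry
print(first)
```

This is what makes the pre‑commit/commit split safe: even if the commit acknowledgment is lost and the sink retries, at most one copy of each checkpointed batch becomes visible.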
Cluster Monitoring
Monitoring uses community‑provided templates for cluster, host, and data‑processing metrics, supplemented with custom metrics for Stream Load volume, latency, and write speed. Alerts are integrated with the company’s notification system (phone, SMS, email, etc.).
Summary and Future Plans
Apache Doris now powers roughly 70% of 360's real‑time analytics, handling hundreds of billions of records per day with query latency under 1 s. Future work includes expanding Doris clusters, implementing resource isolation, extending Doris to more business scenarios, and deepening contributions to the open‑source community.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.