Big Data · 10 min read

Design and Implementation of a Real-Time Advertising Data Warehouse Using Flink and StarRocks

This article presents a case study of building a real-time advertising data warehouse at Auto Home. It covers the evaluation of streaming engines and storage solutions, the layered architecture design, implementation steps with Flink and StarRocks, monitoring practices, issues encountered along the way, and the future roadmap, showing how second-level data freshness and high accuracy were achieved.


Auto Home, a leading automotive website, generates massive volumes of advertising request, exposure, viewable-exposure, and click data every day, prompting the need for a real-time data warehouse to support rapid operational decisions.

The existing offline warehouse, in place since 2015, could not meet the latency requirements, so the team evaluated streaming engines (Storm, Spark Streaming, Flink) and storage options (StarRocks, ClickHouse, TiDB, Iceberg) to select the most suitable stack.

Storm was discarded because it only guarantees at-least-once delivery, while Flink was chosen over Spark Streaming for its native streaming model (rather than micro-batching), exactly-once semantics, and stronger platform support within the company.

Among the storage engines compared (ClickHouse, StarRocks, TiDB, and Iceberg), StarRocks was selected for its second-level query latency, support for both detail and pre-aggregation table models, and lower operational cost.

The real-time warehouse follows a four-layer OneData design: ODS (raw ingestion from Kafka and MySQL binlogs), DWD (detail layer with dimension enrichment and joins in Flink, persisted to StarRocks), DWA (aggregation layer built via ETL jobs or materialized views), and APP (business-oriented datasets for dashboards).
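The DWD step described above can be sketched as a Flink SQL lookup join. All table and column names here are illustrative assumptions, not Auto Home's actual schema:

```sql
-- Sketch only: ods_ad_click, dim_advertiser, and the columns are placeholders.
-- DWD layer: enrich raw click events from the ODS layer with advertiser
-- dimensions, then persist the joined detail rows to StarRocks.
INSERT INTO dwd_ad_click_detail
SELECT
    c.event_time,
    c.request_id,
    c.ad_id,
    d.advertiser_name,
    d.campaign_name,
    c.device_type
FROM ods_ad_click AS c
-- Processing-time lookup join against the dimension table.
LEFT JOIN dim_advertiser FOR SYSTEM_TIME AS OF c.proc_time AS d
    ON c.ad_id = d.ad_id;
```

A lookup join keeps the dimension data out of Flink state and fetches it per key, which fits slowly changing advertiser dimensions.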

Implementation details include defining Kafka source tables and StarRocks sink tables with native Flink DDL, configuring dynamic partitions, building materialized views in StarRocks for the DWA layer, and exposing the final datasets through an internal OLAP self‑service platform.
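A minimal sketch of those definitions follows. Topic names, columns, connection endpoints, and credentials are placeholders, not the production configuration:

```sql
-- Flink source table over the raw Kafka exposure stream (ODS layer).
CREATE TABLE ods_ad_exposure (
    request_id  STRING,
    ad_id       BIGINT,
    event_time  TIMESTAMP(3),
    proc_time AS PROCTIME()
) WITH (
    'connector' = 'kafka',
    'topic' = 'ad_exposure',
    'properties.bootstrap.servers' = 'kafka:9092',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);

-- Flink sink table writing detail rows to StarRocks (DWD layer).
CREATE TABLE dwd_ad_exposure (
    request_id  STRING,
    ad_id       BIGINT,
    event_time  TIMESTAMP(3)
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://starrocks-fe:9030',
    'load-url' = 'starrocks-fe:8030',
    'database-name' = 'ad_rt',
    'table-name' = 'dwd_ad_exposure',
    'username' = 'etl',
    'password' = '***',
    -- CSV load format; the article notes JSON loads overloaded the sink.
    'sink.properties.format' = 'csv'
);

-- DWA layer: a StarRocks materialized view pre-aggregating hourly exposures.
CREATE MATERIALIZED VIEW dwa_ad_exposure_hourly AS
SELECT ad_id, date_trunc('hour', event_time) AS event_hour, COUNT(*) AS exposure_cnt
FROM dwd_ad_exposure
GROUP BY ad_id, date_trunc('hour', event_time);
```

The materialized view lets StarRocks transparently rewrite APP-layer dashboard queries against the pre-aggregated rollup instead of scanning detail rows.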

Operational stability is ensured by monitoring Kafka connectors, Flink jobs, and StarRocks servers with Prometheus and Grafana, and by setting alerts for latency, task restarts, checkpoint failures, and resource usage.

Key issues encountered were JSON sink overload (resolved by switching the sink load format to CSV), occasional view errors (mitigated by periodically recreating the views and planning version upgrades), and cache window sizing (a 4-hour window achieved >95% accuracy, which met the real-time requirements).
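One plausible way the 4-hour cache window could be expressed, assuming it maps onto Flink SQL state retention (the article does not specify the mechanism), is via the state TTL option:

```sql
-- Assumption: bound join/deduplication state to a 4-hour retention window.
-- Older state is evicted, trading a small accuracy loss (<5% per the
-- article's measurements) for bounded memory and checkpoint size.
SET 'table.exec.state.ttl' = '4 h';
```

Bounding state this way keeps long-running streaming joins stable at the cost of missing matches that arrive more than four hours apart.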

In summary, the Flink + StarRocks framework reduced data latency from hours to seconds, processing over 100,000 records per second with >95% real-time accuracy; future work will explore StarRocks external tables and continued community engagement to further accelerate the advertising data pipeline.

Tags: big data, Flink, streaming, StarRocks, real-time data warehouse, advertising analytics
Written by HomeTech (HomeTech tech sharing)