
Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

58 Tech

58 Tongcheng Bao, a leading lifestyle service platform, processes massive volumes of advertising and user data and needs a real‑time data warehouse to support fast decision‑making.

The early offline warehouse ran batch ETL, but growing demand for instant insights led the team to build a real‑time warehouse. Version 1.0 was built on Spark Streaming: it consumed Kafka streams and wrote results to Druid.

Version 1.0 suffered from micro‑batch latency, processing‑time joins that lost data, high task maintenance costs, and an inflexible data‑layer structure.
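One way a processing‑time join can lose data is when the two sides of a match arrive in different micro‑batches. The following is a minimal Python simulation of that failure mode, not the team's Spark code; the record layout `(arrival_time_seconds, ad_id)`, the function name, and the 5‑second batch size are assumptions made for illustration.

```python
from collections import defaultdict


def micro_batch_join(impressions, clicks, batch_seconds):
    """Join clicks to impressions that arrive in the same processing-time batch.

    Each record is (arrival_time_seconds, ad_id). A click whose impression
    arrived in a different batch finds no partner and is dropped.
    """
    # batch_id -> (impression ad_ids, click ad_ids)
    batches = defaultdict(lambda: ([], []))
    for arrival, ad_id in impressions:
        batches[int(arrival // batch_seconds)][0].append(ad_id)
    for arrival, ad_id in clicks:
        batches[int(arrival // batch_seconds)][1].append(ad_id)

    joined = []
    for batch_id in sorted(batches):
        shown, clicked = batches[batch_id]
        shown_set = set(shown)
        joined.extend(ad_id for ad_id in clicked if ad_id in shown_set)
    return joined


# Hypothetical sample: the click on ad-2 arrives one second after its
# impression, but on the other side of the 5-second batch boundary.
impressions = [(1.0, "ad-1"), (4.5, "ad-2")]
clicks = [(2.0, "ad-1"), (5.5, "ad-2")]
joined = micro_batch_join(impressions, clicks, batch_seconds=5)
print(joined)
```

In this sample only the `ad-1` click is joined. Enlarging the batch recovers the `ad-2` match but raises latency, which is the trade‑off that a stateful event‑time join avoids.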

To overcome these issues, version 2.0 replaced Spark Streaming with Flink and introduced a four‑layer architecture: ODS (raw), DWD (detail), DWS (summary), and APP (service), each organized by business domain.
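The responsibilities of the four layers can be sketched as a chain of small functions. This is an illustrative Python model only: the production system runs on Flink, and the field names (`domain`, `ad_id`, `type`), the JSON input format, and the click‑through metric are assumptions, not details taken from the article.

```python
import json
from collections import Counter


def ods_to_dwd(raw_lines):
    """ODS -> DWD: parse raw lines, drop malformed records, normalize fields."""
    detail = []
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed raw record stays in ODS only
        if not isinstance(event, dict) or "ad_id" not in event or "type" not in event:
            continue
        detail.append({
            "domain": event.get("domain", "unknown"),
            "ad_id": str(event["ad_id"]),
            "type": str(event["type"]).lower(),
        })
    return detail


def dwd_to_dws(detail):
    """DWD -> DWS: count events per (domain, ad_id, type)."""
    return Counter((r["domain"], r["ad_id"], r["type"]) for r in detail)


def dws_to_app(summary):
    """DWS -> APP: shape the summary into rows a dashboard could query."""
    app = {}
    for (domain, ad_id, kind), count in summary.items():
        row = app.setdefault((domain, ad_id), {"impression": 0, "click": 0})
        if kind in row:
            row[kind] += count
    for row in app.values():
        row["ctr"] = row["click"] / row["impression"] if row["impression"] else None
    return app


# Hypothetical raw records for one business domain.
raw_lines = [
    '{"domain": "ads", "ad_id": 1, "type": "IMPRESSION"}',
    '{"domain": "ads", "ad_id": 1, "type": "impression"}',
    '{"domain": "ads", "ad_id": 1, "type": "click"}',
    'not json',
]
app = dws_to_app(dwd_to_dws(ods_to_dwd(raw_lines)))
print(app)
```

Keeping each layer behind its own function mirrors the point of the layering: a new APP‑level metric can reuse the DWD and DWS outputs instead of re‑reading the raw stream.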

Data‑quality management was added, covering completeness, consistency, timeliness, and accuracy, with full‑life‑cycle monitoring and alerting for each processing stage.
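A per‑stage monitor along these lines might look like the sketch below. It covers three of the four dimensions (completeness, timeliness, accuracy) and leaves consistency out for brevity. The `StageStats` fields, the thresholds, and the idea of comparing against an offline reference value are all assumptions for illustration; the article does not specify how the checks are implemented.

```python
from dataclasses import dataclass


@dataclass
class StageStats:
    name: str
    records_in: int
    records_out: int
    max_event_lag_seconds: float


def check_stage(stats, min_completeness=0.99, max_lag_seconds=5.0):
    """Return alert messages for one processing stage.

    Completeness here is records_out / records_in, which only makes sense
    for stages expected to emit one record per input record.
    """
    alerts = []
    completeness = stats.records_out / stats.records_in if stats.records_in else 1.0
    if completeness < min_completeness:
        alerts.append(
            f"{stats.name}: completeness {completeness:.2%} below {min_completeness:.0%}"
        )
    if stats.max_event_lag_seconds > max_lag_seconds:
        alerts.append(
            f"{stats.name}: lag {stats.max_event_lag_seconds:.1f}s above {max_lag_seconds:.1f}s"
        )
    return alerts


def check_accuracy(realtime_value, offline_value, tolerance=0.01):
    """True if the real-time metric is within a relative tolerance of a reference."""
    if offline_value == 0:
        return realtime_value == 0
    return abs(realtime_value - offline_value) / abs(offline_value) <= tolerance


# Hypothetical stage: 2% of records missing and lag above the threshold.
stats = StageStats("dwd_clicks", records_in=1000, records_out=980,
                   max_event_lag_seconds=7.5)
alerts = check_stage(stats)
for message in alerts:
    print(message)
```

Running the same check after every stage, rather than only at the end, is what makes it possible to tell which stage introduced a gap.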

On Flink, the team implemented dual‑stream joins and interval joins. This sharply reduced the number of real‑time tasks, eliminated state‑ordering problems, and brought accuracy above 99% for key metrics such as click‑through and cash flow.
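The semantics of an event‑time interval join can be shown with a small Python model: a left record with timestamp `t` matches right records of the same key whose timestamps fall in `[t + lower, t + upper]`. This is a batch simulation of the matching rule only, under assumed record shapes and a hypothetical 600‑second bound; a streaming engine would instead buffer both sides in keyed state and use watermarks to decide when buffered records can be discarded.

```python
from collections import defaultdict


def interval_join(impressions, clicks, lower_seconds, upper_seconds):
    """Pair each impression with same-key clicks inside an event-time interval.

    Each record is (event_time_seconds, ad_id). A click matches when its
    event time lies in [impression_time + lower, impression_time + upper].
    """
    clicks_by_key = defaultdict(list)
    for click_ts, key in clicks:
        clicks_by_key[key].append(click_ts)

    pairs = []
    for imp_ts, key in impressions:
        for click_ts in clicks_by_key[key]:
            if imp_ts + lower_seconds <= click_ts <= imp_ts + upper_seconds:
                pairs.append((key, imp_ts, click_ts))
    return pairs


# Hypothetical sample: clicks are listed out of order, and one click on
# ad-1 happens long after the impression and falls outside the interval.
impressions = [(1.0, "ad-1"), (4.5, "ad-2")]
clicks = [(5.5, "ad-2"), (2.0, "ad-1"), (700.0, "ad-1")]
pairs = interval_join(impressions, clicks, lower_seconds=0, upper_seconds=600)
print(pairs)
```

Because matching depends on event timestamps rather than arrival order, reordering the input lists does not change which pairs are produced.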

Benchmark results show the new warehouse delivers latency on the order of seconds and high data accuracy, while the architecture supports scalable, maintainable development.

The team plans to continue expanding domain coverage, refining Flink usage, and advancing data‑intelligence capabilities.

Tags: Big Data, Flink, Stream Processing, Kafka, Data Quality, Real‑time Data Warehouse, Spark Streaming