
Streaming‑Batch Integrated Real‑time Multi‑dimensional Analysis

This article surveys the evolution of big‑data architectures, from classic offline warehouses to the Lambda and Kappa models, and details a streaming‑batch integrated solution that addresses latency, data freshness, and multi‑table join challenges to achieve minute‑level real‑time multi‑dimensional analytics.

DataFunSummit

Introduction – The session introduces a streaming‑batch integrated real‑time multi‑dimensional analysis solution, outlining four parts: evolution of big‑data architectures, business pain points and the integrated solution, implementation challenges, and future plans.

1. Big‑Data Architecture Evolution

Classic offline data‑warehouse architecture consists of layers ODS → DWD → DWS → ADS, offering simplicity and low cost but suffering from data latency, lack of real‑time data, and excessive table counts.

Lambda architecture adds a Speed layer (real‑time) alongside the Batch layer, built on Kafka with Storm, Spark Streaming, or Flink, plus a Serving layer that merges the results of both. Its drawbacks include duplicated code bases, higher resource consumption, and results that diverge between the batch and speed paths.
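To make the Serving‑layer merge concrete, here is a minimal sketch. It assumes hypothetical page‑view counts keyed by page name; the function `merge_views` and the sample data are illustrative and do not come from the talk. The batch view covers everything up to the last batch run, the speed view covers events since then, and the two are summed at query time.

```python
from collections import Counter


def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Combine a batch view and a speed view by summing counts per key."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts instead of replacing
    return dict(merged)


# Hypothetical data: counts from the last batch run and from the stream since.
batch_view = {"home": 100, "cart": 40}
speed_view = {"home": 3, "checkout": 1}
print(merge_views(batch_view, speed_view))
```

The divergence problem mentioned above shows up here as well: if the two views are computed by different code bases, nothing guarantees that they count the same event in the same way.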

Kappa architecture removes the Batch layer, using a single code path for both real‑time and offline processing, but faces challenges with data back‑fill, high throughput, and costly schema migrations.

2. Streaming‑Batch Integrated Solution

The legacy architecture (a Lambda‑style setup) suffers from numerous ODS tables, heavy join complexity, and weak real‑time analysis capabilities.

The new solution adopts a hybrid Lambda‑Kappa design, keeping data source and storage unchanged while revamping data cleaning and warehousing. Fields are routed to real‑time or offline streams based on latency requirements, merging into a single wide table with minute‑level versions. This reduces table count, simplifies queries, and achieves 5‑20 minute end‑to‑end latency, meeting minute‑level ingestion and second‑level query needs.
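The field routing described above might look roughly like the following sketch. Everything named here is an assumption for illustration: the `FieldSpec` type, the 20‑minute threshold, the field names, and the version string format are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    max_latency_minutes: int  # how stale this field is allowed to be


REALTIME_THRESHOLD_MINUTES = 20  # assumed cut-off, not from the talk


def route_fields(specs):
    """Split fields into the real-time and offline paths by latency budget."""
    realtime = [s.name for s in specs
                if s.max_latency_minutes <= REALTIME_THRESHOLD_MINUTES]
    offline = [s.name for s in specs
               if s.max_latency_minutes > REALTIME_THRESHOLD_MINUTES]
    return realtime, offline


def build_wide_row(key, realtime_values, offline_values, version):
    """Merge values from both paths into one wide-table row with a version tag."""
    row = {"key": key, "version": version}
    row.update(offline_values)
    row.update(realtime_values)
    return row


specs = [FieldSpec("order_status", 5), FieldSpec("user_segment", 1440)]
realtime, offline = route_fields(specs)
row = build_wide_row("order-1", {"order_status": "paid"},
                     {"user_segment": "vip"}, version="202401011005")
print(realtime, offline, row)
```

Because both paths land in the same row, a query reads one table instead of joining a real‑time table with an offline one.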

3. Key Problem Breakthroughs

DB data update issue – Binlog changes are captured, written to a message queue, and processed into minute‑level Delta files. A Copy‑On‑Write approach merges Base and Delta files every 5 minutes, favoring query performance over ingestion speed.
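A Copy‑On‑Write merge can be sketched as follows. This is a simplified in‑memory model, assuming that a Base file is a mapping from primary key to row and that a Delta file is an ordered list of change records; the record layout (`op`, `key`, `row`) is hypothetical and not the format used in the talk.

```python
def merge_copy_on_write(base: dict, delta: list) -> dict:
    """Apply Delta changes to a copy of Base and return the new snapshot.

    The original Base is never modified, so queries running against the
    old snapshot keep seeing consistent data while the merge is in flight.
    """
    new_base = dict(base)  # copy first; rows are replaced, never mutated
    for change in delta:   # changes are applied in binlog order
        if change["op"] == "delete":
            new_base.pop(change["key"], None)
        else:  # "insert" and "update" both overwrite the row for that key
            new_base[change["key"]] = change["row"]
    return new_base


base = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}
delta = [
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"id": 3, "status": "new"}},
]
snapshot = merge_copy_on_write(base, delta)
print(snapshot)
```

The trade‑off named above is visible in the sketch: every merge rewrites the whole snapshot, which costs write work, but readers get a fully merged file and never have to reconcile Base and Delta at query time.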

Multi‑table join issue – The join runs in three steps. Because each Delta file is small, it can be held in memory and MapJoined against the large Base files; this yields temporary Delta and Base files, which are then combined into the final queryable version.
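The summary does not spell out the three steps, so the sketch below illustrates only the MapJoin primitive itself: the small side is loaded into an in‑memory hash map and the large side is streamed past it. The function name, row layout, and sample data are assumptions for illustration.

```python
def map_join(small_rows, large_rows, key):
    """Inner-join two row collections on `key` using an in-memory hash map.

    `small_rows` must fit in memory; `large_rows` can be any iterable and
    is read exactly once, so no shuffle or sort of the large side is needed.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for big in large_rows:
        for small in lookup.get(big[key], []):
            yield {**big, **small}  # on name clashes the small side wins


# Hypothetical data: a small Delta of orders and a large Base of users.
delta_orders = [{"user_id": 1, "order_id": "A"}, {"user_id": 3, "order_id": "B"}]
base_users = [{"user_id": n, "city": f"city-{n}"} for n in range(1, 1001)]
joined = list(map_join(delta_orders, base_users, "user_id"))
print(joined)
```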

DB‑log join issue – Logs are streamed, and DB data is cached for fast lookup during stream processing; hot data is merged frequently while cold data is merged daily to balance cost and latency.
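The cached lookup during stream processing can be sketched like this. It is a minimal model, assuming a hypothetical `loader` callable that stands in for whatever storage backs the cache; the class and function names are illustrative, and cache eviction and refresh are left out.

```python
class DimensionCache:
    """Read-through cache for DB rows used to enrich log events."""

    def __init__(self, loader):
        self._loader = loader  # called only on a cache miss
        self._cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._loader(key)
        return self._cache[key]


def enrich(log_events, cache, key_field):
    """Attach cached DB attributes to each log event as it streams past."""
    for event in log_events:
        dim = cache.get(event[key_field]) or {}  # unknown keys add nothing
        yield {**event, **dim}


# Hypothetical DB table and log stream.
users_table = {1: {"city": "Beijing"}, 2: {"city": "Shanghai"}}
cache = DimensionCache(users_table.get)
logs = [{"user_id": 1, "action": "click"},
        {"user_id": 1, "action": "buy"},
        {"user_id": 9, "action": "click"}]
enriched = list(enrich(logs, cache, "user_id"))
print(enriched, cache.misses)
```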

Data availability timing – Wide tables are versioned per field; time‑sensitive fields are updated every minute, while less critical fields follow T+1 or T+2 schedules.
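One way to picture per‑field versioning is a store in which each field keeps its own version history, and a read at a given version takes, for every field, the newest value at or before that version. The sketch below assumes integer version numbers (for example, minutes since some epoch) and hypothetical field names; it is not the storage layout described in the talk.

```python
from bisect import bisect_right


class VersionedField:
    """History of one wide-table field, ordered by version."""

    def __init__(self):
        self._versions = []  # sorted ascending
        self._values = []    # _values[i] belongs to _versions[i]

    def put(self, version: int, value):
        i = bisect_right(self._versions, version)
        self._versions.insert(i, version)
        self._values.insert(i, value)

    def get(self, as_of: int):
        """Return the newest value with version <= as_of, or None."""
        i = bisect_right(self._versions, as_of)
        return self._values[i - 1] if i else None


def read_row(fields: dict, as_of: int) -> dict:
    """Assemble one row from fields that advance on different schedules."""
    return {name: field.get(as_of) for name, field in fields.items()}


fields = {"order_status": VersionedField(), "user_segment": VersionedField()}
fields["order_status"].put(100, "new")    # updated every minute
fields["order_status"].put(101, "paid")
fields["user_segment"].put(0, "regular")  # updated on a T+1 schedule
result = read_row(fields, as_of=101)
print(result)
```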

4. Summary and Planning

Architecture selection should solve concrete business problems with minimal cost. Future work includes continuous query engine performance optimization and improving the user experience of upstream query tools.

Q&A

Answers clarify that real‑time and offline data are stored in a single versioned wide table, Delta files are indeed joined with Base files using a three‑step process, high‑performance caching relies on file‑system storage for cost reasons, and the data warehouse uses an internally developed engine.

Tags: Big Data, real-time analytics, batch processing, Streaming, Data Warehouse, Lambda architecture, Kappa architecture