
Unify SQL Engine: Integrating Stream, Batch, and Online Computing for Data Warehousing

The article describes how fragmented real‑time, batch, and online data‑warehouse pipelines suffer from low productivity and inconsistent data quality. It introduces a unified SQL engine built on Apache Calcite that parses, optimizes, and compiles a single SQL statement into executable plans for ODPS, Flink, or Java, leveraging Janino code generation, pluggable state storage, and snapshot‑join semantics to boost performance and simplify development.

DaTaobao Tech

This article discusses the challenges of building a data warehouse that must combine real‑time and offline data processing, highlighting low development efficiency, uncontrolled data quality, and cumbersome data service interfaces.

In current practice, real‑time warehouses are built with Blink/Flink, offline warehouses with ODPS SQL, and online services with Java, resulting in three separate data models and codebases.

Key problems include incompatibility between stream and batch SQL standards, difficulty invoking HSF interfaces from ODPS/Flink, and the need for separate Java development for interactive online queries.

The author first examines a unified stream‑batch solution based on Flink and the Kappa architecture, noting its limitations: no support for online interactive queries, and lower throughput than ODPS for large batch jobs.

Two integration approaches are examined: (1) using a single engine for stream and batch while adding separate Java services for online tasks, and (2) having Flink expose a compute‑request API paired with a polling API for retrieving results.
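Approach (2) can be sketched as a submit/poll pair: the caller submits a compute request and later polls for the result by request id. This is a minimal illustrative sketch; the class and method names (`ComputeGateway`, `submit`, `poll`) are assumptions, not a real Flink API, and here the request completes immediately rather than being handed to a streaming job.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the compute-request + polling pattern.
class ComputeGateway {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Submit a compute request; returns an id the caller can poll with.
    String submit(String sql) {
        String requestId = UUID.randomUUID().toString();
        // A real system would hand the SQL to the streaming engine here;
        // this sketch records a result immediately.
        results.put(requestId, "result-of: " + sql);
        return requestId;
    }

    // Poll for a result; returns null while the computation is still running.
    String poll(String requestId) {
        return results.get(requestId);
    }
}
```

The drawback the article implies is visible even in the sketch: the caller must keep polling, and every online query pays the round‑trip cost through the streaming engine.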

To address these issues, a Unify SQL Engine is built on Apache Calcite, providing SQL parsing, validation, optimization, and translation into execution plans for ODPS, Flink, or Java environments. The engine also employs Janino for code generation, significantly improving execution speed.
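The flow above (parse, validate/optimize, translate per target runtime) can be sketched as a tiny pipeline. The real engine uses Calcite for the SQL front end and Janino to compile generated Java; the stage names below mirror that flow, but the classes (`UnifySqlEngine`, `TargetRuntime`) and the string-based "plans" are illustrative assumptions, not Calcite's actual API.

```java
// Which runtime the optimized plan should be translated for.
enum TargetRuntime { ODPS, FLINK, JAVA }

// Hypothetical sketch of the parse -> optimize -> translate pipeline.
class UnifySqlEngine {
    // Stand-in for Calcite parsing; returns a trivial "AST".
    String parse(String sql) {
        String trimmed = sql.trim();
        if (!trimmed.toUpperCase().startsWith("SELECT")) {
            throw new IllegalArgumentException("only SELECT supported in this sketch");
        }
        return trimmed;
    }

    // Stand-in for rule- and cost-based plan rewrites.
    String optimize(String ast) {
        return ast;
    }

    // Translate one optimized plan into a runtime-specific task description.
    String translate(String plan, TargetRuntime target) {
        switch (target) {
            case ODPS:  return "odps-job:" + plan;
            case FLINK: return "flink-job:" + plan;
            default:    return "java-task:" + plan; // Janino-compiled in the real engine
        }
    }

    String compile(String sql, TargetRuntime target) {
        return translate(optimize(parse(sql)), target);
    }
}
```

The key point is that a single SQL statement passes through one shared front end, and only the final translation step differs per runtime.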

Stateful computation is supported via three back‑ends—memory, Redis, and HBase—allowing developers to choose the appropriate storage based on scale and latency requirements.
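A pluggable back‑end like this is typically one interface with interchangeable implementations. The sketch below shows the shape under that assumption: only the in‑memory variant is implemented, and the Redis and HBase variants would wrap their respective clients behind the same interface. The `StateStore` name and methods are illustrative, not the engine's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common interface for the three state back-ends.
interface StateStore {
    void put(String key, String value);
    String get(String key);
}

// In-memory back-end: fastest, but bounded by heap and lost on restart.
class MemoryStateStore implements StateStore {
    private final Map<String, String> data = new HashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
}

// A RedisStateStore or HBaseStateStore would implement the same interface,
// trading latency for larger-than-memory, durable state.
```

Keeping the interface narrow is what lets a developer switch from memory to Redis or HBase as state size grows, without touching the generated computation code.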

The engine implements snapshot join semantics to avoid repeated joins on changing dimension tables, ensuring correct real‑time behavior.
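One common way to get snapshot‑join semantics is to version each dimension row by timestamp and have every fact event look up the version that was current at its event time, so later dimension updates never retroactively change earlier join results. The sketch below illustrates that idea; the class and method names are assumptions, not the engine's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical versioned dimension table for snapshot joins.
class SnapshotDimensionTable {
    // key -> (version timestamp -> value)
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // Record a new version of a dimension row, effective from `timestamp`.
    void upsert(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Return the value visible at eventTime: the latest version <= eventTime.
    String lookupAsOf(String key, long eventTime) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, String> entry = history.floorEntry(eventTime);
        return entry == null ? null : entry.getValue();
    }
}
```

With this structure, a fact event timestamped before a dimension update still joins against the old value, which is exactly the correctness property repeated joins on a mutable table would violate.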

Finally, the Unify SQL Engine can translate a single SQL statement into tasks for different runtimes, enabling developers to select the most suitable engine (ODPS for massive offline jobs, Flink for low‑latency streaming, Java for complex logic) and improve overall resource utilization.

The authors belong to the Taobao marketing tools team, responsible for large‑scale data processing and real‑time computation in e‑commerce scenarios.

Tags: code generation, Flink, stream processing, batch processing, data warehouse, Calcite, SQL engine
Written by DaTaobao Tech, the official account of DaTaobao Technology.