
Unify SQL Engine: Integrating Stream, Batch, and Online Computing for Data Warehousing

The article describes how fragmented real‑time, batch, and online data‑warehouse pipelines suffer from low productivity and inconsistent data quality. It introduces a unified SQL engine built on Apache Calcite that parses, optimizes, and compiles a single SQL statement into executable plans for ODPS, Flink, or Java, leveraging Janino code generation, pluggable state storage, and snapshot‑join semantics to boost performance and simplify development.

DaTaobao Tech

This article discusses the challenges of building a data warehouse that must combine real‑time and offline data processing, highlighting low development efficiency, uncontrolled data quality, and cumbersome data service interfaces.

In current practice, real‑time warehouses are built with Blink/Flink, offline warehouses with ODPS SQL, and online services with Java, resulting in three separate data models and codebases.

Key problems include incompatibility between stream and batch SQL standards, difficulty invoking HSF interfaces from ODPS/Flink, and the need for separate Java development for interactive online queries.

The author first examines a unified stream‑batch solution based on Flink and the Kappa architecture, noting its limitations: no support for online interactive queries, and lower throughput than ODPS for large batch jobs.

Two integration approaches are examined: (1) using a single engine for stream and batch while adding separate Java services for online tasks, and (2) having Flink expose a compute‑request API paired with a polling API for retrieving results.
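Approach (2) can be sketched as a submit/poll pair: the caller submits a compute request and later polls for the result by request id. This is a minimal illustrative sketch; the class and method names (`ComputeGateway`, `submit`, `poll`) are assumptions, not a real Flink API, and here the request completes immediately rather than being handed to a streaming job.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the compute-request + polling pattern.
class ComputeGateway {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Submit a compute request; returns an id the caller can poll with.
    String submit(String sql) {
        String requestId = UUID.randomUUID().toString();
        // A real system would hand the SQL to the streaming engine here;
        // this sketch records a result immediately.
        results.put(requestId, "result-of: " + sql);
        return requestId;
    }

    // Poll for a result; returns null while the computation is still running.
    String poll(String requestId) {
        return results.get(requestId);
    }
}
```

The drawback the article implies is visible even in the sketch: the caller must keep polling, and every online query pays the round‑trip cost through the streaming engine.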

To address these issues, a Unify SQL Engine is built on Apache Calcite, providing SQL parsing, validation, optimization, and translation into execution plans for ODPS, Flink, or Java environments. The engine also employs Janino for code generation, significantly improving execution speed.
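The flow above (parse, validate/optimize, translate per target runtime) can be sketched as a tiny pipeline. The real engine uses Calcite for the SQL front end and Janino to compile generated Java; the stage names below mirror that flow, but the classes (`UnifySqlEngine`, `TargetRuntime`) and the string-based "plans" are illustrative assumptions, not Calcite's actual API.

```java
// Which runtime the optimized plan should be translated for.
enum TargetRuntime { ODPS, FLINK, JAVA }

// Hypothetical sketch of the parse -> optimize -> translate pipeline.
class UnifySqlEngine {
    // Stand-in for Calcite parsing; returns a trivial "AST".
    String parse(String sql) {
        String trimmed = sql.trim();
        if (!trimmed.toUpperCase().startsWith("SELECT")) {
            throw new IllegalArgumentException("only SELECT supported in this sketch");
        }
        return trimmed;
    }

    // Stand-in for rule- and cost-based plan rewrites.
    String optimize(String ast) {
        return ast;
    }

    // Translate one optimized plan into a runtime-specific task description.
    String translate(String plan, TargetRuntime target) {
        switch (target) {
            case ODPS:  return "odps-job:" + plan;
            case FLINK: return "flink-job:" + plan;
            default:    return "java-task:" + plan; // Janino-compiled in the real engine
        }
    }

    String compile(String sql, TargetRuntime target) {
        return translate(optimize(parse(sql)), target);
    }
}
```

The key point is that a single SQL statement passes through one shared front end, and only the final translation step differs per runtime.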

Stateful computation is supported via three back‑ends—memory, Redis, and HBase—allowing developers to choose the appropriate storage based on scale and latency requirements.
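A pluggable back‑end like this is typically one interface with interchangeable implementations. The sketch below shows the shape under that assumption: only the in‑memory variant is implemented, and the Redis and HBase variants would wrap their respective clients behind the same interface. The `StateStore` name and methods are illustrative, not the engine's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common interface for the three state back-ends.
interface StateStore {
    void put(String key, String value);
    String get(String key);
}

// In-memory back-end: fastest, but bounded by heap and lost on restart.
class MemoryStateStore implements StateStore {
    private final Map<String, String> data = new HashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
}

// A RedisStateStore or HBaseStateStore would implement the same interface,
// trading latency for larger-than-memory, durable state.
```

Keeping the interface narrow is what lets a developer switch from memory to Redis or HBase as state size grows, without touching the generated computation code.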

The engine implements snapshot join semantics to avoid repeated joins on changing dimension tables, ensuring correct real‑time behavior.
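One common way to get snapshot‑join semantics is to version each dimension row by timestamp and have every fact event look up the version that was current at its event time, so later dimension updates never retroactively change earlier join results. The sketch below illustrates that idea; the class and method names are assumptions, not the engine's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical versioned dimension table for snapshot joins.
class SnapshotDimensionTable {
    // key -> (version timestamp -> value)
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // Record a new version of a dimension row, effective from `timestamp`.
    void upsert(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Return the value visible at eventTime: the latest version <= eventTime.
    String lookupAsOf(String key, long eventTime) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, String> entry = history.floorEntry(eventTime);
        return entry == null ? null : entry.getValue();
    }
}
```

With this structure, a fact event timestamped before a dimension update still joins against the old value, which is exactly the correctness property repeated joins on a mutable table would violate.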

Finally, the Unify SQL Engine can translate a single SQL statement into tasks for different runtimes, enabling developers to select the most suitable engine (ODPS for massive offline jobs, Flink for low‑latency streaming, Java for complex logic) and improve overall resource utilization.

The authors belong to the Taobao marketing tools team, responsible for large‑scale data processing and real‑time computation in e‑commerce scenarios.

Tags: code generation, Flink, stream processing, batch processing, data warehouse, Calcite, SQL engine
Written by DaTaobao Tech, the official account of DaTaobao Technology.