Apache Calcite for Multi‑Engine Metric Management: Practices & Roadmap
This article explains the technical principles and best practices of multi‑engine metric management based on Apache Calcite, covering common metric management methods, implementation details of unified SQL, virtual columns, and SQL defined functions, and outlines ByteDance’s future roadmap for extending these capabilities.
Introduction
Maintaining the same metric logic across Hive, Spark, and Presto is a recurring source of cost and error. We start with three questions that most data engineers will recognize, then walk through the common approaches, ByteDance's practices, the underlying implementation, and future plans.
Three Key Questions
Do you notice function name differences across Spark and Presto, such as instr vs strpos?
Are you confused about the keys and meanings within large map or JSON fields?
Do you repeatedly copy the same CASE WHEN logic across queries?
Common Metric Management Methods
Metrics in OLAP are simply SQL queries over table columns. Typical solutions for handling engine differences include manually adjusting SQL per engine, modifying Presto to align with Hive, or using Apache Calcite to automatically rewrite SQL. The Calcite‑based approach is recommended.
Integer Division Differences
Different engines return different types for integer division. For example, a click‑through rate computed as clicks / impressions yields an int in Presto but a double in Hive. Three common solutions are:
Manually change SQL per engine (hard to maintain).
Develop custom Presto extensions (intrusive).
Use Calcite to automatically rewrite SQL (recommended).
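The Calcite approach can be pictured as a type-aware rewrite of the submitted query. The sketch below is illustrative (table and column names are invented; the exact rewrite depends on the rules configured in Calcite), but it shows the kind of transformation involved:

```sql
-- Unified SQL, written once:
SELECT clicks / impressions AS ctr FROM ad_stats;

-- Rewritten for Presto, where "/" on two integers is integer division
-- and would truncate the rate to 0:
SELECT CAST(clicks AS DOUBLE) / impressions AS ctr FROM ad_stats;

-- Hive already returns a double for "/", so the query can pass through unchanged.
```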
Handling Large MAP/JSON Fields
When business needs require rapid metric access, large MAP or JSON columns become hard to maintain. Traditional solutions are registering metadata services outside Hive or maintaining internal documentation, both of which add overhead. Virtual columns provide a low‑cost alternative.
Frequent CASE WHEN Changes
Repeatedly updating CASE WHEN fragments leads to manual notifications, high maintenance, or reliance on UDFs, enumeration, or views—all of which have drawbacks. ByteDance introduces SQL Defined Function to encapsulate reusable logic without copying code.
Best Practices: Two‑Syntax SQL Solution
ByteQuery SQL offers a unified syntax across engines such as Hive, Spark, and Presto, automatically translating functions (e.g., instr ↔ strpos) and adjusting data types (int ↔ double). This reduces metric management cost and improves development efficiency.
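As a concrete sketch of the function translation (query and identifiers are illustrative, not from the article):

```sql
-- Unified SQL, written once against the unified syntax:
SELECT instr(url, 'utm_source') AS pos FROM page_views;

-- Output for Presto, where the equivalent function is strpos:
SELECT strpos(url, 'utm_source') AS pos FROM page_views;

-- Output for Hive/Spark keeps instr, since that is the native name there.
```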
Virtual Column Practice
Virtual columns act as column‑level views, avoiding the redundancy of full table views. For example, a virtual column virtual_age defined as age + 1 is rewritten at execution time to the underlying expression.
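A minimal sketch of the virtual_age example, assuming a hypothetical DDL syntax (the article does not show ByteQuery's exact statement for registering a virtual column):

```sql
-- Hypothetical registration of the virtual column:
ALTER TABLE users ADD VIRTUAL COLUMN virtual_age AS age + 1;

-- Queries reference the virtual column as if it were a physical one:
SELECT virtual_age FROM users;

-- At execution time the reference is rewritten to the underlying expression:
SELECT age + 1 AS virtual_age FROM users;
```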
SQL Defined Function Practice
SQL Defined Function allows defining reusable SQL expressions as functions. In advertising, a function classify maps industry IDs to categories, eliminating the need to copy large CASE WHEN blocks.
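The classify example might look like the following sketch; the definition syntax, IDs, and category names are hypothetical, since the article only describes the idea:

```sql
-- Hypothetical SQL Defined Function encapsulating the mapping once:
CREATE SQL FUNCTION classify(industry_id BIGINT) AS
  CASE
    WHEN industry_id IN (1001, 1002) THEN 'e-commerce'
    WHEN industry_id IN (2001, 2002) THEN 'gaming'
    ELSE 'other'
  END;

-- Call sites stay short instead of copying the CASE WHEN block:
SELECT classify(industry_id) AS category, SUM(cost) AS total_cost
FROM ad_spend
GROUP BY classify(industry_id);
```

When the mapping changes, only the function body is updated; every query that calls classify picks up the new logic without edits.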
Combined Virtual Column & SQL Defined Function
By combining virtual columns with SQL Defined Functions, historical metric logic can be preserved using conditional expressions based on dates, ensuring seamless compatibility across data versions.
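One way to express this combination is a virtual column that dispatches to versioned function logic by date, so old partitions keep their historical metric definition. The function names, column names, and cutover date below are illustrative:

```sql
-- Hypothetical virtual column preserving historical metric logic:
ALTER TABLE ad_spend ADD VIRTUAL COLUMN category AS
  CASE
    WHEN date < '2022-01-01' THEN classify_v1(industry_id)  -- old definition
    ELSE classify_v2(industry_id)                           -- current definition
  END;
```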
Implementation Principles
Unified SQL with Apache Calcite
Calcite parses standard SQL and rewrites it to various dialects, enabling seamless translation between Hive, Presto, Spark, and other engines.
Virtual Columns
Calcite’s support for virtual columns is leveraged by projects like Flink to implement computed columns.
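Flink's computed columns are a public example of this mechanism; the DDL below uses real Flink SQL syntax, though the table itself is illustrative:

```sql
-- Flink SQL computed column: total_cost is derived at query time,
-- not stored, much like a virtual column.
CREATE TABLE orders (
  price      DOUBLE,
  quantity   INT,
  total_cost AS price * quantity
) WITH (
  'connector' = 'datagen'
);
```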
SQL Defined Function
Implementing SQL Defined Function requires extending Calcite’s parser and integrating with Hive’s metadata store via Thrift interfaces.
Future Planning
Support more SQL dialects (e.g., ClickHouse) as well as NoSQL databases.
Introduce independent authorization for virtual columns.
Allow overloading of SQL Defined Functions.
Develop powerful SQL Macros that support joins and aggregates.
Automate discovery of high‑frequency metrics to create virtual columns or functions.
These advancements aim to provide a robust, cloud‑native lakehouse solution through Volcano Engine’s LAS service.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.