Apache Calcite for Multi‑Engine Metric Management: Practices & Roadmap
This article explains the technical principles and best practices of multi‑engine metric management based on Apache Calcite, covering common metric management methods, implementation details of unified SQL, virtual columns, and SQL defined functions, and outlines ByteDance’s future roadmap for extending these capabilities.
Introduction
Maintaining the same metric logic across Hive, Spark, and Presto is a recurring source of cost and error. We start with three questions that most data engineers will recognize, then walk through the common approaches, ByteDance's practices, the underlying implementation, and future plans.
Three Key Questions
Do you notice function name differences across Spark and Presto, such as instr vs strpos?
Are you confused about the keys and meanings within large map or JSON fields?
Do you repeatedly copy the same CASE WHEN logic across queries?
Common Metric Management Methods
Metrics in OLAP are simply SQL queries over table columns. Typical solutions for handling engine differences include manually adjusting SQL per engine, modifying Presto to align with Hive, or using Apache Calcite to automatically rewrite SQL. The Calcite‑based approach is recommended.
Integer Division Differences
Different engines return different types for integer division. For example, a click‑through rate computed as clicks / impressions yields an int in Presto but a double in Hive. Three common solutions are:
Manually change SQL per engine (hard to maintain).
Develop custom Presto extensions (intrusive).
Use Calcite to automatically rewrite SQL (recommended).
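The Calcite approach can be pictured as a type-aware rewrite of the submitted query. The sketch below is illustrative (table and column names are invented; the exact rewrite depends on the rules configured in Calcite), but it shows the kind of transformation involved:

```sql
-- Unified SQL, written once:
SELECT clicks / impressions AS ctr FROM ad_stats;

-- Rewritten for Presto, where "/" on two integers is integer division
-- and would truncate the rate to 0:
SELECT CAST(clicks AS DOUBLE) / impressions AS ctr FROM ad_stats;

-- Hive already returns a double for "/", so the query can pass through unchanged.
```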
Handling Large MAP/JSON Fields
When business needs require rapid metric access, large MAP or JSON columns become hard to maintain. Traditional solutions are registering metadata services outside Hive or maintaining internal documentation, both of which add overhead. Virtual columns provide a low‑cost alternative.
Frequent CASE WHEN Changes
Repeatedly updating CASE WHEN fragments leads to manual notifications, high maintenance, or reliance on UDFs, enumeration, or views—all of which have drawbacks. ByteDance introduces SQL Defined Function to encapsulate reusable logic without copying code.
Best Practices: Two‑Syntax SQL Solution
ByteQuery SQL offers a unified syntax across engines such as Hive, Spark, and Presto, automatically translating functions (e.g., instr ↔ strpos) and adjusting data types (int ↔ double). This reduces metric management cost and improves development efficiency.
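As a concrete sketch of the function translation (query and identifiers are illustrative, not from the article):

```sql
-- Unified SQL, written once against the unified syntax:
SELECT instr(url, 'utm_source') AS pos FROM page_views;

-- Output for Presto, where the equivalent function is strpos:
SELECT strpos(url, 'utm_source') AS pos FROM page_views;

-- Output for Hive/Spark keeps instr, since that is the native name there.
```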
Virtual Column Practice
Virtual columns act as column‑level views, avoiding the redundancy of full table views. For example, a virtual column virtual_age defined as age + 1 is rewritten at execution time to the underlying expression.
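A minimal sketch of the virtual_age example, assuming a hypothetical DDL syntax (the article does not show ByteQuery's exact statement for registering a virtual column):

```sql
-- Hypothetical registration of the virtual column:
ALTER TABLE users ADD VIRTUAL COLUMN virtual_age AS age + 1;

-- Queries reference the virtual column as if it were a physical one:
SELECT virtual_age FROM users;

-- At execution time the reference is rewritten to the underlying expression:
SELECT age + 1 AS virtual_age FROM users;
```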
SQL Defined Function Practice
SQL Defined Function allows defining reusable SQL expressions as functions. In advertising, a function classify maps industry IDs to categories, eliminating the need to copy large CASE WHEN blocks.
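The classify example might look like the following sketch; the definition syntax, IDs, and category names are hypothetical, since the article only describes the idea:

```sql
-- Hypothetical SQL Defined Function encapsulating the mapping once:
CREATE SQL FUNCTION classify(industry_id BIGINT) AS
  CASE
    WHEN industry_id IN (1001, 1002) THEN 'e-commerce'
    WHEN industry_id IN (2001, 2002) THEN 'gaming'
    ELSE 'other'
  END;

-- Call sites stay short instead of copying the CASE WHEN block:
SELECT classify(industry_id) AS category, SUM(cost) AS total_cost
FROM ad_spend
GROUP BY classify(industry_id);
```

When the mapping changes, only the function body is updated; every query that calls classify picks up the new logic without edits.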
Combined Virtual Column & SQL Defined Function
By combining virtual columns with SQL Defined Functions, historical metric logic can be preserved using conditional expressions based on dates, ensuring seamless compatibility across data versions.
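One way to express this combination is a virtual column that dispatches to versioned function logic by date, so old partitions keep their historical metric definition. The function names, column names, and cutover date below are illustrative:

```sql
-- Hypothetical virtual column preserving historical metric logic:
ALTER TABLE ad_spend ADD VIRTUAL COLUMN category AS
  CASE
    WHEN date < '2022-01-01' THEN classify_v1(industry_id)  -- old definition
    ELSE classify_v2(industry_id)                           -- current definition
  END;
```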
Implementation Principles
Unified SQL with Apache Calcite
Calcite parses standard SQL and rewrites it to various dialects, enabling seamless translation between Hive, Presto, Spark, and other engines.
Virtual Columns
Calcite’s support for virtual columns is leveraged by projects like Flink to implement computed columns.
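Flink's computed columns are a public example of this mechanism; the DDL below uses real Flink SQL syntax, though the table itself is illustrative:

```sql
-- Flink SQL computed column: total_cost is derived at query time,
-- not stored, much like a virtual column.
CREATE TABLE orders (
  price      DOUBLE,
  quantity   INT,
  total_cost AS price * quantity
) WITH (
  'connector' = 'datagen'
);
```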
SQL Defined Function
Implementing SQL Defined Function requires extending Calcite’s parser and integrating with Hive’s metadata store via Thrift interfaces.
Future Planning
Support more SQL dialects (e.g., ClickHouse) as well as NoSQL databases.
Introduce independent authorization for virtual columns.
Allow overloading of SQL Defined Functions.
Develop powerful SQL Macros that support joins and aggregates.
Automate discovery of high‑frequency metrics to create virtual columns or functions.
These advancements aim to provide a robust, cloud‑native lakehouse solution through Volcano Engine’s LAS service.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.