SuperSQL: A High‑Performance Cross‑Engine, Cross‑Data‑Center SQL Middleware for Big Data
The article introduces SuperSQL, a federated SQL middleware that unifies heterogeneous data sources across multiple data centers, leverages Apache Calcite for cost‑based optimization, pushes down operators to various engines, manages metadata with a Trie model, and demonstrates significant performance gains over traditional solutions.
Since 2010, many companies have built applications on independent open‑source big‑data components such as Spark, Flink, or HBase, but after 2020 a single component can no longer satisfy the data needs of large internet firms like Tencent, Alibaba, NetEase, or Didi. Data is distributed across private clouds, public clouds, and isolated clusters, raising the question of how SuperSQL can compute across engines and data centers.
The presentation covers nine topics: background goals, industry competitors, system architecture, metadata management, cross‑source SQL operator push‑down, distributed compute engine, cross‑DC query optimization, performance evaluation, and future plans.
Background & Goals
Heterogeneous data sources with different versions (e.g., Hive 1.0 vs 2.0) cannot be queried uniformly.
Legal requirements such as GDPR and China’s data law demand that data stay within its originating data center, necessitating a unified, real‑time query solution.
SuperSQL Achievements – Cross‑Business Platform (Part 1)
Saved 5‑10 developer‑person‑months and 1‑2 hours of data‑import time.
Improved query‑analysis performance by more than three times.
Provided a single entry point for managing all data sources.
Eliminated data‑consistency issues by avoiding ETL and ensuring a single source of truth.
SuperSQL Achievements – Cross‑Business Platform (Part 2)
By pushing computation from a historical data warehouse (e.g., HDFS) to an OLAP engine like ClickHouse, SuperSQL removes the need for intermediate data movement, reduces development effort, and decouples systems.
Industry Competitors – Open‑Source
SparkSQL suffers from DAG‑based limitations and lacks full SQL support.
Presto struggles with large joins and out‑of‑memory errors.
Apache Drill, built on Calcite, lacks robust multi‑engine integration.
Industry Competitors – Extensions and Proprietary Solutions
Huawei Pollux has limited extensibility and is not fully open‑source.
QuickSQL (also based on Calcite) only supports single‑source JDBC push‑down.
Google F1‑query is internal‑only; Alibaba Data Lake Analytics lacks flexible data‑source hierarchy management.
Huawei Hetu and Contiamo Workbench are not open‑source and have limited feature sets.
SuperSQL vs. Commercial Competitors
SuperSQL distinguishes itself by supporting rule‑based multi‑source push‑down, cost‑based (CBO) and heuristic‑based (HBO) optimization, and a rich metadata module.
System Architecture
SuperSQL’s core is built on Apache Calcite, an industrial‑grade SQL parser, validator, and optimizer that converts SQL to relational algebra, applies Volcano‑model optimization, and rewrites the plan for push‑down to target engines. Calcite also provides connectors for diverse data sources (Spark, MongoDB, Elasticsearch, etc.) via JDBC.
The “federated government” analogy illustrates how a user query is delegated to multiple data‑center “states”, each selecting the optimal engine (Spark, Presto, etc.) and performing local computation before merging results.
Example: a query selecting from DC1 UNION ALL SELECT from DC2 generates two sub‑queries (view1, view2) that are pushed down independently and then combined.
Metadata Management
SuperSQL models metadata as a Trie hierarchy: the root represents a virtual node, leaves are tables, and intermediate nodes capture data‑center, database, and schema levels, enabling precise source identification even with duplicate table names.
Three advantages: universality, flexibility, lightweight.
CBO statistics collection gathers min, max, null count, distinct count, average length, etc., for each column.
Statistics are persisted in Hive Metastore; the ANALYZE command samples source tables and stores segment‑level metrics.
Column‑level statistics are merged across samples (e.g., taking the maximum of max values).
Cross‑Source SQL Operator Push‑Down
Supports a wide range of data sources (Hive, SparkSQL, PostgreSQL, TBase, MySQL, Oracle) and compute engines (Hive, Flink, Spark, Presto).
Push‑down reduces network I/O, balances engine load, and enables over 60 operator rules (AVG, MAX, MIN, ORDER BY, UNION ALL, JOIN, etc.).
Examples include JOIN push‑down where operations on the same source are merged before execution, and UNION ALL aggregation push‑down to minimize data transfer.
Concurrent sub‑query slicing (divide‑and‑conquer) splits large partitions into smaller queries for parallel execution.
Distributed Compute Engine
SuperSQL abstracts the underlying engine, allowing Spark, Flink, Presto, or others to be used without invasive changes. It creates temporary views for each source, executes sub‑plans on the respective engines, and merges results in a central engine.
Cross‑DC Query Optimization
Multi‑stage Volcano planner incorporates timeout, dead‑lock detection, and network‑bandwidth awareness.
DC‑level CBO considers intra‑DC bandwidth; cross‑DC CBO adds inter‑DC bandwidth and selects the optimal engine based on cost.
Performance Evaluation
Tests on six 128 GB / 48‑core servers using Spark, Hive, PostgreSQL, and Hadoop show that SuperSQL achieves up to 26× speed‑up for single‑source queries and 5× for cross‑source queries compared with baseline SparkSQL JDBC. TPC‑DS benchmarks confirm superior scalability as data volume grows.
Future Roadmap
Introduce HBO (historical‑based optimization) using SQL fingerprinting and machine‑learning to select the best engine based on past performance.
Refactor metadata storage to a dedicated service, enabling unified CBO/RBO/HBO pipelines.
Incorporate multi‑cost factor optimization, considering compute‑resource pricing and business constraints.
Integrate OLAP engines (Elasticsearch, ClickHouse) for end‑to‑end analytics without manual ETL.
Thank you for attending.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.