
Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture, reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines, auto‑tuning, global dictionary, fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Big Data Technology Architecture

Apache Kylin's original plug‑in architecture used HBase as the storage engine, which led to single‑node query bottlenecks, costly compression and decompression overhead (HBase is not a true columnar store), and difficult cluster operations, especially in cloud environments.

Kyligence addresses these issues with the Kylin on Parquet solution, which retains the modular design but replaces the storage layer with Parquet files and moves both build and query processing to Spark, enabling distributed execution on YARN.

Why Parquet? Parquet provides true columnar storage, eliminates HBase’s single‑point failure, simplifies maintenance, and allows each dimension and measure to be stored as separate columns, reducing serialization overhead.

Build Engine – The new engine runs entirely on Spark, supports automatic parameter tuning, distributed global dictionary construction, and automatic retry of failed tasks. It monitors all stages via Spark UI, can auto‑adjust executor resources, and handles OutOfMemoryError, ClassNotFoundException, and other exceptions with configurable retry limits.
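The retry behavior described above can be sketched as follows. This is a minimal illustration in Python, not Kylin's actual JVM implementation; the function and parameter names are hypothetical, and `MemoryError`/`ImportError` stand in for the JVM's `OutOfMemoryError` and `ClassNotFoundException`.

```python
# Hypothetical sketch of the build engine's auto-retry policy.
# Names are illustrative; the real engine runs on the JVM.

RETRIABLE = (MemoryError, ImportError)  # stand-ins for OutOfMemoryError, ClassNotFoundException

def run_with_retry(task, max_retries=3, executor_mem_gb=4):
    """Run a build task, retrying retriable failures up to a configured limit.

    On a memory failure the sketch also doubles executor memory, mimicking
    the engine's automatic resource adjustment before the next attempt."""
    attempt = 0
    while True:
        try:
            return task(executor_mem_gb)
        except RETRIABLE as exc:
            attempt += 1
            if attempt > max_retries:
                raise  # retry budget exhausted: surface the failure
            if isinstance(exc, MemoryError):
                executor_mem_gb *= 2  # auto-adjust resources before retrying
```

The key design point is that only a known set of exception types is retried; anything else fails fast so that genuine bugs are not masked by retries.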

Interface Design – Existing Kylin interfaces (e.g., AbstractExecutable, CubingJob) are reused; a new SparkCubingJob builds each segment through a streamlined pipeline of resource estimation, cube construction, and Parquet persistence.
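The step-chaining idea behind the reused job abstraction can be mimicked in a few lines. This is a hypothetical Python sketch only; the real classes are Kylin's Java `AbstractExecutable` and `CubingJob`, and the step names below are placeholders.

```python
# Illustrative sketch of a job composed of ordered build steps,
# loosely mimicking Kylin's CubingJob abstraction (names are hypothetical).

class Step:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def run(self, ctx):
        # Each step reads what earlier steps wrote and records its own output.
        ctx[self.name] = self.fn(ctx)

class SparkCubingJob:
    """Chains resource estimation, cube construction, and Parquet persistence."""
    def __init__(self, steps):
        self.steps = steps

    def run(self):
        ctx = {}
        for step in self.steps:
            step.run(ctx)
        return ctx
```

A usage example: `SparkCubingJob([Step("estimate", ...), Step("build", ...), Step("persist", ...)]).run()` executes the steps in order against a shared context.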

Global Dictionary – Built in a distributed manner and free of Kylin 3.0’s integer‑size limitation, it partitions distinct values into buckets, encodes each bucket independently, and stores the dictionary files together with metadata for fast lookup.
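The bucket-based encoding can be illustrated with a small sketch. This is a single-process Python stand-in, assuming hash partitioning and per-bucket offsets; the real dictionary is built with Spark and its exact scheme may differ.

```python
# Minimal sketch of a bucketed global dictionary: distinct values are
# partitioned into buckets, each bucket is encoded independently, and a
# per-bucket offset turns local ids into globally unique ids.
# Illustrative only; not Kylin's actual implementation.

def build_global_dict(values, num_buckets=4):
    """Map each distinct value to a globally unique integer id."""
    buckets = [[] for _ in range(num_buckets)]
    for v in sorted(set(values)):              # distinct values only
        buckets[hash(v) % num_buckets].append(v)
    offsets, total = [], 0
    for b in buckets:                          # metadata: starting id per bucket
        offsets.append(total)
        total += len(b)
    return {v: offsets[i] + j
            for i, b in enumerate(buckets)
            for j, v in enumerate(b)}
```

Because each bucket's ids are simply `offset + local_id`, buckets can be encoded in parallel without coordinating on a shared counter, which is what removes the single-node bottleneck of a monolithic dictionary.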

Storage – Cuboid data are persisted as Parquet files, with dimensions mapped to unique numeric identifiers; the directory layout isolates segments to avoid write conflicts.
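The two ideas above (numeric dimension identifiers, segment-isolated directories) can be sketched as follows. Both helpers are hypothetical; the actual naming convention lives in Kylin's cube metadata and storage layer.

```python
# Illustrative sketch of the storage conventions described above.
# Function names and the path layout are assumptions, not Kylin's real scheme.

def dimension_ids(dimensions):
    """Map dimension names to the unique numeric identifiers used as
    column names in the persisted Parquet files."""
    return {name: i for i, name in enumerate(sorted(dimensions))}

def cuboid_path(base, cube, segment_id, cuboid_id):
    """Each segment writes under its own directory, so concurrent segment
    builds never write to the same files."""
    return f"{base}/{cube}/{segment_id}/{cuboid_id}.parquet"
```

Isolating segments at the directory level means a failed or retried segment build can simply overwrite its own directory without touching data from other segments.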

Performance Comparison – In a four‑node YARN cluster (400 GB RAM, 128 cores), the Spark‑based engine reduces storage usage by ~50 % and shortens build time dramatically compared to Kylin 3.0’s MapReduce engine. Query latency on the SSB and TPC‑H benchmarks is also significantly lower, especially for complex queries.

Query Engine – SQL is parsed by Calcite into an AST, optimized, and transformed into Spark execution plans. The engine supports breakpoint debugging of DataFrames, isolates external dependencies, and translates Calcite functions to Spark equivalents.
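The function-translation step can be illustrated with a lookup table. The mapping entries below are examples for illustration, assuming a dictionary-based translation with a lowercase fallback; the real converter handles many more functions and operator forms.

```python
# Illustrative sketch of translating Calcite function names to Spark SQL
# equivalents during plan conversion. The table and fallback rule are
# assumptions for this sketch, not the engine's actual mapping.

CALCITE_TO_SPARK = {
    "CHAR_LENGTH": "length",
    "SUBSTRING": "substring",
    "CEIL": "ceil",
    "||": "concat",
}

def translate(fn_name):
    """Look up the Spark equivalent of a Calcite function, falling back to
    the lowercased original name when no explicit mapping exists."""
    return CALCITE_TO_SPARK.get(fn_name.upper(), fn_name.lower())
```

Centralizing the mapping in one table keeps the Calcite dependency isolated on the parsing side, which is what makes it possible to debug the generated Spark plans (e.g., DataFrames) independently of the SQL front end.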

The article concludes with a demo walkthrough, instructions for trying the open‑source project, and links to the GitHub repository, wiki, and community channels for contribution and support.

performance optimization, Big Data, Data Warehouse, Spark, Apache Kylin, parquet
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
