Databases 8 min read

Comparative Analysis of Kylin, Druid, and ClickHouse Data Models and Index Structures

This article compares the data modeling and indexing techniques of three mainstream OLAP engines—Kylin, Druid, and ClickHouse—explaining their conversion processes, storage formats, query execution paths, and summarizing their strengths and limitations for aggregation and detail query scenarios.

DataFunTalk
DataFunTalk
DataFunTalk
Comparative Analysis of Kylin, Druid, and ClickHouse Data Models and Index Structures

The article introduces Kylin, Druid, and ClickHouse as the current mainstream OLAP engines and sets out to analyze their core technologies from the perspectives of data model and index structure.

Kylin Data Model : Kylin transforms a two‑dimensional Hive table into a Cube and then stores the Cube in HBase. The first transformation is a traditional cube operation (CuboId represents a fixed‑column data set). Example SQL: select A, B, sum(M), sum(N) from table group by A, B . The second transformation serializes CuboId and dimension information into the HBase rowkey and stores metric columns as column families, performing pre‑aggregation during the load.

Kylin Index Structure : Because data resides in HBase, Kylin uses HBase’s simplified B+‑tree index (HFile). The index is a clustered index sorted by rowkey, typically two or three levels deep, with default node size of 64 KB. Lookup follows the tree to locate the rowkey‑ordered data.

Druid Data Model : Druid performs pre‑aggregation by grouping on all dimensions (full group‑by) rather than building explicit cubes. The model stores the aggregated results directly.

Druid Index Structure : Druid employs a column‑oriented storage where each column has its own file. Dimension columns are indexed using Bitmap indexes. For each distinct value a bitmap of length equal to the total row count is generated, enabling fast bitwise operations during query execution. Example query: select site, sum(pv) from xx where date=2020-01-01 and city='bj' group by site . The query process involves locating the bitmap for the target dimension value, intersecting bitmaps, and finally aggregating the metric column.

ClickHouse Index Structure (MergeTree engine) : ClickHouse stores data in ordinary two‑dimensional tables but uses a column‑oriented index. Selected columns become index columns; the data file is sorted by these columns. Every 8192 rows a “mark” is recorded, storing the offset and mark number for each column. Query execution uses binary search on the primary index to locate relevant marks, then retrieves the required rows from the data blocks. Example query: select count(distinct action) where date=toDate(2020-01-01) and city='bj' . The process includes finding the relevant marks, intersecting row sets for the indexed columns, and finally reading the target column values.

Overall Summary : Kylin and Druid are suited for aggregation‑heavy workloads, while ClickHouse can handle both detail and aggregation scenarios. In aggregation queries, performance ranking is Kylin > Druid > ClickHouse. All three engines require domain experts for schema design, and ClickHouse offers the best vectorized execution support, whereas Druid has limited vectorization and Kylin currently lacks it.

big dataIndexingdata modelingClickHouseOLAPDruidKylin
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.