Deep Dive into Apache Druid V1 Data Storage Format: Index Structures and Disk Layout

This article provides an in‑depth analysis of Apache Druid V1's column‑oriented storage format, covering dimension structures, dictionaries, variable‑length integer encoding, inverted indexes, array handling, and how these components are used during query execution.

DataFunTalk

Apache Druid is a high‑performance OLAP engine that relies on a custom column‑oriented storage format to achieve sub‑second queries on massive datasets. The article examines the V1 storage format, focusing on index structures and on‑disk representation.

Dimension Data Structure

Druid stores each column as a separate logical file; dimensions are indexed while metrics are stored as raw row values. Using a sample advertising‑effect dataset, the article illustrates the logical storage of a "city" dimension and two metrics.

Dictionary

The dictionary de‑duplicates column values, sorts them, and assigns each a numeric code equal to its array index. This enables fixed‑length integer encoding, reducing storage and allowing constant‑time offset calculations.
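The dictionary step can be sketched in a few lines of Python. This is a minimal illustration rather than Druid's implementation; the function names and the sample city values are assumptions made for the example.

```python
from bisect import bisect_left


def build_dictionary(values):
    """Return the sorted distinct values; a value's code is its list index."""
    return sorted(set(values))


def encode(dictionary, value):
    """Binary-search the sorted dictionary and return the value's code."""
    i = bisect_left(dictionary, value)
    if i == len(dictionary) or dictionary[i] != value:
        raise KeyError(value)
    return i


# Hypothetical sample rows for the "city" dimension.
rows = ["beijing", "shanghai", "beijing", "guangzhou"]
dictionary = build_dictionary(rows)
codes = [encode(dictionary, v) for v in rows]
```

Because the dictionary is sorted, looking up a value is a binary search, and decoding a code back to its value is a plain index into the list.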

Encoded Dimension Values

Encoded values are stored as integers whose byte width is the same for every row of a column but varies from column to column: the width is the smallest one that fits the number of distinct values in the column. The encoding scheme is:

1    to 2^8 - 1  => 1 byte
2^8  to 2^16 - 1 => 2 bytes
2^16 to 2^24 - 1 => 3 bytes
2^24 to 2^32 - 1 => 4 bytes
2^32 to 2^40 - 1 => 5 bytes

For the "city" dimension with three unique values, each code occupies one byte.
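As a rough sketch of how the table above translates into code, the following Python picks a byte width from the number of distinct values and packs codes at that width. The function names are hypothetical, and the big-endian byte order is an assumption of this example, not something the text above states.

```python
def bytes_for_cardinality(cardinality):
    """Byte width per code, following the table: 1..2^8-1 -> 1, 2^8..2^16-1 -> 2, ..."""
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    return (cardinality.bit_length() + 7) // 8


def pack_codes(codes, width):
    """Concatenate the codes, each stored in exactly `width` bytes (big-endian assumed)."""
    return b"".join(code.to_bytes(width, "big") for code in codes)


def read_code(buf, width, row):
    """Fixed width means row i starts at byte offset i * width."""
    start = row * width
    return int.from_bytes(buf[start:start + width], "big")


width = bytes_for_cardinality(3)          # the "city" column has three distinct values
packed = pack_codes([0, 2, 0, 1], width)  # hypothetical codes for four rows
```

Since every code in a column has the same width, reading the code for a given row needs only a multiplication, with no scan and no offset table.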

Inverted Index

Each dictionary entry has an associated bitmap indicating which rows contain that value. Bitmaps are compressed to save space, resulting in variable‑length storage.
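The idea can be illustrated with uncompressed bitmaps, using a Python integer as a bit set in which bit i stands for row i. This is a simplified sketch: it omits the bitmap compression mentioned above, and the helper names are assumptions.

```python
def build_inverted_index(codes, cardinality):
    """Return one bitmap per dictionary code; bit `row` is set if that row holds the code."""
    bitmaps = [0] * cardinality
    for row, code in enumerate(codes):
        bitmaps[code] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical encoded column: four rows, three distinct values.
bitmaps = build_inverted_index([0, 2, 0, 1], 3)
```

With this layout, an OR filter over several values becomes a bitwise OR of their bitmaps, and an AND across dimensions becomes a bitwise AND.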

Array Dimensions

Druid also supports array-type dimensions. Their storage still uses a dictionary and inverted index, but the encoded values form a two-level structure. Each row holds an inner list of fixed-width codes, and because rows can contain different numbers of codes, the outer list is a sequence of variable-length elements.
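One way to picture the two-level structure is a flat byte buffer of fixed-width codes plus a table of end offsets, one per row. The sketch below is an illustration under that assumption; the function names, the end-offset convention, and the big-endian byte order are choices made for this example rather than a description of Druid's exact bytes.

```python
def pack_multi_value(rows, width):
    """Pack a list of code lists; return (end_offsets, data)."""
    data = bytearray()
    end_offsets = []
    for codes in rows:
        for code in codes:
            data += code.to_bytes(width, "big")
        end_offsets.append(len(data))  # byte position where this row ends
    return end_offsets, bytes(data)


def read_row(end_offsets, data, width, row):
    """Use the offset table to find the row, then step through its fixed-width codes."""
    start = end_offsets[row - 1] if row > 0 else 0
    end = end_offsets[row]
    return [int.from_bytes(data[i:i + width], "big") for i in range(start, end, width)]


# Hypothetical array column: three rows with 2, 1, and 3 codes.
end_offsets, data = pack_multi_value([[0, 1], [2], [0, 1, 2]], 1)
```

The offset table is what makes random access possible when rows differ in length; inside a row, the fixed code width still allows direct indexing.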

Storage Summary

The physical layout depends on whether elements are fixed‑length or variable‑length, influencing whether data is stored as fixed‑length blocks or with offset tables. Figure 6 (not reproduced) summarizes these patterns.

version: 1 byte
allowReverseLookup: 1 byte
numBytesUsed: 4 bytes
numElements: 4 bytes
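
The four header fields listed above add up to 10 bytes. A minimal sketch of reading such a header with Python's struct module follows; the big-endian byte order, the signed 32-bit interpretation of the two 4-byte fields, and the Header name are assumptions of this example.

```python
import struct
from collections import namedtuple

# B = 1-byte unsigned, i = 4-byte signed; ">" means big-endian with no padding (assumed).
HEADER = struct.Struct(">BBii")

Header = namedtuple(
    "Header", "version allow_reverse_lookup num_bytes_used num_elements"
)


def parse_header(buf):
    """Decode the fixed-size header found at the start of `buf`."""
    version, allow, num_bytes_used, num_elements = HEADER.unpack_from(buf, 0)
    return Header(version, bool(allow), num_bytes_used, num_elements)


# Hypothetical header bytes followed by an opaque payload.
raw = HEADER.pack(1, 1, 42, 3) + b"payload"
header = parse_header(raw)
```

A reader would typically use numElements to size the element lookup and numBytesUsed to know where the structure ends in the file.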

How It Is Used in Queries

During query execution, the dictionary is consulted to translate filter values, the inverted index (bitmap) identifies matching rows, and the encoded dimension values provide fast access to the underlying integer codes. The article demonstrates this with a simple SQL query:

select city, sum(click_cnt) from table_t where category=0 or category=1 group by city

The workflow uses the dictionary (steps 1 & 6), the inverted index (step 3), and the encoded dimension values (step 4) to resolve the query.
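
The flow for this query can be sketched end to end in Python. All column contents below are hypothetical sample data, the bitmaps are uncompressed integers (bit i stands for row i), and the function names are assumptions; the sketch only shows how the three structures cooperate and does not reproduce Druid's query engine.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical columns for five rows of table_t.
category_dict = [0, 1, 2]                        # sorted distinct category values
category_bitmaps = [0b00101, 0b01010, 0b10000]   # rows per category code
city_dict = ["beijing", "guangzhou", "shanghai"]
city_codes = [0, 2, 0, 1, 2]                     # encoded city value per row
click_cnt = [10, 20, 30, 40, 50]                 # raw metric value per row


def lookup(dictionary, value):
    """Dictionary: translate a filter value to its code, or None if absent."""
    i = bisect_left(dictionary, value)
    return i if i < len(dictionary) and dictionary[i] == value else None


def run_query(filter_values):
    # Inverted index: OR together the bitmaps of the requested categories.
    matched = 0
    for value in filter_values:
        code = lookup(category_dict, value)
        if code is not None:
            matched |= category_bitmaps[code]
    # Encoded values: group the matching rows by city code and sum the metric.
    sums = defaultdict(int)
    row = 0
    while matched:
        if matched & 1:
            sums[city_codes[row]] += click_cnt[row]
        matched >>= 1
        row += 1
    # Dictionary again: translate city codes back to city names.
    return {city_dict[code]: total for code, total in sums.items()}


result = run_query([0, 1])
```

Grouping on integer codes and translating back to strings only at the end keeps string handling out of the per-row loop.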

In summary, Druid's V1 storage format combines dictionaries, encoded dimension values, and bitmap indexes, and each plays a distinct role when a query is resolved.