ClickHouse Architecture and Core Technologies Overview

ClickHouse is an open‑source, massively parallel, column‑oriented OLAP database that integrates its own columnar storage, vectorized batch processing, pre‑sorted data, diverse table engines, extensive data types, sharding with replication, sparse primary‑key and skip indexes, and a multithreaded query engine, delivering high‑throughput real‑time analytics on massive datasets.

JD Retail Technology

Apr 8, 2025

ClickHouse is an open-source, distributed OLAP database, originally developed at Yandex, that provides real-time analytical processing over massive datasets.

Overall Architecture: ClickHouse follows an MPP (Massively Parallel Processing) design in which all nodes are peers, and each node provides both a storage layer and a query-processing layer. Unlike many big-data engines that separate compute from storage, ClickHouse ships its own columnar storage, which allows storage-side optimizations for query execution.

Columnar Storage ("Sword" style): Each column's data is stored in its own files, so a query reads only the columns it references. Values within a single column tend to be similar, which yields high compression ratios (often around 8:1), and I/O drops further because only the relevant column blocks are decompressed.
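To make the I/O argument concrete, here is a minimal sketch that compares how many values a single-column aggregate touches in a row layout versus a column layout. The table, its three columns (`user_id`, `country`, `amount`), and the function names are hypothetical, and the in-memory lists only stand in for on-disk files.

```python
# Toy table with three columns: (user_id, country, amount). All names are illustrative.
rows = [(i, "CN" if i % 3 else "US", i * 10) for i in range(1000)]

# Column layout: one separate sequence per column, mimicking one file per column.
user_id, country, amount = (list(col) for col in zip(*rows))

def sum_amount_row_layout(rows):
    """Sum the amount column from a row layout; every full row has to be loaded."""
    total = 0
    touched = 0
    for row in rows:
        touched += len(row)  # all three values of the row are read
        total += row[2]
    return total, touched

def sum_amount_column_layout(amount_column):
    """Sum the amount column from a column layout; only that column is read."""
    return sum(amount_column), len(amount_column)

print(sum_amount_row_layout(rows))       # (4995000, 3000)
print(sum_amount_column_layout(amount))  # (4995000, 1000)
```

Both layouts return the same total, but the column layout touches a third of the values here. The gap widens as tables gain more columns that the query does not reference.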

Vectorized Execution ("Blade" style): ClickHouse processes data in batches of rows (e.g., 1024 rows) rather than one row at a time and applies SIMD instructions to each batch, which improves CPU cache utilization and overall query throughput.
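The batch-at-a-time control flow can be sketched as follows. This is only an illustration of the processing model: pure Python has no SIMD, the function names are made up for this example, and the batch size simply reuses the 1024-row figure mentioned above.

```python
from array import array

BATCH_SIZE = 1024  # example batch size taken from the text above

def batches(column, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` values from a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def filtered_sum(column, threshold):
    """Sum the values greater than `threshold`, one batch at a time."""
    total = 0
    for batch in batches(column):
        # Each step (filter, then sum) handles a whole batch per call,
        # so per-row dispatch overhead is paid once per batch.
        selected = [v for v in batch if v > threshold]
        total += sum(selected)
    return total

amounts = array("q", range(5000))   # a typed, contiguous column of 64-bit integers
print(filtered_sum(amounts, 4000))  # 4495500
```

In a real vectorized engine the per-batch filter and sum would be tight native loops over contiguous memory, which is where SIMD and cache locality pay off.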

Pre-sorting ("Spear" style): Before data is persisted, it is sorted by the table's sorting key (the primary key is a prefix of it). Sorted data parts are later merged in the background, similar to an LSM tree. Sorted data enables efficient range scans and reduces disk reads.
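A rough sketch of the idea, assuming each inserted batch becomes one sorted "part" and parts are merged later. The helper names (`write_part`, `merge_parts`, `range_scan`) are hypothetical, and the first tuple element plays the role of the sorting key.

```python
import heapq
from bisect import bisect_left, bisect_right

def write_part(batch):
    """Sort one inserted batch by its key before 'persisting' it as a part."""
    return sorted(batch, key=lambda row: row[0])

def merge_parts(*parts):
    """Merge already-sorted parts into one larger sorted part (LSM-like merge)."""
    return list(heapq.merge(*parts, key=lambda row: row[0]))

def range_scan(part, low, high):
    """Return rows with low <= key <= high via binary search on the sorted keys."""
    keys = [row[0] for row in part]
    return part[bisect_left(keys, low):bisect_right(keys, high)]

part_a = write_part([(5, "e"), (1, "a"), (9, "i")])
part_b = write_part([(4, "d"), (8, "h"), (2, "b")])
merged = merge_parts(part_a, part_b)

print(merged)                    # [(1, 'a'), (2, 'b'), (4, 'd'), (5, 'e'), (8, 'h'), (9, 'i')]
print(range_scan(merged, 2, 5))  # [(2, 'b'), (4, 'd'), (5, 'e')]
```

Because the merged part is sorted, the range scan reads one contiguous slice instead of checking every row.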

Table Engines ("Whip" style): The table engine determines where and how data is stored, how writes are handled, and which queries are supported. It also governs concurrent access, index usage, and replication behavior.

Data Types: ClickHouse supports over 100 types, including basic numeric types, dates, strings, complex structures (Array, Tuple, Nested, Map), aggregate function types, and special types such as UUID, IPv4/IPv6, Nullable, and LowCardinality, which applies dictionary encoding.
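Dictionary encoding, the idea behind LowCardinality, can be sketched as below. This shows the general technique only, not ClickHouse's on-disk format; the function names and sample values are invented for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a list of distinct values."""
    dictionary = []  # code -> distinct value
    positions = {}   # distinct value -> code
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

cities = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai", "Beijing"]
dictionary, codes = dictionary_encode(cities)

print(dictionary)  # ['Beijing', 'Shanghai', 'Shenzhen']
print(codes)       # [0, 1, 0, 2, 1, 0]
```

The column stores each distinct string once plus one small integer per row, which helps most when a column has few distinct values relative to its row count.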

Sharding and Replication ("Palm" style): Data is horizontally sharded across nodes, and each shard is replicated for fault tolerance. The sharding key can be a fixed field, a random function, or a hash of a key. For queries, a replica is chosen by a load-balancing strategy: Random, Nearest hostname, Levenshtein distance, In-Order, First-or-Random, or Round-Robin.
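As a sketch of hash-based sharding and round-robin replica selection: CRC32 is used here only because it is deterministic and in the standard library, not because ClickHouse uses it, and the shard count and replica names are hypothetical.

```python
import zlib
from itertools import cycle

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key, num_shards=NUM_SHARDS):
    """Map a sharding key to a shard index with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards

def round_robin(replicas):
    """Return an iterator that hands out replicas in rotation."""
    return cycle(replicas)

# Rows with the same key always land on the same shard.
for user_id in ("u1001", "u1002", "u1001"):
    print(user_id, "-> shard", shard_for(user_id))

# Queries against one shard are spread across its replicas.
picker = round_robin(["replica-1", "replica-2"])
print([next(picker) for _ in range(4)])
# ['replica-1', 'replica-2', 'replica-1', 'replica-2']
```

Hashing a key keeps related rows together on one shard, while a random sharding function spreads rows evenly without that locality.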

Index Design ("Arrow" style): ClickHouse uses a sparse primary-key index stored in separate files and supports skip indexes such as minmax, set, and Bloom filter, which prune irrelevant data blocks during query execution.
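The two pruning mechanisms can be sketched as follows, under simplifying assumptions: rows are grouped into fixed-size granules, the sparse index keeps one mark (the first key) per granule, and the minmax index keeps one (min, max) pair per granule. The granule size of 4 is only for readability, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # tiny on purpose; real granules hold thousands of rows

def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the first key of each granule (one mark per granule)."""
    return sorted_keys[::granule_size]

def granules_for_key_range(marks, low, high):
    """Return granule numbers that may contain keys in [low, high]."""
    # Conservative start: if a mark equals `low`, the previous granule
    # may still end with equal keys, so it is included as well.
    first = max(bisect_left(marks, low) - 1, 0)
    last = bisect_right(marks, high) - 1
    return list(range(first, last + 1))

def build_minmax(values, granule_size=GRANULE_SIZE):
    """Store (min, max) of a non-key column for each granule."""
    chunks = (values[i:i + granule_size] for i in range(0, len(values), granule_size))
    return [(min(chunk), max(chunk)) for chunk in chunks]

def granules_for_value_range(minmax, low, high):
    """Skip granules whose [min, max] cannot overlap [low, high]."""
    return [g for g, (lo, hi) in enumerate(minmax) if hi >= low and lo <= high]

keys = list(range(0, 40, 2))      # sorted primary-key column: 0, 2, ..., 38
marks = build_sparse_index(keys)  # [0, 8, 16, 24, 32]
print(granules_for_key_range(marks, 10, 18))  # [1, 2]

prices = [5, 7, 6, 8, 50, 52, 51, 49, 9, 6, 7, 5]
minmax = build_minmax(prices)     # [(5, 8), (49, 52), (5, 9)]
print(granules_for_value_range(minmax, 40, 60))  # [1]
```

Neither index locates individual rows. Both only narrow the set of granules that must be read, which is why the index stays small enough to keep in memory.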

Computation Engine ("Qi" style): The engine translates SQL into a physical plan, executes it with multiple threads, and distributes work across nodes. Although it performs well, it lacks a sophisticated optimizer and has limited JOIN support.
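The split-then-merge pattern behind multithreaded and distributed aggregation can be sketched like this. It is a simplified model with hypothetical function names; Python threads do not provide real CPU parallelism for this workload, so the example shows only the structure of the computation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_count(chunk):
    """Aggregate one chunk independently: count rows per key."""
    return Counter(chunk)

def parallel_group_count(chunks, workers=4):
    """Run partial aggregations concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_count, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total

# Each chunk stands in for the data handled by one thread or one node.
chunks = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
counts = parallel_group_count(chunks)
print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

The same two-phase shape applies across nodes: each shard computes a partial aggregate, and the node that received the query merges them.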

Overall, the combination of columnar storage, vectorized execution, aggressive compression, and a distributed MPP architecture makes ClickHouse a high‑performance solution for large‑scale analytical workloads.

