Databases 22 min read

ClickHouse Projection: Design, Implementation, Use Cases, and Production Impact

This article presents an in‑depth overview of ClickHouse Projection, covering its background, core features, architectural details, practical use cases, performance gains, advantages and drawbacks, and real‑world production results, illustrating how Projection enhances OLAP workloads at Kuaishou.

DataFunSummit
DataFunSummit
DataFunSummit
ClickHouse Projection: Design, Implementation, Use Cases, and Production Impact

1. ClickHouse Background

ClickHouse was created in 2008 for a Russian‑style web‑analytics system to replace MyISAM‑based MySQL, and was open‑sourced in 2016. Its columnar storage, LSM‑like structure, vectorized execution, and flexible P2P distributed architecture make it a high‑performance OLAP engine adopted by many internet companies.

Key Characteristics

Efficient read/write using columnar storage with advanced compression and encoding.

High‑throughput data processing via vectorization, column organization, and CPU‑instruction optimizations.

Loose P2P distributed design that scales to thousands of nodes.

For developers: real‑time writes, SQL‑centric operations, and a "white‑box" system that can be embedded and extended.

2. ClickHouse in Kuaishou OLAP Services

Kuaishou runs ClickHouse as its OLAP engine across multiple clusters, providing a unified query proxy layer for routing, monitoring, and service packaging. ETL services abstract data ingestion, allowing users to specify source streams without dealing with ClickHouse's write‑side complexities. The platform supports use cases such as retention analysis, A/B testing, video analytics, and risk control, handling tens of millions of daily queries and petabytes of data across thousands of nodes.

3. Projection Overview

Projection is a feature contributed by Kuaishou to the ClickHouse community. Inspired by Vertica’s concept, a Projection is a set of columns stored in a specific order or pre‑aggregated form. Kuaishou extended it to support arbitrary aggregations, partial table application, and lazy materialization, enabling performance improvements without sacrificing real‑time write capabilities.

Use‑Case Example

A video‑log table with columns user_id , device_id , domain , bytes , and duration receives ~20 k rows per second. Queries filtered by user_id are fast, but those filtered by device_id cause full‑table scans (~8 s). By defining a Projection ordered by device_id and materializing it, the same query runs 153× faster.

Projection Types

Normal Projection : stores selected columns in a different sort order.

Aggregate Projection : stores pre‑aggregated results (e.g., GROUP BY hour, domain with SUM(bytes) and AVG(duration)).

4. Projection Architecture

Components

Definition : can be created directly from a user query or inferred automatically.

Storage : implemented as a child Part of the original data Part, sharing partition information and participating in merges and mutations, thus guaranteeing consistency.

Query Analysis : the optimizer back‑traces the query plan, matches available Projections, selects the one with the smallest estimated scan size, and rewrites the plan to use it without requiring query changes.

Consistency Guarantees

INSERT: new data blocks feed both the base table and all defined Projections, producing synchronized Parts.

SELECT: when a Projection is chosen, its data is merged with the base data at runtime, preserving result correctness.

MUTATION: column changes trigger re‑materialization of dependent Projections, ensuring atomic updates.

5. Production Evaluation

Projections dramatically reduce query latency (e.g., a dashboard that previously took 30 s now renders in 1 s) and improve concurrency because they eliminate heavy aggregation work. Normal Projections increase storage by ~40 %, while Aggregate Projections add negligible overhead. Limitations include part‑level granularity (no cross‑part aggregation), inability to use separate storage tiers, and lack of join support.

6. Summary

Projection acts as a production‑ready materialized view for ClickHouse, offering automatic query acceleration, strong consistency across INSERT/SELECT/UPDATE/DELETE, and flexible definition directly from queries. It enhances ClickHouse’s OLAP capabilities while keeping the system’s SQL‑centric and block‑oriented design.

data engineeringperformance optimizationClickHouseOLAPProjection
DataFunSummit
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.