
Cold‑Hot Storage Architecture and Practices in Shopee ClickHouse

Shopee ClickHouse implements a cold‑hot storage architecture by mounting JuiceFS‑backed S3 as a remote volume. Table‑level storage policies and TTL‑driven background moves shift data from SSD to cold storage, delivering cost‑effective scalability and better disk utilization, with performance validated in production despite operational challenges such as Redis memory pressure and small‑file handling.

Shopee Tech Team

Shopee ClickHouse is a high‑availability distributed analytical database derived from the open‑source ClickHouse engine. This article describes the cold‑hot storage architecture built on JuiceFS and S3, the implementation details, performance benchmarks, operational issues, and practical solutions.

Overall Cluster Architecture

ClickHouse provides a column‑oriented OLAP engine with vectorized execution; Shopee ClickHouse builds on the open‑source engine through continuous in‑house development and architecture evolution. The cluster consists of an SLB for request distribution, a Proxy layer for query routing, a ClickHouse DB cluster (SSD‑based compute clusters and SATA‑based storage clusters), and a remote object store (Shopee STO S3) used for cold data.

Cold‑Hot Storage Solution

The cold‑hot design mounts remote object storage via the JuiceFS client, exposing it as a local path. Table‑level storage policies allow different teams to define their own hot‑cold thresholds. New tables specify the cold‑storage policy at creation, and existing tables are altered to use the policy. Data is moved between SSD and S3 based on TTL expressions.

Typical table‑level storage policy setting (simplified):

storage_policy = 'hcs_ck'
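The policy name must also be defined in the server‑side storage configuration. A minimal sketch, assuming the JuiceFS file system is mounted at /mnt/jfs and using illustrative disk names (the actual disk names and mount point in Shopee's deployment are not given in this article):

```xml
<storage_configuration>
    <disks>
        <!-- hot tier: local SSD (the default disk) -->
        <default></default>
        <!-- cold tier: the JuiceFS mount, exposed to ClickHouse as a local path -->
        <jfs_s3>
            <path>/mnt/jfs/</path>
        </jfs_s3>
    </disks>
    <policies>
        <hcs_ck>
            <volumes>
                <v_ssd>
                    <disk>default</disk>
                </v_ssd>
                <v_cold>
                    <disk>jfs_s3</disk>
                </v_cold>
            </volumes>
        </hcs_ck>
    </policies>
</storage_configuration>
```

Volumes are tried in order, so newly inserted parts land on v_ssd first; the TTL expressions in the table definition then drive the background relocation to v_cold.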

Example CREATE TABLE with TTL and storage policy:

CREATE TABLE db_name.hcs_table_name (
    `log_time` DateTime64(3),
    `log_level` String,
    ...
    `create_time` DateTime DEFAULT now()
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/db_name.hcs_table_name', '{replica}')
PARTITION BY toYYYYMMDD(log_time)
ORDER BY (ugi, ip)
TTL toDateTime(log_time) TO VOLUME 'v_ssd',
    toDateTime(log_time) + toIntervalDay(7) TO VOLUME 'v_cold',
    toDateTime(log_time) + toIntervalDay(14) DELETE
SETTINGS index_granularity = 16384,
    storage_policy = 'hcs_ck',
    parts_to_throw_insert = 1600;
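For tables that already exist, the policy is attached with ALTER statements. A hedged sketch of what that looks like, reusing the table and volume names from the example above (exact syntax and restrictions depend on the ClickHouse version; changing storage_policy requires the new policy to contain all disks of the old one):

```sql
-- attach the cold-hot storage policy to an existing table
ALTER TABLE db_name.hcs_table_name
    MODIFY SETTING storage_policy = 'hcs_ck';

-- then (re)define the tiering TTL so background moves kick in
ALTER TABLE db_name.hcs_table_name
    MODIFY TTL toDateTime(log_time) TO VOLUME 'v_ssd',
               toDateTime(log_time) + toIntervalDay(7) TO VOLUME 'v_cold',
               toDateTime(log_time) + toIntervalDay(14) DELETE;
```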

Benchmark Results

Two storage back‑ends (S3 and Ozone, each mounted via JuiceFS) were benchmarked using the Star Schema Benchmark (SSB, derived from TPC‑H) at scale 1000. Insert performance favored S3, while query latency varied significantly from query to query. Sample query latencies (seconds):

Query No.   JuiceFS (S3+Ozone)   ⅓ JuiceFS + ⅔ SSD   SSD only
Q1.1        8.884                8.966               1.417
Q1.2        0.921                0.998               0.313
Q2.1        68.148               36.273              5.450
Q4.3        33.667               2.813               2.357

Based on overall cost, stability, functionality, performance and scalability, S3 was chosen as the cold storage medium.

Implementation Details

Data parts are moved by a dedicated background thread pool ( background_move_pool ), distinct from the general background pool. The mover selects parts whose TTL indicates they should be relocated, clones each part onto the destination volume, swaps the clone in as the active part, and then removes the original. Moves are transparent to queries because at any moment the active copy of a part lives on exactly one volume, either SSD or S3.

Key code snippets:

std::optional<BackgroundProcessingPool> background_move_pool; // dedicated thread pool for background moves
MergeTreePartsMover::selectPartsForMove // selects parts whose TTL says they belong on another volume
MergeTreeData::moveParts // clones each selected part onto the destination volume
void MergeTreeData::swapActivePart(MergeTreeData::DataPartPtr part_copy) { ... } // swaps the clone in as the active part

Operational Issues and Solutions

Redis Memory Growth

JuiceFS stores its metadata in Redis, and every file consumes Redis memory, so a flood of small files can exhaust the instance and cause mount failures. Small files typically accumulate when many tiny parts are moved to S3 before they can be merged; monitoring the clickhouse_merge metric and alerting when a node's merge frequency drops below 0.5 merges per three hours helps detect the problem early.
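As a rough capacity check, the Redis footprint can be estimated from the file count. This is a minimal sketch: the ~300 bytes of Redis memory per file is a ballpark figure cited in JuiceFS documentation, not a measurement from this cluster.

```shell
# Back-of-envelope estimate of Redis memory needed for JuiceFS metadata.
# ASSUMPTION: ~300 bytes of Redis memory per file (order of magnitude only).
files=100000000        # e.g. 100 million small files
bytes_per_file=300
awk -v f="$files" -v b="$bytes_per_file" \
    'BEGIN { printf "estimated Redis memory: %.1f GiB\n", f * b / 1024 / 1024 / 1024 }'
```

With these inputs the estimate lands near 28 GiB, which is why capping small‑file growth matters as much as sizing the Redis instance.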

Handling Excess Small Files

When many tiny parts are already on S3, first stop further moves with SYSTEM STOP MOVES [db.]table, then locate the offending tables via the error logs or system.parts, and move the small parts back to SSD with ALTER TABLE table MOVE PARTITION ... TO VOLUME 'ssd_volume' so they can be merged locally before being moved again.
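To locate the offending tables via system.parts, a query along these lines can help (the cold disk name and the 1 MiB threshold are illustrative assumptions, not values from the article):

```sql
-- find tables with many tiny active parts already on the cold disk
SELECT
    database,
    table,
    count()            AS small_parts,
    sum(bytes_on_disk) AS total_bytes
FROM system.parts
WHERE active
  AND disk_name = 'jfs_s3'        -- cold disk name from the storage policy
  AND bytes_on_disk < 1048576     -- parts smaller than 1 MiB
GROUP BY database, table
ORDER BY small_parts DESC;
```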

JuiceFS S3 Read/Write Failures

Errors such as “send request to S3 host name certificate expired” are diagnosed via JuiceFS logs (e.g., grep 'juicefs' /var/log/messages) and resolved with the S3 team. Monitoring the juicefs_object_request_errors metric provides early warning.

ClickHouse Server Startup Failures

Inconsistent TTL definitions between local .sql files and ZooKeeper can prevent server start. Ensure TTL changes are applied to both ZooKeeper and local metadata before restarting.
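One way to inspect the copy stored in ZooKeeper before restarting is through the system.zookeeper table. The path below follows the pattern from the CREATE TABLE example; the {layer}-{shard} macros must be replaced with their expanded values, since macros are not substituted inside query literals:

```sql
-- table metadata (including the TTL clause) as recorded in ZooKeeper
SELECT value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/{layer}-{shard}/db_name.hcs_table_name'
  AND name = 'metadata';
```

Comparing this value against the TTL clause in the table's local .sql file shows whether the two copies are in sync.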

Suspicious Broken Parts

During move operations, ClickHouse creates a marker file (DELETE_ON_DESTROY_MARKER_PATH in the code). If such parts accumulate past the suspicious_broken_parts threshold, the server refuses to start. Remedies include deleting the problematic table’s metadata and recreating the table, or creating a force_restore_data flag under the /flags directory and restarting.
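Creating the flag is a one‑liner. A sketch, assuming the default data directory /var/lib/clickhouse (override via CLICKHOUSE_DATA if your installation differs):

```shell
# Create the force_restore_data flag so that, at next startup, ClickHouse
# skips the broken-parts check and restores replica metadata from ZooKeeper.
DATA_DIR=${CLICKHOUSE_DATA:-/var/lib/clickhouse}
mkdir -p "$DATA_DIR/flags"
touch "$DATA_DIR/flags/force_restore_data"
chown clickhouse:clickhouse "$DATA_DIR/flags/force_restore_data" 2>/dev/null || true
```

The flag is consumed on startup, so it does not need to be removed afterwards.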

Monitoring Metrics

JuiceFS: juicefs_object_request_errors

Redis: Memory usage

ClickHouse: clickhouse_merge metric

Benefits

After deploying cold‑hot tiering, average disk usage dropped from ~85% to ~75%, enabling two additional data services and storing >90 TB of compressed data on S3. The architecture also improves scalability and reduces storage costs.

Conclusion

The JuiceFS + S3 cold‑hot tiering solution provides flexible, cost‑effective storage for ClickHouse workloads. Ongoing work includes extending JuiceFS to HDFS and further separating storage and compute.

Tags: Performance Benchmark, ClickHouse, Data Tiering, Cold‑Hot Storage, Distributed OLAP, JuiceFS, S3
Written by Shopee Tech Team

How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team explores cutting‑edge technology concepts and applications with you.