Databases 20 min read

Using ClickHouse for High‑Performance Keyword Hit Statistics

This article presents the background and challenges of large‑scale keyword hit statistics, explains why traditional MySQL solutions struggle, and details how ClickHouse’s columnar storage, vectorized execution, and distributed architecture provide fast, scalable analytics, including cluster setup, table schema, queries, and migration lessons.

Sohu Tech Products
Sohu Tech Products
Sohu Tech Products
Using ClickHouse for High‑Performance Keyword Hit Statistics

1. Background

1.1 Business Requirement

Audit systems need daily statistics of keyword hit counts, with filtering and sorting based on trigger times and deletion rates.

1.2 Problem Difficulty

Keyword table contains ~700,000 rows; daily statistics generate ~21,000,000 rows.

Deletion rate = delete_times / trigger_times, requiring aggregation of all trigger and delete events before filtering and sorting.

1.2.1 Traditional MySQL Solution

Data is sharded by month; each shard holds up to 21 million rows. Queries must identify relevant shards, filter, aggregate, sort, and merge results, which is costly.

1.2.2 MySQL‑Java Solution

Aggregation and sorting steps are moved to JVM memory to avoid overloading MySQL nodes, but indexing offers little benefit for the required filters.

1.2.3 Issues with Traditional Approaches

Flexible filter conditions lead to many indexes, hurting write performance.

Aggregated data is filtered and sorted, making indexes ineffective.

Large data volume overwhelms the database and degrades concurrency.

2. ClickHouse Solution

2.1 ClickHouse Overview

ClickHouse is an open‑source high‑performance analytical SQL database (OLAP) developed by Yandex in 2016. It stores data column‑wise, uses vectorized execution, and provides fast compression and high‑throughput queries.

Key advantages:

Columnar storage with vectorized processing for efficient CPU usage.

High compression ratios (10×+), reducing I/O and improving cache effectiveness.

Non‑B‑tree indexes that do not require left‑most prefix.

Very fast writes (50‑200 M rows/s).

2.2 How ClickHouse Solves the Problems

2.2.1 Columnar Storage – Flexible Filters & Fast Response

Only columns needed for a query are read, dramatically lowering I/O. Same‑type column data compresses well, yielding small on‑disk sizes and better cache utilization.

2.2.2 Multi‑Core Parallelism & Vectorized Execution

Data is partitioned and processed concurrently across CPU cores. Vectorized execution runs SIMD instructions on batches of column values, reducing function‑call overhead and cache misses.

2.2.3 Data Sharding & Distributed Computing

ClickHouse distributes data across shards and replicas. Queries are automatically split into tasks and executed in parallel on multiple nodes, then results are merged.

2.2.4 OLAP‑Oriented Technology Selection

Compared with TiDB and DorisDB, ClickHouse offers superior query performance for large‑scale aggregation, better support for array‑type columns, and simpler update handling for the keyword use‑case.

3. Using ClickHouse for Keyword Hit Statistics

3.1 Cluster Creation

A cluster named cluster_3shards_2replicas with 6 nodes (3 shards, 2 replicas) is deployed across three physical machines.

3.2 Table Design

Two tables are created: a keyword metadata table and a daily hit‑statistics table.

3.2.1 Keyword Table DDL

create table monitor_banword_local on cluster cluster_3shards_2replicas (
    id Int32 COMMENT '关键词ID',
    type UInt8 COMMENT '关键词类型',
    name String COMMENT '关键词',
    product Array(UInt8) COMMENT '适用产品',
    parent_id UInt8 COMMENT '级别',
    is_core UInt8 COMMENT '是否核心词:0:否;1:是',
    status UInt8
) ENGINE = MergeTree
PARTITION BY type
ORDER BY type

3.2.2 Hit Statistics Table DDL

CREATE TABLE banword_hit_statistics_mt_2021_local on cluster cluster_3shards_2replicas (
    id Int32 COMMENT '统计ID',
    hit_banword_id Int32 COMMENT '命中关键词ID',
    trigger_times Int32 COMMENT '触发次数',
    delete_times Int32 COMMENT '删除次数',
    delete_rate Decimal(9,6) COMMENT '删除率',
    snapshot_date Date COMMENT '数据快照时间',
    update_time Date COMMENT '更新时间'
) ENGINE = MergeTree
PARTITION BY toYYYYMM(snapshot_date)
ORDER BY (snapshot_date, hit_banword_id)

3.2.3 Distributed Views

CREATE TABLE monitor_banword_all ON CLUSTER cluster_3shards_2replicas (
    id Int32 COMMENT '关键词ID',
    type UInt8 COMMENT '关键词类型',
    name String COMMENT '关键词',
    product Array(UInt8) COMMENT '适用产品',
    parent_id UInt8 COMMENT '级别',
    is_core UInt8 COMMENT '是否核心词:0:否;1:是',
    status UInt8
) ENGINE = Distributed(cluster_3shards_2replicas, default, monitor_banword_local, intHash64(type));

CREATE TABLE banword_hit_statistics_mt_2021_all ON CLUSTER cluster_3shards_2replicas (
    id Int32 COMMENT '统计ID',
    hit_banword_id Int32 COMMENT '命中关键词ID',
    trigger_times Int32 COMMENT '触发次数',
    delete_times Int32 COMMENT '删除次数',
    delete_rate Decimal(9,6) COMMENT '删除率',
    snapshot_date Date COMMENT '数据快照时间',
    update_time Date COMMENT '更新时间'
) ENGINE = Distributed(cluster_3shards_2replicas, default, banword_hit_statistics_mt_2021_local, toMonth(snapshot_date));

3.2.4 Sample Query

select
    hit_banword_id,
    sum(trigger_times) as triggerTimes,
    sum(delete_times) as deleteTimes,
    if(deleteTimes>0, divide(deleteTimes, triggerTimes), 0) as deleteRate
from default.banword_hit_statistics_mt_2021_all
where hit_banword_id GLOBAL IN (
        select id from monitor_banword_all where parent_id = 2
    )
    and snapshot_date between toDate('2022-01-01') and toDate('2022-07-01')
group by hit_banword_id
having triggerTimes > 1
order by deleteRate desc
limit 5000;

The query processes 120 million rows in about 1.8 seconds, demonstrating ClickHouse’s speed.

3.3 Migration Pitfalls

Too many partitions per INSERT block – increase max_partitions_per_insert_block or set it to 0.

SQL syntax nuances: use backticks for column names, specify the database explicitly, and use ALTER … ON CLUSTER … DELETE/UPDATE for distributed tables.

3.4 Comparison with Traditional Solutions

MySQL

MySQL‑Java

ClickHouse

1 day (0.7 M rows)

5 s

12 min

350 ms

1 month (21 M rows)

2 min

3.7 h

<2 s

6 months (120 M rows)

>20 min

>12 h

<2 s

ClickHouse consistently delivers sub‑second response times even on massive datasets, while MySQL‑based approaches become impractically slow.

4 Summary & Outlook

ClickHouse’s columnar storage, high compression, vectorized execution, and distributed architecture make it ideal for large‑scale statistical workloads. It offers simple JDBC connectivity and rich SQL support, though it lacks strong concurrency for mixed workloads and does not provide transactions.

Future work includes extending the approach to other structured data analytics such as AI‑generated text detection statistics and historical audit record analysis.

Big DataSQLClickHouseColumnar DatabaseKeyword Statistics
Sohu Tech Products
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.