Using ClickHouse for Large‑Scale User Behavior Analysis at Beike Zhaofang
This article details how Beike Zhaofang leveraged the ClickHouse columnar OLAP database for large‑scale user behavior analysis, covering its architecture, key features, performance benchmarks against other engines, data ingestion pipelines, custom UDFs for funnel and retention metrics, deployment setup, and future enhancements.
User behavior analysis is crucial for product growth; Beike Zhaofang built an internal analytics platform called Compass that uses ClickHouse as the query engine to support second‑level custom analysis on billions of log records.
What is ClickHouse?
ClickHouse is an open‑source column‑oriented OLAP DBMS originally developed by Yandex for web analytics. It offers fast SQL‑like queries, rich aggregation functions, columnar storage, a vectorized execution engine, real‑time ingestion, and lacks traditional OLTP features such as transactions.
Key Features
SQL support with extensive array and aggregation functions.
Columnar storage that reduces I/O by reading only required columns, enabling up to 20× less data read.
Vector engine that processes data in batches, leveraging SIMD instructions for speed.
Blazing fast query performance – benchmarked to be 100× faster than Hive and 6‑10× faster than commercial columnar databases.
Real‑time data ingestion via MergeTree engines.
Limitations: no transactions, limited high‑concurrency OLTP, and poor support for long text.
Practical Experience – User Behavior Analysis
Project requirements included high performance, real‑time capability, and scalability. After evaluating several OLAP engines (Kylin, Druid, Tez, SparkSQL, Impala, ClickHouse), ClickHouse demonstrated superior speed (≈1 s for grouping 1 billion rows) and real‑time query ability, leading to its selection.
Data Ingestion & Table Design
Raw logs are cleaned with Spark Streaming, written to Kafka, and then ingested into ClickHouse via two pipelines: a real‑time stream using Hangout to read Kafka and write directly, and an offline pipeline using Flume → HDFS → Hive → Spark → ClickHouse. Tables use ReplicatedMergeTree with event and timestamp as primary keys and event_uuid for sharding.
Custom UDFs for Analytics
Two main analytics were implemented:
Event count and retention: a simple GROUP BY query on os and event_uuid .
Conversion funnel: a custom aggregate function FunnelEvtNameTag that sorts unordered events, matches them to a predefined step sequence, and returns the highest completed step per user. The result is further processed with ARRAY JOIN to obtain per‑dimension funnel metrics.
SELECT os, count(*), count(DISTINCT event_uuid)
FROM compass.logs_all
WHERE date = '2018-08-20' AND event = 'IMClick'
GROUP BY os; SELECT system_type,
countIf(1, steps >= 1) AS step1_cnt,
countIf(1, steps >= 2) AS step2_cnt,
...
FROM (
SELECT event_uuid,
max(funnel) AS steps,
arrayFilter(system_type -> isNotNull(system_type),
groupArray(system_type))[1] AS system_type
FROM (
SELECT event_uuid,
funnelEvtNameTag(86400000, 1534521600000, 1534607999999,
'SearchRequst','SearchClick','HouseDetailView',
'Click400','ConfirmClick400')(client_timestamp,
concat(event,'\u0001',system_type,'\u0001')) AS funnels
FROM compass.logs
WHERE event IN ('SearchRequst','SearchClick','HouseDetailView','Click400','ConfirmClick400')
AND client_timestamp BETWEEN 1534521600000 AND 1534608000000
AND data_date BETWEEN toDate('2018-08-18') AND toDate('2018-08-19')
GROUP BY event_uuid
)
GROUP BY event_uuid
) ARRAY JOIN funnels.1 AS funnel, funnels.2 AS dim
WHERE funnel > 0
GROUP BY system_type;These queries process ~20 billion rows in about 5 seconds.
Deployment & Ecosystem
The ClickHouse cluster runs on three 80‑core, 200 GB RAM Tencent Cloud machines with ReplicatedMergeTree + ZooKeeper for replication. Additional components include HDFS/MySQL/Kafka ingestion tools, a Tabix/Superset visual query UI, Grafana dashboards for slow‑query monitoring, and a custom SQL API.
Other Projects & Future Work
Beyond Compass, ClickHouse powers DMP, search behavior replay, and MerLin projects, handling >6 hundred million rows per day and ~50 k queries daily. Future efforts focus on performance bottleneck detection, sharding (resharding), higher concurrent query throughput, and fine‑grained data access control.
Conclusion
ClickHouse provides a fast, scalable query engine suitable for analyzing tens of billions of rows, and has become the backbone of Beike Zhaofang’s user behavior analytics and related data‑intensive services.
Beike Product & Technology
As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.