
Using ClickHouse for Large‑Scale User Behavior Analysis at Beike Zhaofang

This article details how Beike Zhaofang leveraged the ClickHouse columnar OLAP database for large‑scale user behavior analysis, covering its architecture, key features, performance benchmarks against other engines, data ingestion pipelines, custom UDFs for funnel and retention metrics, deployment setup, and future enhancements.

Beike Product & Technology

User behavior analysis is crucial for product growth; Beike Zhaofang built an internal analytics platform called Compass that uses ClickHouse as the query engine to support second‑level custom analysis on billions of log records.

What is ClickHouse?

ClickHouse is an open‑source column‑oriented OLAP DBMS originally developed by Yandex for web analytics. It offers fast SQL‑like queries, rich aggregation functions, columnar storage, a vectorized execution engine, real‑time ingestion, and lacks traditional OLTP features such as transactions.

Key Features

SQL support with extensive array and aggregation functions.

Columnar storage that reads only the columns a query references, reducing I/O by up to 20× compared with reading full rows.

Vector engine that processes data in batches, leveraging SIMD instructions for speed.

Blazing fast query performance – benchmarked to be 100× faster than Hive and 6‑10× faster than commercial columnar databases.

Real‑time data ingestion via MergeTree engines.

Limitations: no transactions, limited high‑concurrency OLTP, and poor support for long text.
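As a rough illustration of the columnar storage and aggregation features above, a query like the following (against a hypothetical hits table, not one of Beike's actual tables) touches only the three columns it names, so ClickHouse reads a small fraction of the data a row store would while aggregating in vectorized batches:

```sql
-- Hypothetical example: only the date, event, and user_id columns
-- are read from disk; count() and uniqExact() are ClickHouse
-- aggregate functions executed over column batches.
SELECT event,
       count() AS pv,
       uniqExact(user_id) AS uv
FROM hits                     -- illustrative log table
WHERE date = '2018-08-20'
GROUP BY event
ORDER BY pv DESC
LIMIT 10;
```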

Practical Experience – User Behavior Analysis

Project requirements included high performance, real‑time capability, and scalability. After evaluating several OLAP engines (Kylin, Druid, Tez, SparkSQL, Impala, ClickHouse), ClickHouse demonstrated superior speed (≈1 s for grouping 1 billion rows) and real‑time query ability, leading to its selection.
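A query of the kind used in such an evaluation might look like the sketch below (the table name is illustrative): a single-pass aggregation over roughly a billion rows, which ClickHouse answered in about one second on the test hardware.

```sql
-- Illustrative evaluation query: full-scan GROUP BY over ~1 billion
-- rows, the workload cited in the engine comparison above.
SELECT event,
       count() AS pv
FROM compass.logs_all         -- hypothetical distributed log table
GROUP BY event
ORDER BY pv DESC;
```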

Data Ingestion & Table Design

Raw logs are cleaned with Spark Streaming, written to Kafka, and then ingested into ClickHouse via two pipelines: a real‑time stream using Hangout to read Kafka and write directly, and an offline pipeline using Flume → HDFS → Hive → Spark → ClickHouse. Tables use ReplicatedMergeTree with event and timestamp as primary keys and event_uuid for sharding.
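The table layout described above might be sketched as follows; the schema, cluster name, and ZooKeeper paths here are illustrative assumptions, not Beike's actual DDL:

```sql
-- Sketch of a ReplicatedMergeTree local table ordered by
-- (event, timestamp), as the article describes. Column names and
-- types are placeholders.
CREATE TABLE compass.logs ON CLUSTER compass_cluster
(
    data_date        Date,
    event            String,
    client_timestamp UInt64,
    event_uuid       String,
    system_type      Nullable(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/compass/logs', '{replica}')
PARTITION BY data_date
ORDER BY (event, client_timestamp);

-- Distributed table that fans queries out across shards and routes
-- writes by a hash of event_uuid, matching the sharding key above.
CREATE TABLE compass.logs_all ON CLUSTER compass_cluster
AS compass.logs
ENGINE = Distributed(compass_cluster, compass, logs, sipHash64(event_uuid));
```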

Custom UDFs for Analytics

Two main analytics were implemented:

Event count and retention: a simple GROUP BY over os, counting total events and distinct event_uuid values.

Conversion funnel: a custom aggregate function FunnelEvtNameTag that sorts unordered events, matches them to a predefined step sequence, and returns the highest completed step per user. The result is further processed with ARRAY JOIN to obtain per‑dimension funnel metrics.

SELECT os, count(*), count(DISTINCT event_uuid)
FROM compass.logs_all
WHERE date = '2018-08-20' AND event = 'IMClick'
GROUP BY os;

SELECT system_type,
       countIf(steps >= 1) AS step1_cnt,
       countIf(steps >= 2) AS step2_cnt,
       ...
FROM (
    SELECT event_uuid,
           max(funnel) AS steps,
           arrayFilter(x -> isNotNull(x),
                       groupArray(dim))[1] AS system_type
    FROM (
        SELECT event_uuid,
               funnelEvtNameTag(86400000, 1534521600000, 1534607999999,
                               'SearchRequst','SearchClick','HouseDetailView',
                               'Click400','ConfirmClick400')(client_timestamp,
                               concat(event,'\u0001',system_type,'\u0001')) AS funnels
        FROM compass.logs
        WHERE event IN ('SearchRequst','SearchClick','HouseDetailView','Click400','ConfirmClick400')
          AND client_timestamp BETWEEN 1534521600000 AND 1534607999999
          AND data_date BETWEEN toDate('2018-08-18') AND toDate('2018-08-19')
        GROUP BY event_uuid
    ) ARRAY JOIN funnels.1 AS funnel, funnels.2 AS dim
    WHERE funnel > 0
    GROUP BY event_uuid
)
GROUP BY system_type;

These queries process ~20 billion rows in about 5 seconds.

Deployment & Ecosystem

The ClickHouse cluster runs on three 80‑core, 200 GB RAM Tencent Cloud machines with ReplicatedMergeTree + ZooKeeper for replication. Additional components include HDFS/MySQL/Kafka ingestion tools, a Tabix/Superset visual query UI, Grafana dashboards for slow‑query monitoring, and a custom SQL API.
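For the slow-query monitoring mentioned above, one common approach, a sketch rather than necessarily Beike's exact setup, is to chart ClickHouse's own system.query_log from Grafana; the duration threshold here is illustrative:

```sql
-- Surface the slowest recent queries from ClickHouse's built-in
-- query log, suitable as a Grafana table panel data source.
SELECT query_start_time,
       query_duration_ms,
       read_rows,
       substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 20;
```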

Other Projects & Future Work

Beyond Compass, ClickHouse powers DMP, search behavior replay, and MerLin projects, handling over 600 million rows per day and roughly 50,000 queries daily. Future efforts focus on performance bottleneck detection, sharding (resharding), higher concurrent query throughput, and fine‑grained data access control.

Conclusion

ClickHouse provides a fast, scalable query engine suitable for analyzing tens of billions of rows, and has become the backbone of Beike Zhaofang’s user behavior analytics and related data‑intensive services.

Tags: data engineering, SQL, ClickHouse, OLAP, user behavior analysis, funnel analysis
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
