
User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.


The user behavior analysis system uses detailed user behavior data to analyze ordered sequences of user actions. It is a highly flexible real‑time analysis system with features such as parameter filtering, audience filtering, multi‑dimensional splitting, and group splitting. The system architecture is shown in Figure 1.1 and includes five basic analysis functions: event analysis, funnel analysis, attribution analysis, path analysis, and behavior tracking (retention). It uses behavior detail data as the fundamental metadata and group data for audience selection. After evaluating several computation engines, ClickHouse was selected.

Figure 1.1 User behavior analysis system architecture diagram

ClickHouse Cluster Deployment

ClickHouse tables are of two types: distributed tables and local tables. Distributed tables are logical, similar to views, and queries are usually run against them. The distributed table engine routes queries to local tables, aggregates the results, and returns them to the user; local tables store the actual data.

The ClickHouse cluster architecture is shown in Figure 2.1. It relies on the ReplicatedMergeTree engine family and ZooKeeper to implement table replication, providing high availability. The Distributed engine implements the distributed‑table mechanism, creating views over all shards for distributed queries. Within each shard, ZooKeeper coordinates replication so that the local‑table replicas stay in sync.

Figure 2.1 ClickHouse cluster structure diagram

Hive to ClickHouse Data Push

Data transfer from Hive to ClickHouse uses MapReduce. The push process, shown in Figure 3.1, reads data from Hive, routes each record by user ID to a local table of one shard, and finally relies on ReplicatedMergeTree replication, coordinated through ZooKeeper, to synchronize the data to the other replicas within that shard. Routing by user ID decouples the shards and improves query efficiency; the details are described in the Behavior Analysis section below.

Writing directly to a distributed table has several drawbacks:

After receiving data, the distributed table splits it into multiple parts and forwards them to other servers, increasing network traffic and merge workload, slowing writes and raising the risk of “Too many parts”.

Consistency issues: data lands on the distributed‑table machine first, then is asynchronously sent to the local‑table machine without consistency checks; if the distributed‑table machine crashes, data may be lost.

Writes are asynchronous by default, potentially causing temporary inconsistency.

High pressure on ZooKeeper.

Cannot achieve routing by user ID.

This article adopts a scheme that routes data in the map phase and writes directly to local tables. First, query the system.clusters table to obtain shard information (the IP addresses of the devices in each shard). Then route each row read from the Hive table to a shard batch according to the rule user_id % number_of_shards. When a batch reaches a configured size, push it to the corresponding shard: pick one device of that shard from the system.clusters result, write to its local table, and fall back to the other devices of the shard if the write fails.
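The map‑side routing and batching logic described above can be sketched in a few lines of Python. This is only an illustration: the row schema, the replica host names, and the write_local callback are hypothetical stand‑ins for the real MapReduce job and the ClickHouse client.

```python
import random

def route_and_flush(rows, shard_replicas, batch_size, write_local):
    """Route rows to per-shard batches by user_id % num_shards, flushing
    each batch to one replica of its shard when it reaches batch_size.

    rows           -- iterable of dicts with a "user_id" key (hypothetical schema)
    shard_replicas -- list of replica-host lists, one entry per shard
    write_local    -- callable (host, batch) -> bool, True on success
    """
    num_shards = len(shard_replicas)
    batches = [[] for _ in range(num_shards)]

    def flush(shard):
        # Try one replica of the shard; on failure, fall back to the others.
        hosts = list(shard_replicas[shard])
        random.shuffle(hosts)
        for host in hosts:
            if write_local(host, batches[shard]):
                batches[shard] = []
                return
        raise RuntimeError(f"all replicas of shard {shard} failed")

    for row in rows:
        shard = row["user_id"] % num_shards
        batches[shard].append(row)
        if len(batches[shard]) >= batch_size:
            flush(shard)
    # Flush whatever remains at the end of the input.
    for shard in range(num_shards):
        if batches[shard]:
            flush(shard)
```

Because each row is assigned to exactly one shard before any network write, the distributed‑table forwarding step, and its extra merges, disappear entirely.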

Figure 3.1 Hive‑to‑ClickHouse data push flow diagram

User Data and Table Design

User data includes behavior detail data and group data. Behavior detail data records a user's action at a specific timestamp; group data represents predefined or computed relationships between users and groups. Behavior detail data serves as the analysis metadata, while group data provides filtering conditions for targeted analysis.

1. Behavior Detail Data

Behavior detail data records a user's action at a specific time point, consisting of user ID, timestamp, and behavior attributes. For manageability, as shown in Figure 4.1, identical behavior attributes are packaged into events. An event consists of an event name and its constituent attributes, and events with the same attributes are grouped into the same event type.

Figure 4.1 Event‑user behavior relationship diagram

The packaged behavior detail data consists of event type, event name, timestamp, and event attribute fields. The behavior detail table therefore contains: user ID, event type, event name, timestamp, and the event attribute fields. Since the attributes are determined by the event, the attribute columns are shared across events and managed by the database.

Event type is used as a partition of the behavior detail table to enable fast filtering.

Sorting is required for chronological analysis of a user's action sequence. Therefore, event name, user ID, and timestamp are used as sorting fields, with priority: event name > user ID > timestamp.

2. Group Data

Group data has two types: tag group data and event group data. Tag group data contains explicit user‑group relationships; event group data is derived from behavior detail data. Group data is stored using bitmaps to reduce storage and accelerate computation.
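To illustrate why bitmap storage helps, here is a toy Python sketch in which a plain integer stands in for a bitmap; the function names are invented for illustration, and a real deployment would use ClickHouse's bitmap aggregate states instead.

```python
def ids_to_bitmap(user_ids):
    """Pack a set of integer user IDs into one Python int used as a bitmap.
    Bit i is set when user i belongs to the group."""
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

def bitmap_cardinality(bm):
    """Number of users in the group (population count of the bitmap)."""
    return bin(bm).count("1")

# Audience filtering becomes cheap bitwise arithmetic:
group_a = ids_to_bitmap([1, 5, 9])
group_b = ids_to_bitmap([5, 9, 12])
both = group_a & group_b  # users belonging to both groups
```

Intersections, unions, and cardinalities over millions of users reduce to bitwise operations on compact bitmaps, which is what makes group filtering fast at query time.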

Tag group data

Tag group data comes from DMP calculations of user‑group relationships. To ensure routing by user ID, tag group creation follows three steps (Figure 4.2): data from Hive is pushed to ClickHouse at user‑ID granularity, then split and written to a split table, and finally written to an AggregatingMergeTree aggregation table where the group is the aggregation granularity and user IDs are combined into a bitmap.

Figure 4.2 Tag group creation flow diagram

Event group data

Since event group data already follows user‑ID routing via behavior detail data, we only need to perform the third step of tag group creation: extract user‑group relationships from events and write them into an AggregatingMergeTree aggregation table.

Behavior Analysis

Basic user behavior analysis scenarios include event analysis, funnel analysis, path analysis, behavior tracking, and attribution analysis. The basic analysis structure is shown in Figure 5.1. First, filter behavior detail data by time range, event range, and group range. Then, aggregate by user ID and dimensions to obtain user behavior sequences, and finally perform aggregation calculations on the sequences.

Figure 5.1 Basic analysis structure diagram

Because data is routed by user ID, all data for a single user resides in the same shard, and there is no coupling between shards. Analysis SQL can therefore be executed on all shards in parallel and the per‑shard results simply summed. This differs from ordinary distributed‑table parallelism, where shards return partial aggregate states that the initiating node still has to merge; with shard‑level parallelism, each shard completes both the function evaluation and the detail aggregation locally, and the initiator only sums the final results, which is more efficient.
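A minimal Python sketch of why the summation is valid (the row schema is hypothetical): counting distinct users per shard and summing only works because a given user's rows live on exactly one shard.

```python
def parallel_unique_users(shards):
    """Count distinct users by summing per-shard distinct counts.

    shards -- list of row lists, one per shard; rows are dicts with "user_id".
    Correct only under user-ID routing: each user's rows all live on
    exactly one shard, so per-shard distinct sets never overlap.
    """
    return sum(len({row["user_id"] for row in rows}) for rows in shards)
```

Without the routing guarantee, the same user could appear on several shards and the sum would over‑count.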

1. Event Analysis

Event analysis aggregates detail data and does not consider user behavior sequences. It counts events across all users. Since data is routed by user ID, distinct counts (e.g., distinct order numbers) may be duplicated across shards, so results cannot be simply summed.

To resolve distinct‑count conflicts, an ordering field is added to bind the distinct field to a user. For example, in a group‑order scenario, binding order number to user yields correct distinct counts. This method trades space for time and is suitable because only a few fields require distinct counting across users.

Figure 5.2 Order number bound to user in a group‑order scenario
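The group‑order case can be sketched in Python to show both the problem and the fix; the "owner" field below is a hypothetical name for the added ordering field that binds an order number to one user.

```python
def _shard(rows, num_shards):
    """Distribute rows to shards by user_id % num_shards."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row["user_id"] % num_shards].append(row)
    return shards

def distinct_orders_naive(rows, num_shards):
    """Per-shard distinct-order sum. Over-counts an order shared by
    users who land on different shards."""
    return sum(len({r["order_id"] for r in shard})
               for shard in _shard(rows, num_shards))

def distinct_orders_bound(rows, num_shards):
    """Count an order only on rows where it is bound to its owner user,
    so each order is counted on exactly one shard and sums are exact."""
    return sum(len({r["order_id"] for r in shard if r["user_id"] == r["owner"]})
               for shard in _shard(rows, num_shards))
```

With one order shared by two users on different shards, the naive version counts it twice while the bound version counts it once.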

2. Funnel Analysis

Funnel analysis calculates conversion rates within a time window. ClickHouse provides the windowFunnel(window)(timestamp, cond1, cond2, …) function [4]. The official implementation (Figure 5.3) sorts the user's behavior sequence by timestamp, extracts the level information, builds a feature vector recording the timestamps that satisfy each level within the window, and counts the non‑empty timestamps as the maximum level the user reached.

Figure 5.3 ClickHouse official windowFunnel function flow
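The core semantics can be approximated with a simplified Python sketch. This is deliberately reduced: it tracks only the chain anchored at the user's earliest step‑1 event, whereas the real function evaluates every candidate chain.

```python
def window_funnel(window, events):
    """Simplified sketch of windowFunnel semantics for one user.

    events -- list of (timestamp, level) pairs, level is a 1-based funnel step.
    Returns the deepest step reached where every step falls within
    `window` of the chain's step-1 timestamp.
    """
    events = sorted(events)
    best = 0
    chains = {}  # level reached -> step-1 timestamp of that chain
    for ts, level in events:
        if level == 1:
            chains.setdefault(1, ts)  # keep the earliest step-1 only
            best = max(best, 1)
        elif level - 1 in chains and ts - chains[level - 1] <= window:
            chains[level] = chains[level - 1]  # carry the chain start forward
            best = max(best, level)
    return best
```

Running this per user and then histogramming the returned levels yields the funnel's per‑step user counts.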

The built‑in windowFunnel cannot split results by dimension. This article defines a custom dimFunnel(window)(timestamp, dimension, step1, step2, …) function that records a map from each first‑step dimension value to its feature vector, enabling dimension‑level splitting. The flow is shown in Figure 5.4.

Figure 5.4 Custom dimFunnel function flow
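A hedged sketch of the dimFunnel idea, under the simplifying assumptions that every event carries the dimension value and that only the earliest chain per dimension is tracked:

```python
def dim_funnel(window, events):
    """Sketch of dimension-split funnel depth for one user.

    events -- list of (timestamp, dimension, level) tuples.
    Returns {dimension value: deepest step reached}.
    """
    events = sorted(events)
    chains = {}  # dimension -> (chain start timestamp, deepest level so far)
    for ts, dim, level in events:
        if level == 1 and dim not in chains:
            chains[dim] = (ts, 1)
        elif dim in chains:
            start, depth = chains[dim]
            if level == depth + 1 and ts - start <= window:
                chains[dim] = (start, level)
    return {dim: depth for dim, (_, depth) in chains.items()}
```

Merging these per‑user maps across all users gives the funnel broken down by dimension in a single pass, instead of one windowFunnel query per dimension value.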

3. Path Analysis

Path analysis counts user navigation paths within a session. The flow (Figure 5.5) sorts the behavior sequence, extracts the valid segment between the start and end events, removes consecutive duplicates, splits the sequence into sessions wherever the interval between adjacent actions exceeds the session timeout, keeps at most ten actions per session, and finally keeps the sessions that satisfy the start and end conditions.

Figure 5.5 Path analysis computation flow
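The dedup‑and‑split preprocessing above can be sketched as follows (a minimal illustration; "duplicates" here means consecutive repeats of the same action, and timestamps and session_gap share one unit):

```python
def split_sessions(events, session_gap, max_len=10):
    """Split one user's behavior sequence into sessions.

    events -- list of (timestamp, action) pairs.
    A new session starts when the gap between adjacent actions exceeds
    session_gap; consecutive duplicate actions are dropped and each
    session keeps at most max_len actions.
    """
    events = sorted(events)
    sessions, current = [], []
    prev_ts, prev_action = None, None
    for ts, action in events:
        if prev_ts is not None and ts - prev_ts > session_gap:
            sessions.append(current)
            current, prev_action = [], None
        if action != prev_action and len(current) < max_len:
            current.append(action)
            prev_action = action
        prev_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Each resulting session is then checked against the start and end conditions before being counted into the path tree.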

4. Behavior Tracking (Retention)

Behavior tracking, also called retention analysis, measures how many users who performed action A later performed action B. ClickHouse provides the retention(start_condition, subsequent_condition1, …) function [5]. The flow (Figure 5.6) initializes a result array, sets the first element if the start condition holds, then checks each subsequent condition, marking it with 1 only when both it and the start condition are satisfied.

Figure 5.6 ClickHouse official retention function flow
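The per‑user semantics of retention() reduce to a short sketch (conditions here are precomputed booleans such as "was active on day N"):

```python
def retention(conditions):
    """Sketch of ClickHouse retention() semantics for one user.

    conditions -- list of booleans; element 0 is the start condition.
    Every later element is 1 only if it AND the start condition hold.
    """
    if not conditions:
        return []
    start = 1 if conditions[0] else 0
    return [start] + [1 if start and c else 0 for c in conditions[1:]]
```

Summing these arrays element‑wise across users gives the classic retention curve: position 0 is the cohort size, position N the users retained on day N.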

Retention can be calculated per dimension of the start event (Figure 5.7). If a dimension does not appear on a given day, retention for that dimension on the next day is zero. Finally, results are aggregated by activity.

Figure 5.7 Dimension‑based retention calculation flow

5. Attribution Analysis

Attribution analysis quantifies the contribution of a series of user actions to a final goal. Common models include first‑touch, last‑touch, and linear attribution. The flow (Figure 5.8) sorts the behavior sequence, truncates at the target event, extracts attribution details (ratio, metric, dimension event ID, dimension value), and aggregates by dimension event ID and value. The final attribution ratio is the proportion of each dimension’s aggregated ratio to the total.

Figure 5.8 Attribution analysis computation flow
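Of the models named above, linear attribution is the simplest to sketch; the snippet below is an illustration for one user's sequence, not the article's exact implementation, and truncates at the first occurrence of the target event.

```python
def linear_attribution(events, target):
    """Linear attribution for one user: every touchpoint before the
    (first) target event gets an equal share of the credit.

    events -- time-ordered list of event names.
    Returns {touchpoint: credit ratio}; empty if the target never occurs
    or has no preceding touchpoints.
    """
    if target not in events:
        return {}
    touches = events[: events.index(target)]  # truncate at the target event
    if not touches:
        return {}
    share = 1.0 / len(touches)
    credit = {}
    for name in touches:
        credit[name] = credit.get(name, 0.0) + share
    return credit
```

First‑touch and last‑touch variants differ only in giving the full share to touches[0] or touches[-1]; aggregating the per‑user maps by dimension yields the final attribution ratios.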

Summary and Outlook

This article presented the components of a user behavior analysis system, ClickHouse cluster deployment, Hive‑to‑ClickHouse data push, data modeling, and five analysis methods. By decoupling data per shard, parallel computation is achieved, improving efficiency, though thread consumption scales with the number of shards. While ClickHouse provides native functions, custom functions offer further performance gains.

Tags: analytics, big data, data pipeline, ClickHouse, Hive, user behavior analysis
Written by HelloTech

Official Hello technology account, sharing tech insights and developments.