
Bitmap-Based User Segmentation in a DMP Platform Using ClickHouse

This article describes how a data management platform (DMP) at Beike leverages ClickHouse bitmap structures and Spark pipelines to generate global numeric user IDs, design tag-specific bitmap rules for enum, continuous, and date attributes, handle boundary cases, and produce high‑performance bitmap SQL for real‑time user group estimation and complex segment logic.

Beike Product & Technology

Beike's data management platform (DMP), in development since May 2018, provides user profiling, segmentation, and messaging capabilities for hundreds of millions of users across Beike and Lianjia. These workloads require user group packages to be generated quickly and accurately.

To overcome the limitations of Hive‑based relational tables and full‑table scans, the platform migrated the profiling data to ClickHouse, which natively supports bitmap data structures, enabling set‑based operations (intersection, union, complement) for user groups.

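Bitmaps require integer members, so billions of STRING-based user keys first had to be mapped to a global, continuous numeric join_id. A single global ROW_NUMBER() would funnel every row through one task, so the data is instead partitioned into sub-datasets, ROW_NUMBER() is applied within each partition, and each partition's row numbers are shifted by an offset so that the resulting integers are unique across partitions without causing data skew.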
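The partition-then-offset idea can be sketched in plain Python. This is a minimal illustration rather than the platform's Spark job: the helper names `partition_keys` and `assign_join_ids`, the CRC32-based partitioning, and the choice of "total size of all earlier partitions" as the offset are assumptions made for the example.

```python
from itertools import accumulate
from zlib import crc32


def partition_keys(keys, num_partitions):
    """Split string keys into sub-datasets by a stable hash (assumed scheme)."""
    parts = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[crc32(key.encode("utf-8")) % num_partitions].append(key)
    return parts


def assign_join_ids(partitions):
    """Give every key a unique integer in 1..N, with N the total key count.

    Inside each partition keys get a local row number (1-based, like
    ROW_NUMBER()); the partition's offset is the number of keys held by
    all earlier partitions, so the ranges never overlap and leave no gaps.
    """
    sizes = [len(part) for part in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    join_ids = {}
    for part, offset in zip(partitions, offsets):
        for row_number, key in enumerate(sorted(part), start=1):
            join_ids[key] = offset + row_number
    return join_ids
```

Because each partition only needs its own size and the sizes of the partitions before it, the numbering step can run in parallel once the per-partition counts are known.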

Bitmap construction follows three tag categories:

Enum tags: store a bitmap for each value and a full‑bitmap; equality selects the specific bitmap, inequality uses XOR with the full‑bitmap.

Continuous tags: build >=X bitmaps for each integer X in a defined range; equality is derived by XOR-ing adjacent bitmaps (=X is >=X XOR >=X+1), and the other operators map directly to the appropriate >=X bitmaps (for example, >X is >=X+1).

Date tags: convert dates to numeric yyyyMMdd, build <=X bitmaps for each day, and apply similar XOR logic for equality and other comparisons.
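The three rules above can be illustrated with plain Python sets standing in for ClickHouse bitmaps. This is only a sketch: every function name is hypothetical, the `<` rule assumes that every user's value is at least the lowest threshold, and the equality helpers cover interior values only (both neighbouring bitmaps must exist).

```python
from datetime import timedelta


def build_enum(users):
    """users: {join_id: enum value}. Returns (bitmap per value, full bitmap)."""
    by_value = {}
    for uid, value in users.items():
        by_value.setdefault(value, set()).add(uid)
    return by_value, set(users)


def enum_eq(by_value, value):
    return by_value.get(value, set())


def enum_ne(by_value, full, value):
    # The value bitmap is a subset of the full bitmap, so XOR removes it.
    return full ^ by_value.get(value, set())


def build_ge(users, lo, hi):
    """One '>= x' bitmap for each integer x in [lo, hi]."""
    return {x: {uid for uid, v in users.items() if v >= x}
            for x in range(lo, hi + 1)}


def cont_eq(ge, x):
    return ge[x] ^ ge[x + 1]      # (>= x) minus (>= x + 1)


def cont_gt(ge, x):
    return ge[x + 1]              # > x is the same set as >= x + 1


def cont_lt(ge, x):
    return ge[min(ge)] ^ ge[x]    # assumes every value is >= the lowest threshold


def day_key(d):
    """Numeric yyyyMMdd form of a datetime.date."""
    return int(d.strftime("%Y%m%d"))


def build_le(users, days):
    """One '<= day' bitmap per day, keyed by the numeric yyyyMMdd value."""
    return {day_key(d): {uid for uid, v in users.items() if v <= d}
            for d in days}


def date_eq(le, d):
    # The previous day comes from calendar arithmetic, not from key - 1:
    # 20240301 - 1 would give 20240300, which is not a date.
    return le[day_key(d)] ^ le[day_key(d - timedelta(days=1))]
```

Storing cumulative bitmaps means a range predicate needs at most one XOR of two stored bitmaps instead of a union over every value in the range.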

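Boundary handling covers queries at the edges of the defined range. For example, when only the >=0 through >=7 bitmaps exist, =7 cannot be computed as >=7 XOR >=8 because >=8 was never built; such queries are mapped to bitmaps that do exist so that they do not return empty results.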
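One way to handle the upper boundary is sketched below, again with Python sets in place of bitmaps. It rests on an assumption the text does not spell out: values are capped at the highest threshold, so the top >=X bitmap already holds exactly the users whose value equals X. The helper names are hypothetical.

```python
def build_ge(values, lo, hi):
    """One '>= x' bitmap for each integer x in [lo, hi]."""
    return {x: {uid for uid, v in values.items() if v >= x}
            for x in range(lo, hi + 1)}


def eq(ge, x):
    """= x, falling back to the '>= x' bitmap itself at the top of the range."""
    upper = ge.get(x + 1)
    if upper is None:
        # No '>= x + 1' bitmap exists; under the capping assumption
        # '>= x' is already the answer.
        return set(ge[x])
    return ge[x] ^ upper


def le(ge, x):
    """<= x, falling back to the full bitmap at the top of the range."""
    full = ge[min(ge)]
    upper = ge.get(x + 1)
    if upper is None:
        return set(full)
    return full ^ upper
```

The same pattern applies in mirror image to the lower edge of <=X date bitmaps, where the bitmap for the day before the first stored day is missing.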

In ClickHouse, a Null-engine table receives the bitmap strings, which are materialized into an AggregateFunction(groupBitmap, UInt32) column via base64Decode(). A Buffer-engine table sits in front of this pipeline to reduce the write load on the ClickHouse cluster.

Spark jobs read Hive tables, apply the bitmap rules using Dataset transformations, aggregate user IDs per tag value, and write the results to ClickHouse. For enum tags, rows are grouped directly by value; for continuous and date tags, flatMap first expands each row into one record per threshold it satisfies, and the expanded records are then aggregated.
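The flatMap expansion for a continuous tag can be mimicked in plain Python. This is a sketch of the idea only, not the Spark job itself: `expand_ge` plays the role of the function passed to flatMap, `aggregate` plays the role of the group-by step, and both names are hypothetical.

```python
from collections import defaultdict


def expand_ge(row, lo, hi):
    """Emit one (threshold, join_id) pair per '>= x' bitmap the row belongs to."""
    join_id, value = row
    # A user with value v is in every '>= x' bitmap for lo <= x <= min(v, hi).
    return [(x, join_id) for x in range(lo, min(value, hi) + 1)]


def aggregate(pairs):
    """Collect join_ids per threshold; each set stands in for one bitmap."""
    bitmaps = defaultdict(set)
    for threshold, join_id in pairs:
        bitmaps[threshold].add(join_id)
    return dict(bitmaps)


def build_bitmaps(rows, lo, hi):
    pairs = [pair for row in rows for pair in expand_ge(row, lo, hi)]
    return aggregate(pairs)
```

The expansion multiplies the row count by up to the number of thresholds, which is why the range of X has to be bounded in advance.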

Bitmap-based SQL is generated by translating tag logic into bitmap functions such as bitmapAnd, bitmapOr, bitmapXor, bitmapToArray, and bitmapCardinality, with GLOBAL IN replacing costly joins. With this approach, group sizes are estimated within seconds and complex segment logic runs within minutes.
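A small generator shows how a tree of tag conditions might be turned into such SQL. Everything schema-related here is an assumption for illustration: the table names `dmp.tag_bitmap` and `dmp.user_profile`, the columns `tag`, `rule`, `bitmap`, and `join_id`, and the use of groupBitmapMergeState to read the stored aggregate state. Only the bitmap function names and GLOBAL IN come from the text.

```python
TAG_TABLE = "dmp.tag_bitmap"      # hypothetical table of stored bitmaps
USER_TABLE = "dmp.user_profile"   # hypothetical table keyed by join_id

FUNCS = {"and": "bitmapAnd", "or": "bitmapOr", "xor": "bitmapXor"}


def leaf(tag, rule):
    """A single stored bitmap, e.g. leaf('visits', '>=3')."""
    return ("leaf", tag, rule)


def to_sql(node):
    """Recursively render a condition tree as a bitmap expression.

    Tag and rule strings are assumed to come from trusted platform
    metadata; real code should bind them as query parameters.
    """
    if node[0] == "leaf":
        _, tag, rule = node
        return (f"(SELECT groupBitmapMergeState(bitmap) FROM {TAG_TABLE} "
                f"WHERE tag = '{tag}' AND rule = '{rule}')")
    op, left, right = node
    return f"{FUNCS[op]}({to_sql(left)}, {to_sql(right)})"


def count_sql(node):
    """Estimate the size of the segment."""
    return f"SELECT bitmapCardinality({to_sql(node)}) AS user_count"


def members_sql(node):
    """Fetch segment members with GLOBAL IN instead of a join."""
    return (f"SELECT join_id FROM {USER_TABLE} WHERE join_id GLOBAL IN "
            f"(SELECT arrayJoin(bitmapToArray({to_sql(node)})))")
```

For example, `count_sql(("and", leaf("city", "=beijing"), leaf("visits", ">=3")))` yields a single statement that intersects two stored bitmaps and returns only the cardinality, so no user rows are scanned for the estimate.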

Since deployment, the solution has supported over 500 daily segment packages, reaching tens of millions of users per day, and the team plans further extensions to cover all profiling tags, accelerate bitmap generation, and broaden application scenarios.
