Accelerating User Profile Analysis with Hologres RoaringBitmap
The article explains how Hologres RoaringBitmap compresses user ID sets into efficient bitmap indexes, splits 64‑bit IDs into buckets, syncs them from MaxCompute, and enables sub‑second user portrait queries that previously took minutes, dramatically improving performance and scalability.
This article introduces the use of Hologres RoaringBitmap to speed up user portrait analysis, turning multi‑minute queries into sub‑second responses for massive data volumes.
Bitmap principle : A bitmap stores a set of IDs as bits, offering high space and compute efficiency. RoaringBitmap compresses sparse bitmaps by dividing the 32‑bit space into 2^16 chunks and using three container types (Array, Bitmap, Run) to store the low 16 bits.
RoaringBitmap containers :
Array Container – optimal for sparse data (< 4096 values). Bitmap Container – optimal for dense data (> 4096 values). Run Container – uses run‑length encoding for consecutive values.
System architecture : The platform extracts tag and crowd data from MaxCompute, builds bitmap indexes, syncs them to Hologres, and runs analysis queries using RoaringBitmap functions. A DAG scheduler orchestrates daily bucket construction and data synchronization.
UID bucket design : Because Hologres RoaringBitmap functions only support 32‑bit integers, 64‑bit user IDs are split into a high‑level (44 bits) and a low‑level (20 bits). The low‑level part is stored in the bitmap, while the high‑level part determines the bucket, dramatically reducing I/O.
Key SQL snippets :
-- MaxCompute table creation CREATE TABLE IF NOT EXISTS demo_table( field_value STRING COMMENT 'tag value', bucket BIGINT COMMENT 'bucket', bitmap BINARY COMMENT 'uid bitmap' ) PARTITIONED BY (ds STRING, label_id STRING) LIFECYCLE 365; -- Insert bitmap data INSERT OVERWRITE TABLE demo_table PARTITION(ds='${bizdate}', label_id='${label_id}') SELECT COALESCE(${label_field}, 'NULL') AS field_value, SHIFTRIGHT(CAST(COALESCE(${uidField}, '0') AS BIGINT), 20) AS bucket, ENCODE(mc_rb_build_agg(CAST(COALESCE(${uidField}, '0') AS BIGINT) & 1048575), 'utf-8') AS bitmap FROM ${dataSource}.${dataTable} WHERE ds=MAX_PT('${dataSource}.${dataTable}') AND CAST(COALESCE(${uidField}, '0') AS BIGINT) > 0 GROUP BY COALESCE(${label_field}, 'NULL'), SHIFTRIGHT(CAST(COALESCE(${uidField}, '0') AS BIGINT), 20); -- Hologres foreign table for bitmap sync CREATE FOREIGN TABLE IF NOT EXISTS public.foreign_table( field_value TEXT, bucket INT8, bitmap BYTEA, ds TEXT, label_id TEXT ) SERVER odps_server OPTIONS (project_name '${projectName}', table_name 'foreign_table_${index}'); -- Hologres internal table CREATE TABLE IF NOT EXISTS public.holo_table( field_value TEXT, bucket INT8, bitmap ROARINGBITMAP, ds TEXT, label_id TEXT, PRIMARY KEY (ds, label_id, bucket, field_value) ) PARTITION BY LIST(ds); -- Analysis query SELECT ${label_alias} AS "${label_name}", SUM(rb_and_cardinality(t1.bitmap, t2.bitmap)) AS "count" FROM (SELECT bucket, bitmap FROM public.holo_crowd_table WHERE crowd_id='${crowd_id}') t1 JOIN (SELECT field_value, bucket, bitmap FROM public.holo_table WHERE ds='${label_ds}' AND label_id='${label_id}') t2 ON t1.bucket = t2.bucket GROUP BY ${label_alias};Java DTO example used for acceleration configuration:
public class AccelerateConfigDTO { private Boolean isAccelerationCompleted; private Boolean isAccelerationEnabled; private String lastDs; }Performance results : Over 75% of queries finish within 10 seconds, with some cases dropping from minutes to seconds. The bitmap solution accounts for ~60% of total latency, while the remaining time is spent in ODPS (MaxCompute) when the bitmap cannot fully cover the workload.
Conclusion : By leveraging Hologres RoaringBitmap, the platform achieved order‑of‑magnitude speedups for user portrait analysis, reduced resource consumption, and provided a scalable solution for future data growth.
DaTaobao Tech
Official account of DaTaobao Technology
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.