Building a User Profile Platform with ClickHouse at 58.com: Architecture and Optimization
This article describes how 58.com designed and implemented a large‑scale user profiling platform using ClickHouse, covering system overview, core modules, major challenges of scale, complexity and performance, and the detailed storage, query, and optimization techniques applied to meet business needs.
It opens with the positioning and core functions of 58.com's user profiling system, then sets out the challenges it faces: massive data scale, a very large number of tags, complex tag logic, and strict performance requirements.
The platform, named "Wanxiang", consists of two main modules: data aggregation (handling tag lifecycle management, tag production, tag market, and tag operation) and data application (providing fast data services for crowd statistics, crowd analysis, and crowd selection).
To address these challenges, ClickHouse was adopted as the underlying storage engine. The design uses a vertical table model in which each tag value occupies its own row. This layout suits ClickHouse's sparse primary index, and storing the users of each tag value as a bitmap lets crowd queries run as bitmap operations instead of costly table joins.
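The idea of replacing joins with bitmap operations can be illustrated with a minimal sketch. It is not the platform's actual schema: the rows, tag names, and helper functions below are hypothetical, and a Python set stands in for a bitmap.

```python
from collections import defaultdict

# Hypothetical vertical-table rows: one row per (tag, tag_value, user_id).
rows = [
    ("city", "beijing", 1),
    ("city", "beijing", 2),
    ("city", "shanghai", 3),
    ("intent", "rent", 2),
    ("intent", "rent", 3),
    ("intent", "buy", 1),
]

def build_bitmaps(rows):
    """Group user ids by (tag, value); a set plays the role of a bitmap."""
    bitmaps = defaultdict(set)
    for tag, value, user_id in rows:
        bitmaps[(tag, value)].add(user_id)
    return dict(bitmaps)

def select_crowd(bitmaps, conditions):
    """AND together the bitmaps of every (tag, value) condition."""
    sets = [bitmaps.get(cond, set()) for cond in conditions]
    return set.intersection(*sets) if sets else set()

bitmaps = build_bitmaps(rows)
crowd = select_crowd(bitmaps, [("city", "beijing"), ("intent", "rent")])
print(sorted(crowd))  # [2]
```

Each extra condition costs one more intersection over a compact structure, whereas a wide-table or join-based design would have to match rows across tables for every condition.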
Logical storage combines distributed tables for query routing and local tables (MergeTree and AggregatingMergeTree) for data persistence. Data is partitioned by date and tag, and sharded by user code ranges to achieve parallel processing.
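Why range sharding allows parallel processing can be shown with a small sketch. The function names, the range size, and the wrap-around rule are assumptions made for illustration, not details from the talk. Because every user code lives on exactly one shard, each shard can intersect its own slice independently and the coordinator only has to add up the counts.

```python
def shard_for(user_code: int, range_size: int, num_shards: int) -> int:
    """Map a user code to a shard by contiguous range (wrap-around is an assumption)."""
    if user_code < 0:
        raise ValueError("user_code must be non-negative")
    return (user_code // range_size) % num_shards

def partition(user_codes, range_size, num_shards):
    """Split a set of user codes into one set per shard."""
    shards = [set() for _ in range(num_shards)]
    for code in user_codes:
        shards[shard_for(code, range_size, num_shards)].add(code)
    return shards

# Hypothetical tag memberships.
tag_a = {1, 5, 1000, 1500, 2500}
tag_b = {5, 1500, 2500, 2999}

a_shards = partition(tag_a, range_size=1000, num_shards=3)
b_shards = partition(tag_b, range_size=1000, num_shards=3)

# Each shard intersects only its own slice; no cross-shard data movement is needed.
per_shard_counts = [len(a & b) for a, b in zip(a_shards, b_shards)]
print(per_shard_counts)       # [1, 1, 1]
print(sum(per_shard_counts))  # 3, the same as len(tag_a & tag_b)
```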
The storage architecture organizes ClickHouse tables by tag source, with distributed tables on top and per‑shard local tables using AggregatingMergeTree. Data parts are managed to avoid large merges, and the index granularity is set to 128 rows (ClickHouse's default is 8192) so that queries with narrow filter conditions scan far fewer rows.
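The effect of a smaller index granularity can be estimated with a simplified model of a sparse index. This is a sketch, not ClickHouse's implementation: it assumes unique, sorted integer keys and indexes only the first key of every granule, so a lookup must read whole granules.

```python
from bisect import bisect_right

def rows_read(sorted_keys, lo, hi, granularity):
    """Rows scanned to answer lo <= key <= hi when only every
    `granularity`-th key is indexed (keys assumed unique and sorted)."""
    marks = sorted_keys[::granularity]
    # Granule holding lo: the last mark that is <= lo.
    first = max(bisect_right(marks, lo) - 1, 0)
    # Granule holding hi: the last mark that is <= hi.
    last = max(bisect_right(marks, hi) - 1, 0)
    start = first * granularity
    end = min((last + 1) * granularity, len(sorted_keys))
    return end - start

keys = list(range(100_000))

# A narrow lookup that matches only 10 rows.
print(rows_read(keys, 50_000, 50_009, granularity=8192))  # 8192
print(rows_read(keys, 50_000, 50_009, granularity=128))   # 128
```

The trade-off is a larger index: with 128-row granules the model keeps 64 times as many marks as with 8192-row granules.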
Bitmap optimization leverages roaring bitmap structures and bucketed sequential encoding to reduce data size and accelerate calculations.
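A rough sketch of why sequential encoding helps: a roaring bitmap groups 32-bit integers into containers keyed by their high 16 bits, so widely scattered ids each need their own container, while dense ids share one. The helper names and the sample ids below are hypothetical, and the bucketing step mentioned in the talk is not modeled here.

```python
def container_count(ids):
    """Roaring containers needed: one per distinct high-16-bit prefix."""
    return len({i >> 16 for i in ids})

def sequential_encode(raw_ids):
    """Assign dense ids 0..n-1 in sorted order; returns a raw -> dense mapping."""
    return {raw: seq for seq, raw in enumerate(sorted(set(raw_ids)))}

# Hypothetical sparse user ids, spaced more than 65,536 apart.
raw_ids = [i * 1_000_003 for i in range(1, 2001)]

mapping = sequential_encode(raw_ids)
dense_ids = [mapping[r] for r in raw_ids]

print(container_count(raw_ids))    # 2000
print(container_count(dense_ids))  # 1
```

Fewer, fuller containers mean less per-container overhead and faster AND/OR operations, at the cost of maintaining the raw-to-dense mapping.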
Performance optimizations include tuning data part merging, sparse index granularity, and bitmap encoding, plus a two‑stage query workflow (query, then fetch) that offloads result fetching to Spark and so reduces the load on the coordinator node.
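The two-stage workflow can be pictured with a toy model. It assumes, purely for illustration, that the query stage leaves each shard's result on that shard and returns only a count, and that the fetch stage reads the stored results shard by shard; in the article that reader is Spark. The `Shard` class and its methods are hypothetical.

```python
class Shard:
    def __init__(self, tag_bitmaps):
        self.tag_bitmaps = tag_bitmaps  # tag -> set of user ids held by this shard
        self.results = {}               # query_id -> result kept on the shard

    def query(self, query_id, tags):
        """Stage 1: intersect locally, keep the result here, return only its size."""
        sets = [self.tag_bitmaps.get(t, set()) for t in tags]
        result = set.intersection(*sets) if sets else set()
        self.results[query_id] = result
        return len(result)

    def fetch(self, query_id):
        """Stage 2: hand the stored ids directly to the consumer."""
        yield from sorted(self.results.pop(query_id, set()))

shards = [
    Shard({"renter": {1, 2, 3}, "active": {2, 3, 4}}),
    Shard({"renter": {10, 11}, "active": {11, 12}}),
]

# The coordinator only ever sees small integers.
total = sum(s.query("q1", ["renter", "active"]) for s in shards)
print(total)  # 3

# The consumer pulls the full id lists from each shard, bypassing the coordinator.
ids = [uid for s in shards for uid in s.fetch("q1")]
print(ids)  # [2, 3, 11]
```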
The article concludes with the service results achieved and with future plans: further optimization of the query architecture, support for real‑time data analysis, and a closed‑loop data application ecosystem.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.