Building a User Profile Platform with ClickHouse at 58.com: Architecture and Optimization

This article describes how 58.com designed and implemented a large-scale user profiling platform on ClickHouse. It covers the system overview, the core modules, the main challenges of scale, complexity, and performance, and the storage, query, and optimization techniques used to meet business needs.

DataFunSummit

May 18, 2024

The article first introduces 58.com's user profiling system: its positioning, its core functions, and the challenges it faces, namely massive data scale, a very large number of tags, complex tag logic, and strict performance requirements.

The platform, named "Wanxiang", consists of two main modules: data aggregation (handling tag lifecycle management, tag production, tag market, and tag operation) and data application (providing fast data services for crowd statistics, crowd analysis, and crowd selection).

To address these challenges, ClickHouse is adopted as the underlying storage engine. The design uses a vertical table model in which each tag value occupies its own row, so that sparse indexing and bitmap storage can take the place of costly table joins.
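To make the vertical model concrete, here is a minimal sketch in plain Python rather than ClickHouse. The tag names, values, and user ids are hypothetical, and a Python integer stands in for a real bitmap type. It only illustrates why grouping users per (tag, value) row lets a multi-condition crowd selection become a bitwise operation instead of a join.

```python
from collections import defaultdict

# Hypothetical vertical-model rows: (tag_name, tag_value, user_id).
rows = [
    ("city", "beijing", 1),
    ("city", "beijing", 2),
    ("city", "shanghai", 3),
    ("gender", "f", 1),
    ("gender", "f", 3),
    ("gender", "m", 2),
]

def build_bitmaps(rows):
    """Group user ids by (tag, value); a Python int serves as the bitmap."""
    bitmaps = defaultdict(int)
    for tag, value, user_id in rows:
        bitmaps[(tag, value)] |= 1 << user_id  # set bit number user_id
    return dict(bitmaps)

def users_in(bitmap):
    """Decode a bitmap back into a sorted list of user ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

bitmaps = build_bitmaps(rows)

# "city = beijing AND gender = f" is a bitwise AND of two rows, not a join.
crowd = bitmaps[("city", "beijing")] & bitmaps[("gender", "f")]

# "city = beijing OR city = shanghai" is a bitwise OR.
either_city = bitmaps[("city", "beijing")] | bitmaps[("city", "shanghai")]
```

Calling `users_in(crowd)` decodes the selected crowd, and counting the set bits gives the crowd size without materializing the user list.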

Logical storage combines distributed tables for query routing and local tables (MergeTree and AggregatingMergeTree) for data persistence. Data is partitioned by date and tag, and sharded by user code ranges to achieve parallel processing.
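The routing idea can be sketched as follows. This is an illustration under stated assumptions, not the platform's actual code: the shard boundaries, the function names `shard_for` and `route`, and the sample rows are all hypothetical. It shows range-based sharding by user code combined with grouping by a (date, tag) partition key.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical exclusive upper bounds of the user-code range owned by each shard.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def shard_for(user_code):
    """Return the index of the shard whose range contains user_code."""
    if not 0 <= user_code < SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"user_code {user_code} is outside all shard ranges")
    return bisect_right(SHARD_UPPER_BOUNDS, user_code)

def route(rows):
    """Group (date, tag, user_code) rows by shard, then by (date, tag) partition."""
    layout = defaultdict(lambda: defaultdict(list))
    for date, tag, user_code in rows:
        layout[shard_for(user_code)][(date, tag)].append(user_code)
    return layout

rows = [
    ("2024-05-18", "city", 42),
    ("2024-05-18", "city", 1_500_000),
    ("2024-05-18", "gender", 43),
    ("2024-05-17", "city", 3_999_999),
]
layout = route(rows)
```

Because the shard ranges are disjoint, each shard can evaluate a crowd condition on its own users and the per-shard counts can simply be summed, which is what makes the parallel processing straightforward.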

The storage architecture organizes ClickHouse tables by tag source, with distributed tables on top and per-shard local tables that use AggregatingMergeTree. DataParts are managed to avoid large merges, and index granularity is set to 128 rows, well below ClickHouse's default of 8192, so that queries with narrow filter conditions scan less data.
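The effect of index granularity can be illustrated with a toy model of a sparse index. This is a simplified sketch, not ClickHouse's implementation: it assumes a single sorted key column, keeps one mark per granule, and the helper name `rows_scanned` is hypothetical. It compares a granularity of 8192 (the ClickHouse default) with the 128 used here.

```python
from bisect import bisect_left, bisect_right

def rows_scanned(sorted_keys, granularity, key):
    """Rows a sparse index forces us to read for an equality lookup on `key`."""
    marks = sorted_keys[::granularity]  # first key of every granule
    # The key may start inside the granule just before the first mark >= key,
    # and may continue through every granule whose mark equals the key.
    first = max(bisect_left(marks, key) - 1, 0)
    last = bisect_right(marks, key) - 1
    if last < first:
        return 0  # key is smaller than every indexed key
    end_row = min((last + 1) * granularity, len(sorted_keys))
    return end_row - first * granularity

keys = list(range(1_000_000))  # stand-in for a sorted primary-key column

coarse = rows_scanned(keys, 8192, 500_000)  # one 8192-row granule is read
fine = rows_scanned(keys, 128, 500_000)     # one 128-row granule is read
```

The trade-off is that a smaller granularity produces more marks to keep and search (`len(keys[::128])` versus `len(keys[::8192])`), so it pays off mainly when lookups touch few rows, as in single-tag queries.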

Bitmap optimization leverages roaring bitmap structures and bucketed sequential encoding to reduce data size and accelerate calculations.
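Why sequential encoding helps can be seen from how roaring bitmaps are laid out: a 32-bit value is split into its high 16 bits, which select a container, and its low 16 bits, which are stored inside that container. The sketch below is an illustration under assumptions, not the platform's encoder: the modulo bucketing rule and the function names are hypothetical. It counts containers before and after mapping sparse raw ids to dense codes.

```python
def container_count(ids):
    """Roaring-style container count: one per distinct high-16-bit prefix."""
    return len({i >> 16 for i in ids})

def sequential_encode(raw_ids):
    """Map raw ids to dense codes 0..n-1 in sorted order."""
    return {raw: code for code, raw in enumerate(sorted(set(raw_ids)))}

def bucketed_encode(raw_ids, num_buckets):
    """Assign each raw id a (bucket, code) pair; codes are dense within a bucket."""
    next_code = [0] * num_buckets
    mapping = {}
    for raw in sorted(set(raw_ids)):
        bucket = raw % num_buckets  # assumed bucketing rule, for illustration
        mapping[raw] = (bucket, next_code[bucket])
        next_code[bucket] += 1
    return mapping

# 1000 sparse raw ids, each more than 65536 apart, so each needs its own container.
raw_ids = [i * 100_003 for i in range(1000)]
dense = sequential_encode(raw_ids)

containers_before = container_count(raw_ids)
containers_after = container_count(dense.values())
```

With dense codes, the same 1000 users fit into a single container, so both storage size and the cost of AND/OR operations drop. Bucketing keeps each bucket's code space small and lets buckets be processed independently.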

Performance optimizations include tuning DataPart merging and sparse index granularity, optimizing bitmap encoding, and adopting a two-stage query workflow (query + fetch) that offloads result fetching to Spark, which reduces the load on the coordinating node.
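The summary does not spell out how the query and fetch stages are implemented, so the following is only a rough sketch of the general pattern under assumptions. The per-shard results are hypothetical, a thread pool stands in for Spark tasks, and the function names are invented for illustration. The point is that the coordinator handles only small per-shard counts, while the full result is pulled from the shards in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard results of one crowd-selection query.
SHARD_RESULTS = {0: [1, 5, 9], 1: [12, 17], 2: [23]}

def query_stage(shard_ids):
    """Stage 1: each shard keeps its result; the coordinator sees only counts."""
    return {shard: len(SHARD_RESULTS[shard]) for shard in shard_ids}

def fetch_shard(shard):
    """Stage 2 worker: read one shard's result directly, bypassing the coordinator."""
    return SHARD_RESULTS[shard]

def fetch_stage(shard_ids, workers=3):
    """Stage 2: pull shard results in parallel (threads stand in for Spark tasks)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(fetch_shard, shard_ids)
        return sorted(uid for part in parts for uid in part)

shards = sorted(SHARD_RESULTS)
counts = query_stage(shards)     # small payload through the coordinator
crowd_size = sum(counts.values())
user_ids = fetch_stage(shards)   # bulk payload fetched shard by shard
```

In this arrangement the crowd size is available right after the query stage, and the bulk transfer of user ids never has to be merged on a single node.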

The article concludes with the service results achieved and with future plans: further optimizing the query architecture, supporting real-time data analysis, and building a closed-loop data application ecosystem.
