
Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

DataFunTalk

Beike introduced a Data Management Platform (DMP) to unify user data from its apps, business databases, and external sources, enabling personalized push notifications, DSP advertising, in‑app recommendations, search, lead recall, and opportunity guidance.

Why DMP? By tagging users with detailed preferences (e.g., location, price range, house type), the platform can deliver highly targeted content, dramatically improving click‑through rates and conversion.

Overall Architecture – The DMP consists of five layers:

Data collection layer (Hive) gathers app behavior and offline business data.

Data processing layer creates basic, behavior, preference, and predictive datasets.

Application data storage layer uses HBase, ClickHouse, and MongoDB to serve high‑concurrency queries.

Application layer provides tag management, tag marketplace, crowd selection, crowd insight, and crowd expansion.

API layer offers unified, authenticated, rate‑limited services for downstream systems.
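The rate limiting mentioned for the API layer could, for instance, be done with a token bucket per caller. The source does not say which algorithm Beike uses, so the sketch below is only an illustration; the class name `TokenBucket` and its parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second.

    Illustrative only; the article does not describe Beike's actual limiter.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A downstream system would typically get one bucket per API key, so that a single noisy caller cannot exhaust the shared query capacity.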

Key techniques include column pruning, pre‑aggregation, incremental computation, and dedicated Spark clusters to ensure timely processing of 180‑day windows containing billions of records.
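To illustrate how pre-aggregation and incremental computation combine, the sketch below keeps one pre-aggregated counter per day and updates the window total by adding the newest day and subtracting the expired one, instead of rescanning all 180 days. It is a single-machine toy under stated assumptions, not the Spark job described in the article; `SlidingWindowCounts` and the `(user_id, action)` event shape are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Tuple

Event = Tuple[str, str]  # (user_id, action): only the columns needed (column pruning)


class SlidingWindowCounts:
    """Maintains per-(user, action) counts over the last `window_days` days."""

    def __init__(self, window_days: int = 180) -> None:
        self.window_days = window_days
        self.daily: Deque[Counter] = deque()   # one pre-aggregated Counter per day
        self.totals: Counter = Counter()       # running totals over the window

    def add_day(self, events: Iterable[Event]) -> None:
        day = Counter(events)                  # pre-aggregate raw events once
        self.daily.append(day)
        self.totals.update(day)                # incremental: add the new day
        if len(self.daily) > self.window_days:
            expired = self.daily.popleft()
            self.totals.subtract(expired)      # incremental: remove the expired day
            for key in expired:
                if self.totals[key] <= 0:
                    del self.totals[key]       # drop keys that fell out of the window
```

Each daily update touches only two days of aggregates, so the cost no longer grows with the window length.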

Storage Solutions – HBase (SSD-backed) answers key-value queries within milliseconds. ClickHouse with Roaring Bitmap supports crowd-size estimation in seconds and crowd construction in minutes. Redis caches hot user profiles so that API responses return in under 5 ms, supporting up to 800 million calls per day.
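The bitmap approach to crowd selection can be pictured as set algebra over one user-ID set per tag value. In the sketch below, plain Python sets stand in for Roaring Bitmaps (which compress the same information); the tag names, user IDs, and the `select_crowd` helper are all made up for illustration and are not taken from Beike's system.

```python
from typing import Dict, Iterable, Set

# Hypothetical inverted index: tag value -> set of user IDs carrying that tag.
tag_index: Dict[str, Set[int]] = {
    "city=beijing": {1, 2, 3, 5},
    "price=3-5M": {2, 3, 4},
    "type=2-bedroom": {3, 5, 6},
}


def select_crowd(index: Dict[str, Set[int]],
                 all_of: Iterable[str] = (),
                 any_of: Iterable[str] = (),
                 none_of: Iterable[str] = ()) -> Set[int]:
    """Combine tag sets with AND (all_of), OR (any_of), and NOT (none_of)."""
    all_of, any_of = list(all_of), list(any_of)
    if all_of:
        # set.intersection returns a new set, so the index is never mutated.
        crowd = set.intersection(*(index.get(t, set()) for t in all_of))
    else:
        crowd = set().union(*index.values())   # start from every known user
    if any_of:
        crowd &= set().union(*(index.get(t, set()) for t in any_of))
    for t in none_of:
        crowd -= index.get(t, set())
    return crowd


crowd = select_crowd(tag_index,
                     all_of=["city=beijing", "price=3-5M"],
                     none_of=["type=2-bedroom"])
estimated_size = len(crowd)   # crowd estimation is just the cardinality
```

Estimating a crowd only needs the cardinality of the result, which is why estimation can be much faster than materializing the full crowd package.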

Real‑time Profiling – Streaming data from Kafka is aggregated via Spark Streaming, weighted by behavior importance and decay factors, and stored in Redis to power real‑time recommendation and search, yielding 3‑10 % CTR/CVR gains.
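One common way to combine behavior importance with time decay is an exponentially decayed weighted sum. The article does not give Beike's actual formula, so the weights, the half-life, and the function name below are assumptions chosen only to make the idea concrete.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical importance weights per behavior type.
BEHAVIOR_WEIGHTS = {"view": 1.0, "favorite": 3.0, "contact_agent": 5.0}
# Hypothetical half-life: an event loses half its influence after one day.
HALF_LIFE_SECONDS = 24 * 3600

Event = Tuple[float, str, str]  # (unix_timestamp, behavior, preference_value)


def preference_scores(events: Iterable[Event], now: float) -> Dict[str, float]:
    """Score each preference value as sum(weight * exp(-decay_rate * age))."""
    decay_rate = math.log(2) / HALF_LIFE_SECONDS
    scores: Dict[str, float] = defaultdict(float)
    for ts, behavior, value in events:
        age = max(0.0, now - ts)                       # guard against clock skew
        weight = BEHAVIOR_WEIGHTS.get(behavior, 0.0)   # unknown behaviors count as 0
        scores[value] += weight * math.exp(-decay_rate * age)
    return dict(scores)
```

In a streaming setting, each micro-batch would compute such scores for the users it touches and write them to the profile store, so that recent high-intent actions outweigh older browsing.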

Challenges & Solutions – User identities are unified by linking phone numbers and IMEIs with Spark GraphX. Massive data volume is handled through column pruning, pre-aggregation, incremental updates, and isolated queues. Rapid tag rollout is achieved through configuration-driven SQL generation and automated Hive-to-ClickHouse pipelines.
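Identity unification of this kind is usually a connected-components problem: accounts that share a phone number or an IMEI belong to the same person. The article only says Spark GraphX is used, so treating it as connected components is an assumption; the single-machine union-find below is a swapped-in stand-in for the distributed computation, and the record layout is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Record = Tuple[str, Optional[str], Optional[str]]  # (account_id, phone, imei)


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def unify_identities(records: Iterable[Record]) -> List[List[str]]:
    """Group account IDs that are linked by a shared phone number or IMEI."""
    uf = UnionFind()
    for account, phone, imei in records:
        node = ("account", account)
        uf.find(node)                       # register accounts with no identifiers
        if phone:
            uf.union(node, ("phone", phone))
        if imei:
            uf.union(node, ("imei", imei))
    groups = defaultdict(set)
    for kind, value in list(uf.parent):
        if kind == "account":
            groups[uf.find((kind, value))].add(value)
    return sorted(sorted(g) for g in groups.values())
```

Prefixing each node with its kind (`"account"`, `"phone"`, `"imei"`) keeps an IMEI from colliding with a phone number that happens to have the same digits.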

Results – The platform covers over 4 billion users and offers 1,300+ tags with 60% preference coverage. It generates 200-500 crowd packages daily and serves API calls at about 5 ms latency with a 99.9% SLA, with significant business impact across the Beike ecosystem.

Future Work – Improve data accuracy and coverage, implement full user‑lifecycle management, and continue expanding real‑time profiling to further enhance personalized services.

Tags: Data Engineering, big data, real-time analytics, Hive, user profiling, DMP, Tagging System
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
