
Design and Implementation of a User Data Warehouse and Profiling System at 58.com

This article details the design and implementation of a user data warehouse at 58.com, covering data warehouse fundamentals, user profiling concepts, multi‑layer architecture, modeling methods, ETL migration from Hive to Spark, data quality assurance, and the resulting achievements.

DataFunSummit

The presentation introduces the speaker, Bao Lei, a senior data R&D engineer at 58.com, and outlines three main parts: an overview of data warehouses and user profiling, the construction process of the user profiling data warehouse, and the final outcomes.

It explains that a data warehouse is a subject‑oriented, integrated, relatively stable (non‑volatile) collection of data that preserves historical change over time. By organizing data by subject domain (e.g., traffic), the warehouse enables fast, consistent access, high‑quality output, rapid response to business changes, data security, timely services, and better decision‑making.

User profiling is described as the process of labeling users with statistical or algorithmic tags derived from their behavior and attributes, to support personalized recommendation and precision marketing. The speaker emphasizes that a complete big‑data application ecosystem is needed to collect and exploit user behavior data.
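To make the idea of "labeling users with statistical tags" concrete, here is a minimal, self‑contained sketch of rule‑based tag derivation; the field names, thresholds, and tag names are hypothetical illustrations, not 58.com's actual definitions.

```python
# Minimal sketch of rule-based tag derivation from a user's behavior
# and attribute record (all fields and thresholds are hypothetical).
def derive_tags(user: dict) -> list[str]:
    """Map a user's behavior/attribute record to a list of tags."""
    tags = []
    if user.get("job_views_30d", 0) >= 10:   # frequent job-page views
        tags.append("active_job_seeker")
    if user.get("listings_posted_30d", 0) >= 1:  # posted at least one listing
        tags.append("publisher")
    if user.get("city"):                      # attribute-based tag
        tags.append(f"city:{user['city']}")
    return tags

print(derive_tags({"job_views_30d": 12, "city": "Beijing"}))
# → ['active_job_seeker', 'city:Beijing']
```

In production such rules would run as SQL or Spark jobs over behavior tables rather than per‑record Python, but the mapping from behavior to tag is the same.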

The construction of the profiling warehouse follows a four‑layer architecture (ODS → DWD → DWS → APP). Each layer progressively refines raw data into detailed, aggregated, and application‑ready tables. The process includes cleaning up legacy tag definitions, standardizing data models, and establishing design, technical review, development, testing, and release stages.
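The four‑layer flow can be sketched end to end on toy in‑memory data; the table shapes, field names, and cleaning rule below are illustrative assumptions, not the actual warehouse schemas.

```python
# Sketch of the ODS → DWD → DWS → APP refinement on toy data
# (all table and field names are hypothetical).
from collections import Counter

ods = [  # ODS: raw click logs as landed, including dirty rows
    {"uid": "u1", "page": "job_detail", "ts": "2023-01-01 10:00:00"},
    {"uid": "u1", "page": "job_detail", "ts": "2023-01-01 11:00:00"},
    {"uid": "",   "page": "home",       "ts": "2023-01-01 12:00:00"},
]

# DWD: cleaned detail layer — drop rows with a missing user id
dwd = [row for row in ods if row["uid"]]

# DWS: aggregated layer — daily page views per user
dws = Counter(row["uid"] for row in dwd)

# APP: application-ready tag table consumed by downstream systems
app = {uid: {"pv_1d": pv, "active": pv >= 2} for uid, pv in dws.items()}
print(app)  # → {'u1': {'pv_1d': 2, 'active': True}}
```

Each layer only reads from the layer below it, which is what keeps the lineage clear and lets tags in the APP layer be rebuilt from detail data when definitions change.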

Four data modeling approaches are compared (ER, Data Vault, Anchor, dimensional modeling). Dimensional modeling is chosen for its suitability to low‑cost, high‑efficiency tag production in profiling scenarios.
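The appeal of dimensional modeling here is that a wide fact table joined to small dimension tables answers most tag and metric questions cheaply; a minimal sketch, with hypothetical fact and dimension schemas:

```python
# Sketch of a star-schema query: a fact table of view events joined
# to an item dimension to count views per category (schemas hypothetical).
fact_views = [          # fact table: one row per view event
    {"uid": "u1", "item_id": 101},
    {"uid": "u2", "item_id": 101},
    {"uid": "u1", "item_id": 202},
]
dim_item = {            # dimension table keyed by surrogate item_id
    101: {"category": "jobs"},
    202: {"category": "housing"},
}

views_by_category: dict[str, int] = {}
for row in fact_views:
    cat = dim_item[row["item_id"]]["category"]  # dimension lookup (join)
    views_by_category[cat] = views_by_category.get(cat, 0) + 1
print(views_by_category)  # → {'jobs': 2, 'housing': 1}
```

ER and Data Vault models normalize more aggressively, which aids integration but makes this kind of ad‑hoc tag production slower to write and to run.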

ETL tasks originally built with Hive SQL suffered from low performance and heavy Yarn scheduling pressure. Migrating to Spark reduced container usage by ~90% and significantly shortened job runtimes.
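A migration of this kind typically keeps the SQL largely intact and swaps the execution engine, e.g. replacing `hive -f` with `spark-sql` on Yarn. The sketch below is hypothetical; the resource numbers are illustrative, not the settings from the talk.

```shell
# Hypothetical sketch: running the same ETL SQL on Spark instead of Hive.
# Spark reuses long-lived executors across stages, so Yarn container
# counts stay far below the per-stage MapReduce containers that a
# multi-stage Hive SQL job would launch.
spark-sql \
  --master yarn \
  --num-executors 20 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=200 \
  -f user_tag_etl.sql
```

Because most Hive SQL runs unchanged on Spark SQL, the migration cost is dominated by testing and tuning rather than rewriting logic.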

Data quality assurance is achieved through source‑level quality inspection tools, SLA monitoring of core metrics, and coverage/accuracy checks for generated tags, ensuring reliable and accurate outputs.
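A coverage check of the kind described can be sketched as a ratio of tagged users to users expected to carry the tag; the function and the SLA threshold below are illustrative assumptions.

```python
# Sketch of a tag coverage check (threshold and names are hypothetical):
# coverage = users actually carrying the tag / users expected to carry it.
def tag_coverage(tagged_uids: set, expected_uids: set) -> float:
    if not expected_uids:
        return 1.0  # nothing expected → vacuously covered
    return len(tagged_uids & expected_uids) / len(expected_uids)

coverage = tag_coverage({"u1", "u2", "u3"}, {"u1", "u2", "u3", "u4"})
print(f"coverage = {coverage:.0%}")  # → coverage = 75%
if coverage < 0.95:  # illustrative SLA threshold
    print("ALERT: tag coverage below SLA")
```

Accuracy checks work the same way against a labeled sample, and both metrics feed the SLA monitoring of core outputs.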

Finally, the profiling warehouse now covers eight subject domains, contains 102 tables, defines 312 metrics, and runs 256 ETL tasks, demonstrating a clear architecture, organized data, and strong support for business analytics.

Tags: Big Data · Data Modeling · Data Warehouse · User Profiling · ETL · Spark
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
