Architecture and Practices of Zhihu DMP System Based on Doris
This article presents a comprehensive overview of Zhihu's Data Management Platform (DMP), covering its business background, three core business modes, detailed architecture, offline and real‑time data pipelines, feature storage design, performance optimization techniques, and future iteration directions.
The presentation introduces Zhihu DMP and explains its business background and the need for a customized data platform to support internal operations. It is organized around four topics: background, architecture and implementation, challenges and solutions, and future outlook.
Three business modes are described—external‑to‑internal, internal‑to‑external, and internal closed‑loop—each supporting scenarios such as feed recommendation, advertising, detail‑page prompts, activity platforms, push systems, and external ad delivery.
Core functional requirements focus on audience management, including audience integration, targeting, and insight capabilities.
The platform architecture is divided into external modules (high‑availability APIs, simple front‑end, configurable back‑end) and business modules (audience selection, insight, ID‑mapping, feature production, storage, and AB‑testing), forming four major functional blocks.
Data flows through two pipelines. Offline, Spark batch jobs generate tag tables in Hive, and an ID‑mapping step then converts user identifiers into unified user IDs before the results are loaded into Doris. In real time, Flink streams produce live tags and apply the same mapping. Tags are indexed in Elasticsearch for fast lookup, while Doris stores the final user‑tag and ID‑mapping tables.
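The ID‑mapping step can be illustrated with a minimal Python sketch. It assumes that mapping means assigning each raw identifier a dense integer ID, which is what bitmap storage generally favors; the summary does not spell out Zhihu's actual scheme. The names IdMapper and map_tag_rows and the sample rows are hypothetical, and the real pipelines run in Spark and Flink rather than in a single process.

```python
from itertools import count


class IdMapper:
    """Hypothetical helper: gives every (id_type, raw_id) pair a dense integer ID."""

    def __init__(self):
        self._next_id = count(1)   # IDs start at 1 and grow without gaps
        self._ids = {}             # (id_type, raw_id) -> unified integer ID

    def get_id(self, id_type, raw_id):
        key = (id_type, raw_id)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]


def map_tag_rows(rows, mapper):
    """Turn (id_type, raw_id, tag_group, tag_value) rows into ((group, value), uid) pairs."""
    return [
        ((group, value), mapper.get_id(id_type, raw_id))
        for id_type, raw_id, group, value in rows
    ]


# The same mapper serves both pipelines, so a user keeps one ID everywhere.
mapper = IdMapper()
offline_rows = [
    ("device", "d-001", "gender", "female"),
    ("device", "d-002", "gender", "male"),
]
realtime_rows = [("device", "d-001", "active_today", "yes")]

offline_pairs = map_tag_rows(offline_rows, mapper)
realtime_pairs = map_tag_rows(realtime_rows, mapper)
```

Sharing one mapper between the batch and streaming paths is the point of the sketch: a tag produced in real time for device d-001 lands on the same unified ID as its offline tags.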
Performance challenges in audience targeting are addressed through bitmap inverted indexes, converting logical conditions to bitmap operations, and a “divide‑and‑conquer” strategy that leverages Doris colocate groups and newer bitmap functions to reduce network I/O and improve query speed.
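The techniques named above can be sketched in plain Python. This is an illustration under stated assumptions, not Doris code: Python integers stand in for bitmaps, the tag names and the condition format are made up, and sharding by user ID modulo the shard count is only an analogy for keeping each user's data on one node so that every node returns a small count instead of shipping a full bitmap over the network.

```python
from collections import defaultdict


def build_inverted_index(user_tags):
    """Map each tag to a bitmap (a Python int) whose bit `uid` is set for tagged users."""
    index = defaultdict(int)
    for uid, tags in user_tags.items():
        for tag in tags:
            index[tag] |= 1 << uid
    return index


def evaluate(cond, index, universe):
    """Translate a nested condition into bitmap AND / OR / AND-NOT operations."""
    op = cond[0]
    if op == "tag":
        return index.get(cond[1], 0)
    if op == "not":
        return universe & ~evaluate(cond[1], index, universe)
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op}")
    parts = [evaluate(child, index, universe) for child in cond[1:]]
    result = parts[0]
    for part in parts[1:]:
        result = result & part if op == "and" else result | part
    return result


def shard_users(user_tags, n_shards):
    """Split users by uid % n_shards so each shard holds all tags of its users."""
    shards = [dict() for _ in range(n_shards)]
    for uid, tags in user_tags.items():
        shards[uid % n_shards][uid] = tags
    return shards


def estimate_audience(user_tags, cond, n_shards=2):
    """Evaluate the condition per shard and add up the per-shard counts."""
    total = 0
    for shard in shard_users(user_tags, n_shards):
        index = build_inverted_index(shard)
        universe = 0
        for uid in shard:
            universe |= 1 << uid
        matched = evaluate(cond, index, universe)
        total += bin(matched).count("1")   # cardinality of the shard-local bitmap
    return total


# Hypothetical sample data: unified user ID -> set of tags.
users = {
    1: {"gender:female", "city:beijing"},
    2: {"gender:male", "city:beijing"},
    3: {"gender:female", "city:shanghai"},
    4: {"gender:female", "city:beijing", "churned"},
}
# Female users in Beijing who have not churned.
condition = (
    "and",
    ("tag", "gender:female"),
    ("tag", "city:beijing"),
    ("not", ("tag", "churned")),
)
audience_size = estimate_audience(users, condition, n_shards=2)
```

Because every user lives in exactly one shard, the per-shard counts can simply be summed, and the result is the same regardless of how many shards are used.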
Optimizations achieve sub‑second audience estimation and minute‑level audience selection, meeting operational goals. Future work includes tighter integration of business modules, enhanced A/B testing, automated SQL rewriting for complex queries, and faster data ingestion by writing directly to Doris tablets via Spark.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.