Design and Evolution of Zhihu's Event‑Tracking (埋点) System
This article presents a comprehensive overview of Zhihu's event‑tracking system, covering its evolution from early Hadoop‑based pipelines to cloud‑native architectures, detailing toolsets for requirement management, validation, data collection, querying, and service design, and concluding with a practical Q&A on best practices and optimization.
The talk introduces the concept of event‑tracking (埋点) as a core data source in the era of big data and AI, emphasizing its growing importance for data collection, storage, and analysis across marketing, product optimization, and user profiling.
It then provides an overview of the tracking toolchain, which includes requirement‑management tools, validation tools, data‑collection tools, and query tools, each serving to improve efficiency and data quality.
In the requirement‑management section, Zhihu's platform evolution from version 1.0 to 2.0 is described, highlighting cost reduction, workflow simplification, and an intelligent feature that auto‑generates tracking code and routes requirements to responsible owners.
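The auto‑generation step described above can be sketched as a small template generator: a structured requirement record is turned into a ready‑to‑paste tracking call and carries an owner field for routing. The record fields, the `track` call shape, and the names below are illustrative assumptions, not Zhihu's actual platform schema.

```python
from dataclasses import dataclass

# Hypothetical requirement record; field names are illustrative,
# not the actual schema of Zhihu's requirement-management platform.
@dataclass
class TrackingRequirement:
    event_name: str
    page: str
    properties: dict   # property name -> expected type, as a string
    owner: str         # responsible engineer the platform routes to

def generate_tracking_snippet(req: TrackingRequirement) -> str:
    """Emit a client-side tracking call from a requirement definition."""
    props = ", ".join(f'"{k}": {v!r}' for k, v in req.properties.items())
    return f'track("{req.event_name}", {{{props}}})  # page: {req.page}'

req = TrackingRequirement(
    event_name="answer_click",
    page="question_detail",
    properties={"answer_id": "string", "position": "int"},
    owner="alice",
)
print(generate_tracking_snippet(req))
```

In a real platform the generated snippet would target the client SDK's actual API, and the `owner` field would drive the requirement‑routing workflow.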
The validation segment explains the shift from manual packet capture to a platform‑based verification solution, noting technical upgrades such as cloud‑native high‑availability architecture, message‑queue middleware, and rapid test‑report generation.
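The core of platform‑based verification, as opposed to manual packet capture, is checking each captured event against the schema registered with its requirement. A minimal sketch, assuming events arrive as JSON‑like dicts and the registry maps event names to expected property types (all names here are invented for illustration):

```python
# Hypothetical registry entry, keyed by event name; in a real platform
# this would be populated from the requirement-management system.
EXPECTED = {
    "answer_click": {"answer_id": str, "position": int},
}

def validate_event(event: dict) -> list:
    """Return a list of validation errors (empty means the event passes)."""
    schema = EXPECTED.get(event.get("event"))
    if schema is None:
        return [f"unknown event: {event.get('event')!r}"]
    errors = []
    props = event.get("properties", {})
    for key, typ in schema.items():
        if key not in props:
            errors.append(f"missing property: {key}")
        elif not isinstance(props[key], typ):
            errors.append(f"wrong type for {key}: expected {typ.__name__}")
    return errors

ok = {"event": "answer_click", "properties": {"answer_id": "a1", "position": 3}}
bad = {"event": "answer_click", "properties": {"answer_id": "a1"}}
print(validate_event(ok))   # []
print(validate_event(bad))  # ['missing property: position']
```

Aggregating these per‑event results is what enables the rapid test‑report generation mentioned above.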
Data‑collection improvements are detailed: the 1.0 pipeline, built on Python code, local buffers, and Kafka, suffered from high latency, while the 2.0 redesign adopts a modular, multi‑queue approach that cuts processing time to 1/15 of the original and keeps latency under 30 ms.
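The modular, multi‑queue idea can be illustrated with standard‑library queues and threads: each stage owns its input queue, so a slow downstream stage buffers rather than blocking ingestion. This is a toy sketch of the pattern, not Zhihu's implementation; the stage names are invented, and a production pipeline would write to Kafka or storage instead of a list.

```python
import queue
import threading

# Each pipeline stage has its own queue, decoupling stages from each other.
raw_q = queue.Queue()     # raw messages from clients
parsed_q = queue.Queue()  # structured events ready for the sink
sink = []                 # stand-in for Kafka / storage
STOP = object()           # sentinel to shut the pipeline down

def parse_stage():
    """Normalize raw messages into structured events."""
    while (item := raw_q.get()) is not STOP:
        parsed_q.put({"event": item.strip().lower()})
    parsed_q.put(STOP)  # propagate shutdown downstream

def sink_stage():
    """Consume structured events; in production, write them out."""
    while (item := parsed_q.get()) is not STOP:
        sink.append(item)

threads = [threading.Thread(target=parse_stage),
           threading.Thread(target=sink_stage)]
for t in threads:
    t.start()
for msg in ["  PageView ", "Click"]:
    raw_q.put(msg)
raw_q.put(STOP)
for t in threads:
    t.join()
print(sink)  # [{'event': 'pageview'}, {'event': 'click'}]
```

Because stages only touch queues, adding, removing, or scaling a stage does not require changes elsewhere, which is the maintainability win of the modular redesign.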
For data querying, the system offers a web‑API layer built on Doris for high‑throughput dimensional queries and Presto on Hive for both batch and real‑time analytics, enabling analysts and product teams to retrieve insights quickly.
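A typical high‑throughput dimensional query from such a web‑API layer is a grouped rollup over the event table. The sketch below uses SQLite purely as a stand‑in so the example is self‑contained (Doris itself is queried over the MySQL protocol); the table and column names are illustrative assumptions.

```python
import sqlite3

# SQLite stands in for Doris here so the example runs anywhere;
# the table/column names are illustrative, not Zhihu's warehouse schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event TEXT, page TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("click", "home", 1), ("click", "home", 2), ("view", "search", 3)],
)

# A dimensional rollup: event counts grouped by page and event type.
rows = conn.execute(
    "SELECT page, event, COUNT(*) AS cnt "
    "FROM events GROUP BY page, event ORDER BY cnt DESC"
).fetchall()
print(rows)  # [('home', 'click', 2), ('search', 'view', 1)]
```

The same SQL shape works against Doris for interactive dimensional queries or against Presto over Hive tables for larger batch scans; the web‑API layer's job is to route each request to the appropriate engine.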
The data‑service architecture is explained as a three‑pillar design—data integration, logical modeling, and cloud‑native deployment—addressing heterogeneous source integration, reusable logical models, full‑link data lineage, and protection against schema changes.
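The schema‑change protection afforded by logical modeling can be sketched as an indirection layer: consumers request logical field names, and a mapping resolves them to physical columns, so a physical rename only updates the mapping. All model and column names below are invented for illustration.

```python
# Sketch of a logical-model layer: consumers query logical field names,
# and this mapping shields them from physical schema changes.
# All names are illustrative, not Zhihu's actual models.
LOGICAL_MODEL = {
    "user_id": "dwd_events.uid",       # physical column can be renamed
    "event_time": "dwd_events.ts_ms",  # without breaking consumers
}

def resolve(logical_fields):
    """Translate logical field names into physical column references."""
    return [LOGICAL_MODEL[f] for f in logical_fields]

print(resolve(["user_id", "event_time"]))
# ['dwd_events.uid', 'dwd_events.ts_ms']
```

The same mapping is also a natural place to record full‑link lineage, since every logical field explicitly names the physical source it derives from.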
The presentation concludes with a Q&A covering topics such as who should own parameter design, client‑side versus server‑side session reporting, the characteristics of a good tracking system, how tracking versions relate to one another, and cost‑optimization strategies based on lifecycle management of tracking versions and warehouse tables.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.