Real‑time Traffic Analysis at Alibaba: Challenges, Technical Choices, and ClickHouse Architecture
This article explains how Alibaba's traffic analysis platform evolved to require real‑time analytics, outlines the business background, data model and metric system, discusses the difficulties of big‑data processing, and describes why ClickHouse was chosen and how its features solve those challenges.
The presentation, hosted by DataFunTalk and featuring Alibaba senior data product expert Jason Xu, introduces the development of real‑time traffic analysis for the Taobao platform, describing the business background and the need for timely, granular insights.
It defines traffic analysis as a combination of a low‑level event data model and a comprehensive metric system covering scale (UV, PV), engagement (duration, depth), conversion (actions, purchases), and stickiness, emphasizing the necessity of flexible dimension slicing for diverse business units.
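To make the idea of flexible dimension slicing concrete, here is a minimal Python sketch that computes a few of these metrics from raw event rows. The event fields (user_id, channel, device, duration_s, purchased) and the slice_metrics helper are hypothetical and chosen for illustration; they are not the schema or code described in the talk.

```python
from collections import defaultdict

# Hypothetical event rows; field names are illustrative assumptions.
EVENTS = [
    {"user_id": "u1", "channel": "search", "device": "ios", "duration_s": 30, "purchased": False},
    {"user_id": "u1", "channel": "search", "device": "ios", "duration_s": 50, "purchased": True},
    {"user_id": "u2", "channel": "search", "device": "android", "duration_s": 20, "purchased": False},
    {"user_id": "u3", "channel": "feed", "device": "ios", "duration_s": 10, "purchased": False},
]


def slice_metrics(events, dimensions):
    """Group events by the given dimensions and compute scale,
    engagement, and conversion metrics for each group."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        groups[key].append(event)

    result = {}
    for key, rows in groups.items():
        users = {r["user_id"] for r in rows}
        buyers = {r["user_id"] for r in rows if r["purchased"]}
        result[key] = {
            "pv": len(rows),                     # page views: one per event
            "uv": len(users),                    # unique visitors
            "avg_duration_s": sum(r["duration_s"] for r in rows) / len(rows),
            "conversion": len(buyers) / len(users),
        }
    return result


by_channel = slice_metrics(EVENTS, ["channel"])
by_channel_device = slice_metrics(EVENTS, ["channel", "device"])
```

The same function serves any business unit by changing only the dimensions argument, which is the property the metric system needs from the underlying event model.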
The article then outlines the major challenges faced: data timeliness (moving from T+1 to real‑time), lack of generic metric standards across channels, heavy OLAP workloads causing analyst bottlenecks, and the need to combine traffic and business data for comprehensive analysis.
To address these issues, the team evaluated technical options and selected ClickHouse for its column‑store architecture and high compression, and for features such as materialized views, the AggregatingMergeTree engine, quotas, primary indexes, dictionaries, external tables, and JSONAsString support. Together these enable both high‑performance queries and flexible schema extension.
Specific solutions include using materialized views to build wide tables that cover 60‑70% of queries, using AggregatingMergeTree to store pre‑aggregated (semi‑processed) data, applying quotas and the primary index for resource control, and using dictionaries and external tables for efficient joins and cross‑database queries.
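The value of storing semi‑processed data is that partial aggregate states can be merged later without rescanning raw events. The following Python sketch illustrates that idea only; it is not ClickHouse's implementation, and the PartialState class and build_state helper are hypothetical names. It keeps an exact set of users, whereas a real engine may use more compact or approximate states.

```python
from dataclasses import dataclass, field


@dataclass
class PartialState:
    """Intermediate aggregation state for one batch of events."""
    pv: int = 0
    users: set = field(default_factory=set)

    def add(self, user_id):
        # Fold one raw event into the state.
        self.pv += 1
        self.users.add(user_id)

    def merge(self, other):
        # Combine two states without access to the raw events.
        return PartialState(self.pv + other.pv, self.users | other.users)

    def finalize(self):
        # Turn the state into the final metric values.
        return {"pv": self.pv, "uv": len(self.users)}


def build_state(user_ids):
    state = PartialState()
    for user_id in user_ids:
        state.add(user_id)
    return state


batch_1 = build_state(["u1", "u2", "u1"])  # e.g. events from one ingest batch
batch_2 = build_state(["u2", "u3"])        # events from a later batch
merged = batch_1.merge(batch_2)
totals = merged.finalize()
```

Note that PV can be summed across batches, but UV cannot: each batch has two unique users, yet the merged state has three, because u2 appears in both. This is why the state, not the final number, has to be stored.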
Product considerations focus on delivering high‑accuracy queries via materialized views, handling recent data with TTL policies, and providing flexible query capabilities through MergeTree‑family engines, dictionaries, and external table integrations, which reduces operational overhead and improves cost efficiency.
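As a rough illustration of why dictionaries make joins cheap, the sketch below enriches traffic events with a business attribute through an in‑memory key lookup instead of a table‑to‑table join. It is a conceptual Python analogy, not ClickHouse code; the ITEM_CATEGORY mapping, the item_id field, and the enrich helper are assumptions made for this example.

```python
# Hypothetical dimension data held in memory, keyed by item id.
ITEM_CATEGORY = {101: "apparel", 102: "electronics"}


def enrich(events, lookup, key, new_field, default="unknown"):
    """Attach a dimension attribute to each event via a key lookup,
    falling back to a default when the key is missing."""
    return [{**event, new_field: lookup.get(event[key], default)} for event in events]


traffic_events = [
    {"user_id": "u1", "item_id": 101},
    {"user_id": "u2", "item_id": 102},
    {"user_id": "u3", "item_id": 999},  # no matching dimension row
]

enriched = enrich(traffic_events, ITEM_CATEGORY, "item_id", "category")
```

Each event costs one hash lookup, and a missing key yields a default rather than dropping the row, which keeps traffic totals intact when business data is incomplete.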
The talk concludes with a summary of the ClickHouse capabilities that support large‑scale, real‑time traffic analysis and invites the audience to join the DataFunTalk community for further discussions.