ByteDance Event‑Tracking Data Cost Governance Practices
This article describes ByteDance's approach to governing the massive volume of event‑tracking (埋点) data it collects. It covers the background, cost‑reduction strategies, an experience review, future plans, and a Q&A session, together illustrating how systematic data governance can dramatically cut storage and processing costs.
As the business expands, the volume of event‑tracking (埋点) data reported by applications grows dramatically, driving up compute and storage costs while making it harder to extract business value; systematic governance of this data therefore becomes essential.
Background : In ByteDance's data pipeline, SDKs on various devices send raw logs to a collection service, which aggregates them into real‑time topics. These topics undergo real‑time ETL (cleaning, distribution, standardization) before being consumed by downstream systems such as real‑time analytics, offline warehouses, user‑behavior analysis, recommendation engines, and A/B testing. Peak traffic can exceed 100 million events per second, with daily increments of 10 trillion records, resulting in over 10 PB of HDFS data per day.
These massive volumes cause resource, cost, and SLA challenges. For example, a past HDFS delivery bottleneck was alleviated by promptly deleting unused or low‑priority events through the governance mechanism.
Governance Strategy :
Control incremental data first, then treat existing data – by requiring new events to be registered on an "allowed list" before they can be reported, enforced both at the SDK level and in real‑time ETL.
Reduce useless event reporting – identify low‑value, high‑cost events through offline query usage, real‑time routing, and UBA checks, then remove them from the allowed list.
Classify events by importance – assign priority levels (P0, P1, P2, …) and provide differentiated TTL and SLA guarantees, with the priority metadata injected during real‑time ETL.
Support sampling – configure sampling ratios for events that do not need full‑volume reporting, applied at SDK and ETL stages.
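The four strategies above can be sketched as a single check applied in the real‑time ETL stage. This is a minimal illustration, not ByteDance's actual implementation: the registry contents, field names, and sampling scheme are all assumptions made for the example.

```python
import hashlib
from typing import Optional

# Illustrative allowed-list registry: event name -> governance metadata.
# A real system would load this from a metadata service, not hard-code it.
EVENT_REGISTRY = {
    "app_launch":   {"priority": "P0", "ttl_days": 365, "sample_rate": 1.0},
    "page_view":    {"priority": "P1", "ttl_days": 90,  "sample_rate": 1.0},
    "scroll_depth": {"priority": "P2", "ttl_days": 30,  "sample_rate": 0.1},
}

def process_event(event: dict) -> Optional[dict]:
    """Apply allowed-list, sampling, and priority tagging to one event.

    Returns the enriched event, or None if it should be dropped.
    """
    meta = EVENT_REGISTRY.get(event.get("name"))
    if meta is None:
        return None  # not on the allowed list: drop at ETL

    # Deterministic sampling keyed on device id, so a given device's
    # events are kept or dropped consistently across the pipeline.
    rate = meta["sample_rate"]
    if rate < 1.0:
        bucket = int(hashlib.md5(event["device_id"].encode()).hexdigest(), 16) % 10_000
        if bucket >= rate * 10_000:
            return None

    # Inject priority metadata for downstream TTL / SLA routing.
    event["priority"] = meta["priority"]
    event["ttl_days"] = meta["ttl_days"]
    return event
```

Removing an event from the allowed list (strategy 2) then amounts to deleting its registry entry, after which the ETL drops it automatically.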
These mechanisms have been deployed at a scale of over 1 billion events per second and more than 500,000 metadata entries, reclaiming over 100 PB of HDFS storage and yielding cumulative savings in the billions of yuan, with ongoing savings of millions of yuan per year.
Experience Review : The governance journey highlighted the need to first control new events, then clean existing ones; to provide clear 3W1H guidance to product teams; and to evolve from removing useless events to prioritization and sampling. Metrics such as total event volume, cost per event, useless‑event ratio, and event density help decide when to trigger governance.
Automated governance was introduced, where the system monitors metric changes, flags potentially problematic events, and notifies owners for confirmation. Two modes exist: supervised (large‑scale services with dedicated oversight) and unsupervised (small services fully managed by the system), both delivering significant cost reductions.
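The automated flagging step might look like the rule‑based check below. The metric names, thresholds, and flagging rules are invented for illustration; the talk only describes the mechanism at a high level.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EventStats:
    name: str
    daily_volume: int           # records per day
    daily_cost_yuan: float      # allocated compute + storage cost
    query_count_30d: int        # offline queries touching this event
    downstream_consumers: int   # real-time routes + UBA usage

def flag_for_governance(stats: EventStats,
                        cost_threshold: float = 1_000.0,
                        min_usage: int = 1) -> Optional[str]:
    """Return a reason string if the event looks like a governance candidate."""
    if stats.query_count_30d == 0 and stats.downstream_consumers == 0:
        return "useless: no offline queries or downstream consumers in 30 days"
    if stats.daily_cost_yuan > cost_threshold and stats.query_count_30d < min_usage:
        return "low value, high cost: consider sampling or priority downgrade"
    return None
```

In supervised mode, flagged events would be routed to their owners for confirmation; in unsupervised mode, the system would act on the flags directly.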
Planning & Outlook : Future work will (1) link governance outcomes to resource allocation decisions, (2) recommend personalized remediation plans based on business‑specific data characteristics, and (3) extend governance to abnormal and low‑quality data, further improving overall data quality.
Q&A :
Lineage (offline and real‑time) is built by parsing ETL/queries and propagating metadata through the data‑flow graph.
Cost is calculated by allocating CPU, storage, and other resource consumption to each event record.
Event‑to‑metric relationships are maintained via the same lineage analysis.
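The per‑event cost attribution described above could be implemented roughly as follows. The allocation keys (record count for CPU, retained bytes for storage) are plausible assumptions; the talk does not disclose ByteDance's internal pricing model.

```python
def allocate_event_costs(events: dict, total_cpu_cost: float,
                         total_storage_cost: float) -> dict:
    """Split pipeline-wide CPU and storage costs across event types.

    CPU cost is allocated proportionally to record count (processing is
    roughly per-record); storage cost proportionally to bytes retained.
    `events` maps event name -> {"records": int, "bytes": int}.
    """
    total_records = sum(e["records"] for e in events.values()) or 1
    total_bytes = sum(e["bytes"] for e in events.values()) or 1
    costs = {}
    for name, e in events.items():
        cpu_share = total_cpu_cost * e["records"] / total_records
        storage_share = total_storage_cost * e["bytes"] / total_bytes
        costs[name] = round(cpu_share + storage_share, 2)
    return costs
```

The resulting cost‑per‑event figures feed directly into the "low value, high cost" judgments used during governance.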
The presented governance practices are currently used internally at ByteDance and are also offered externally through the Volcano Engine DataLeap suite, a one‑stop data‑platform solution for enterprises.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.