Didi's Big Data Cost Governance Practices and Framework
This article presents Didi's comprehensive big data cost governance approach, detailing the overall framework, data system architecture, asset management platform, Hadoop and Elasticsearch cost‑control practices, metadata‑driven optimization, and organizational insights for effective resource and budget management.
The talk is organized into four parts: the overall cost governance framework, Hadoop cost governance, Elasticsearch (ES) cost governance, and practical lessons learned.
Didi's data system consists of a data engine layer (offline, real‑time, OLAP, NoSQL, log retrieval, data pipelines), a data computation layer powered by the self‑developed "Data Dream Factory" platform, a data service layer with the "ShuYi" consumption platform, and a top layer for data science applications. Metadata‑centric products such as the Data Map and Asset Management Platform manage the entire data lifecycle.
The Asset Management Platform evaluates data usage across six health scores—compute, storage, security, quality, model, and value—aggregating them into a DataRank score; cost governance focuses on improving compute and storage health.
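The aggregation into a DataRank score can be sketched as a weighted average of the six health scores. This is a minimal sketch; the weights and the averaging scheme are illustrative assumptions, since the article does not disclose Didi's actual formula.

```python
# Hypothetical DataRank aggregation: weighted average of six health scores.
# The weights below are illustrative assumptions, not Didi's real values.
HEALTH_WEIGHTS = {
    "compute": 0.25,
    "storage": 0.25,
    "security": 0.15,
    "quality": 0.15,
    "model": 0.10,
    "value": 0.10,
}

def data_rank(scores: dict) -> float:
    """Weighted average of the six health scores (each on a 0-100 scale)."""
    return sum(HEALTH_WEIGHTS[k] * scores[k] for k in HEALTH_WEIGHTS)

# A table with weak compute/storage health drags the overall score down,
# which is why cost governance focuses on those two dimensions.
rank = data_rank({"compute": 60, "storage": 40, "security": 90,
                  "quality": 80, "model": 70, "value": 50})
```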
Cost governance work follows five core directions (cost, security, quality, model, value). Starting in 2019, the platform first tackled Hadoop storage costs (Governance 1.0). In 2022, Didi added pricing, cost visibility, and advanced governance capabilities (Governance 2.0) to take on harder optimization items and demonstrate business value.
Product cost is broken down into hardware (servers, network, middleware) and maintenance labor. By estimating total cost and applying a target profit margin, Didi derives target revenue and allocates prices across usage dimensions, enabling cost analysis by person, project, organization, and account.
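The cost-to-price derivation above is simple arithmetic. The sketch below uses entirely made-up figures (the article gives no numbers) to show the shape of the calculation:

```python
# Illustrative pricing derivation; all figures are assumptions, not Didi's.
hardware_cost = 8_000_000      # servers, network, middleware (per year)
labor_cost = 2_000_000         # maintenance labor (per year)
profit_margin = 0.10           # target margin applied on top of total cost

total_cost = hardware_cost + labor_cost
target_revenue = total_cost * (1 + profit_margin)

# Allocate target revenue to a usage dimension, e.g. billable storage,
# to obtain a unit price that can be charged back to teams.
billable_storage_tb = 50_000
price_per_tb_year = target_revenue / billable_storage_tb
```

With these assumed figures, target revenue is 11M and storage is priced at 220 per TB-year; real pricing would spread target revenue across several usage dimensions (storage, compute, etc.).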
Metadata from storage (Metastore, FSimage) and compute (runtime logs) is synchronized to an ODS layer, cleaned, and aggregated to produce billing statements that can be drilled down to identify cost drivers.
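The aggregation from cleaned ODS records into a drill-down billing statement can be sketched as a group-by over (account, project). The schema, prices, and rows here are assumptions for illustration only:

```python
# Minimal sketch of turning cleaned metadata rows into a billing statement
# keyed by (account, project). Schema and unit prices are assumptions.
from collections import defaultdict

ods_records = [  # rows joined from Metastore/FSimage and runtime logs
    {"account": "a1", "project": "p1", "storage_tb": 2.0, "compute_hours": 10},
    {"account": "a1", "project": "p2", "storage_tb": 0.5, "compute_hours": 40},
    {"account": "a2", "project": "p3", "storage_tb": 1.0, "compute_hours": 5},
]

STORAGE_PRICE = 30.0   # assumed price per TB
COMPUTE_PRICE = 0.8    # assumed price per compute-hour

bill = defaultdict(float)
for r in ods_records:
    cost = r["storage_tb"] * STORAGE_PRICE + r["compute_hours"] * COMPUTE_PRICE
    bill[(r["account"], r["project"])] += cost
```

Summing the same statement by account or organization instead of project gives the other drill-down views mentioned above.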
Hadoop cost governance includes pricing based on historical rates, server efficiency targets, usage forecasts, and profit expectations. Metadata ingestion enables cost visibility, and governance actions target both "invalid assets" (unused data) and "valid assets" (inefficiently used data). Recommendations, such as adjusting table lifecycles, cleaning up empty tables, and mitigating data skew, are generated from audit, Spark, and HDFS logs.
Data skew detection uses Spark and Hadoop execution logs to compute skew ratios; tasks that run longer than 15 minutes with a skew ratio above 5 are flagged. Mitigation strategies include hotspot handling, broadcast joins, repartitioning, and parameter tuning.
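The flagging rule above can be sketched as follows. The skew-ratio definition (max task duration over the median) and the field names are assumptions; the article does not specify the exact formula, only the thresholds:

```python
# Sketch of the skew-flagging rule: runtime > 15 min AND skew ratio > 5.
# The max/median ratio is an assumed definition of "skew ratio".
def skew_ratio(task_durations_sec):
    """Max task duration divided by the median task duration."""
    ordered = sorted(task_durations_sec)
    median = ordered[len(ordered) // 2]
    return max(ordered) / median

def is_skewed(job_runtime_min, task_durations_sec,
              min_runtime_min=15, max_ratio=5.0):
    return (job_runtime_min > min_runtime_min
            and skew_ratio(task_durations_sec) > max_ratio)

# A 22-minute job where one straggler task runs ~16x longer than the median:
flagged = is_skewed(22, [30, 35, 40, 38, 600])
```

In practice the task durations would be parsed from Spark event logs or Hadoop job history rather than hard-coded.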
Parameter optimization analyzes GC logs and memory usage to suggest appropriate memory sizes and concurrency levels, achieving roughly 1 TB of daily memory savings.
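A recommendation of this kind can be sketched as "observed peak usage plus headroom, rounded up". The 20% headroom and GB rounding are illustrative assumptions, not Didi's actual policy:

```python
# Hedged sketch of a memory right-sizing suggestion: peak observed usage
# (e.g. from GC logs) plus assumed 20% headroom, rounded up to whole GB.
import math

def suggest_memory_gb(peak_used_gb: float, headroom: float = 0.2) -> int:
    """Suggest an allocation covering peak usage with some safety margin."""
    return math.ceil(peak_used_gb * (1 + headroom))

# A job configured with 16 GB whose GC logs show a 6.2 GB peak heap:
suggested = suggest_memory_gb(6.2)  # recommend 8 GB instead of 16 GB
```

Applied across thousands of daily jobs, per-job savings of this kind are how aggregate reductions on the order of the reported 1 TB/day accumulate.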
Table‑level lineage is built from four sources: SQL parsing via ANTLR, Yarn path mapping, Spark LogicalPlan JSON output, and tag‑based scheduling dependencies. The union of these methods yields a 99.97 % accurate lineage graph, which, combined with access logs, informs safe resource decommissioning.
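Combining the four sources is a set union over table-to-table edges. The edges below are invented examples; only the four source categories come from the article:

```python
# Sketch of unioning lineage edges (src_table, dst_table) from four sources.
# All table names here are illustrative.
sql_parse_edges = {("ods.orders", "dwd.orders")}            # ANTLR SQL parsing
yarn_path_edges = {("dwd.orders", "dws.order_stats")}       # Yarn path mapping
spark_plan_edges = {("ods.orders", "dwd.orders"),           # LogicalPlan JSON
                    ("dwd.orders", "dws.order_stats")}
tag_dependency_edges = {("dws.order_stats", "ads.report")}  # scheduling tags

lineage = (sql_parse_edges | yarn_path_edges
           | spark_plan_edges | tag_dependency_edges)

def downstream(table, edges):
    """Tables directly fed by `table`; a table with no downstream consumers
    and no recent access-log hits is a decommissioning candidate."""
    return {dst for src, dst in edges if src == table}
```

The union naturally deduplicates edges found by more than one method, while each method covers blind spots of the others.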
ES cost governance mirrors Hadoop: pricing distinguishes shared versus dedicated clusters, with storage fees for hot and cold data. Metadata (snapshots, gateway logs, index info) is processed to provide per‑user, project, and account cost breakdowns, and governance actions target empty templates, lifecycle settings, and index field optimizations (e.g., disabling inverted or forward indexes).
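For the index-field optimizations, Elasticsearch lets a mapping disable the inverted index (`"index": false`) and doc_values, its columnar "forward" structure (`"doc_values": false`) per field. The sketch below shows such a mapping as a Python dict; the field names are invented for illustration:

```python
# Illustrative ES mapping for a field that is stored for display but never
# searched or aggregated. Field names are assumptions.
mapping = {
    "properties": {
        "trace_payload": {           # returned in results, never queried
            "type": "keyword",
            "index": False,          # no inverted index: cheaper indexing
            "doc_values": False,     # no doc_values: less disk usage
        },
        "user_id": {                 # still searchable and aggregatable
            "type": "keyword",
        },
    }
}
```

Disabling both structures on write-heavy, query-never fields trims both indexing CPU and storage, which is exactly the class of saving this governance step targets.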
Beyond Hadoop and ES, Didi applies similar cost governance to Flink, ClickHouse, and other platform components, following the same pricing, visibility, and metadata‑driven remediation workflow.
Key organizational insights emphasize top‑down budget targets, clear responsibility chains from the corporate budgeting committee to product teams, and incentive mechanisms such as data governance competitions to motivate engineers to actively participate in cost reduction.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.