Didi's Big Data Cost Governance Practices
This article details Didi's comprehensive big data cost governance framework, covering its data architecture, asset management scoring, Hadoop and Elasticsearch cost optimization methods, and practical insights on organizational processes and incentives for effective cost control.
Introduction: Didi shares its big data cost governance practice in four parts: the overall framework, Hadoop cost governance, Elasticsearch cost governance, and lessons learned.
Data system: Didi's data architecture consists of storage engines (offline, real‑time, OLAP, NoSQL, log retrieval, data pipelines), a data development platform “Data Dream Factory”, a data service platform “ShuYi”, and a data science layer. The whole stack is managed through two metadata‑centric products, “Data Map” and the “Asset Management Platform”.
Asset Management Platform: Measures six health scores (compute, storage, security, quality, model, value) and aggregates them into a DataRank; cost governance focuses on improving compute and storage scores.
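As a rough illustration of how six health scores could roll up into a single DataRank, here is a minimal sketch. The source does not say how the scores are combined, so the weighted average, the equal default weights, and the 0 to 100 score range are all assumptions made for the example.

```python
from __future__ import annotations

# The six dimensions named in the article.
DIMENSIONS = ("compute", "storage", "security", "quality", "model", "value")


def data_rank(scores: dict[str, float],
              weights: dict[str, float] | None = None) -> float:
    """Weighted average of the six health scores.

    Assumption: each score is in [0, 100] and the aggregation is a
    weighted mean. Didi's actual formula is not given in the source.
    """
    if weights is None:
        # Hypothetical default: every dimension counts equally.
        weights = {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight
```

Under this sketch, raising the compute and storage scores lifts the overall rank in proportion to their weights, which matches the article's point that cost governance works through those two dimensions.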
Cost governance workflow: Includes pricing, cost visibility, and optimization. Pricing is derived from hardware, middleware, and labor costs; cost visibility uses metadata (storage and compute) synchronized to a data warehouse and aggregated by user, project, organization, and account.
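The cost visibility step can be sketched as a simple roll-up over per-asset cost records. The record fields (`user`, `project`, `organization`, `account`, `storage_cost`, `compute_cost`) are hypothetical names chosen for this example, not the schema of Didi's warehouse.

```python
from collections import defaultdict

# Dimensions the article says costs are aggregated by.
ROLLUP_DIMENSIONS = ("user", "project", "organization", "account")


def aggregate_costs(records, dimension):
    """Sum storage and compute cost per value of one roll-up dimension.

    `records` is an iterable of dicts; the field names are assumptions
    made for this sketch.
    """
    if dimension not in ROLLUP_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["storage_cost"] + record["compute_cost"]
    return dict(totals)


# Example input: two assets owned by one user, one by another.
records = [
    {"user": "alice", "project": "etl", "organization": "data",
     "account": "a1", "storage_cost": 10.0, "compute_cost": 5.0},
    {"user": "alice", "project": "ads", "organization": "growth",
     "account": "a1", "storage_cost": 2.0, "compute_cost": 8.0},
    {"user": "bob", "project": "etl", "organization": "data",
     "account": "a2", "storage_cost": 1.0, "compute_cost": 1.0},
]
per_user = aggregate_costs(records, "user")
per_project = aggregate_costs(records, "project")
```

The same records feed every view, so a user, a project owner, and an organization lead each see a bill built from identical underlying metadata.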
Hadoop practice: Pricing splits into compute (CPU × runtime) and storage (regular storage, cold storage, and file count). Metadata from the Metastore, FSImage, and logs is cleaned and analyzed to separate “ineffective assets” (data that is no longer used) from “effective assets” that are still used but inefficiently. Recommendations include adjusting lifecycles, removing empty tables, mitigating data skew, and tuning job parameters, which together yielded significant memory savings.
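The pricing split and the asset classification above can be sketched as follows. The unit prices, the 90-day unused threshold, and all class and field names are hypothetical; the source gives only the structure (CPU × runtime for compute; regular storage, cold storage, and file count for storage).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class UnitPrices:
    # Hypothetical unit prices; the source does not publish real values.
    cpu_core_hour: float
    regular_tb: float
    cold_tb: float
    million_files: float


def hadoop_cost(prices, cpu_cores, runtime_hours, regular_tb, cold_tb, file_count):
    """Compute cost (CPU x runtime) plus storage cost (regular, cold, file count)."""
    compute = cpu_cores * runtime_hours * prices.cpu_core_hour
    storage = (regular_tb * prices.regular_tb
               + cold_tb * prices.cold_tb
               + file_count / 1_000_000 * prices.million_files)
    return compute + storage


@dataclass
class TableStats:
    name: str
    size_bytes: int
    last_access: Optional[date]  # None when no read appears in the logs


def classify(table, today, unused_days=90):
    """Label a table as 'empty', 'ineffective' (unused), or 'effective'.

    The 90-day window is an assumption for illustration only.
    """
    if table.size_bytes == 0:
        return "empty"
    if table.last_access is None or (today - table.last_access).days > unused_days:
        return "ineffective"
    return "effective"
```

Tables labeled `ineffective` or `empty` are candidates for lifecycle adjustment or deletion, while `effective` tables are where skew mitigation and parameter tuning apply.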
Elasticsearch practice: Pricing distinguishes storage on shared clusters (regular and cold) from hardware on dedicated clusters. Cost visibility draws on storage snapshots, index metadata, and gateway logs. Governance actions target empty templates, missing or unreasonable lifecycles, low‑access indices, and index field optimizations (disabling the inverted index or stored fields where a field does not need them).
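The governance checks listed above could be expressed as a small rule function. The thresholds (180-day maximum retention, 10 queries per 30 days) and the `IndexTemplate` fields are hypothetical. The mapping fragment relies on the standard Elasticsearch `index` mapping parameter, which controls whether a field is indexed for search.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexTemplate:
    # Field names are assumptions for this sketch.
    name: str
    doc_count: int
    retention_days: Optional[int]  # None when no lifecycle is configured
    queries_last_30d: int          # e.g. counted from gateway logs


def governance_findings(template: IndexTemplate,
                        max_retention_days: int = 180,
                        min_queries: int = 10) -> List[str]:
    """Return the governance issues that apply to one index template."""
    findings = []
    if template.doc_count == 0:
        findings.append("empty template")
    if template.retention_days is None:
        findings.append("missing lifecycle")
    elif template.retention_days > max_retention_days:
        findings.append("unreasonable lifecycle")
    if template.doc_count > 0 and template.queries_last_30d < min_queries:
        findings.append("low access")
    return findings


def unsearched_field_mapping(field_type: str) -> dict:
    """Mapping fragment for a field that is stored in _source but never searched.

    Setting "index" to False tells Elasticsearch not to index the field,
    which saves the space its index structures would otherwise take.
    """
    return {"type": field_type, "index": False}
```

A template can collect several findings at once, for example an index with no lifecycle that is also rarely queried, so the output is a list rather than a single label.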
Governance insights: Successful cost governance requires top‑down budget targets, clear responsibility chains, and incentives such as data‑governance competitions to motivate engineers.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.