Building a Scalable Big Data Service System at Didi: Practices and Lessons
Zhang Liang shares Didi's four-stage journey of constructing and governing large‑scale open‑source big‑data engine services—including engine selection, hardware sizing, PaaS platform building, proxy architecture, and governance—highlighting practical challenges, solutions, and ROI‑driven best practices for Kafka, Elasticsearch, Flink, and related technologies.
Zhang Liang, head of Didi Cloud Commercial Data, has been with Didi since 2014 and has led the design and development of high‑concurrency, high‑throughput data systems such as LogAgent, Kafka, Elasticsearch, and OLAP engines.
As background, he describes how Didi built a large‑scale open‑source data engine ecosystem (LogAgent, Kafka, Flink, Elasticsearch, ClickHouse) to support business‑driven BI, real‑time streaming at hundreds of MB/s, and petabyte‑level storage, and the stability, usability, and operations challenges that came with it.
The construction process is divided into four stages:
Engine Experience Phase: early decisions on engine, version, deployment architecture, and stability work.
Engine Development Phase: rapid user growth leads to a heavy support workload; a PaaS layer is needed to lower the learning curve and improve self‑service.
Engine Breakthrough Phase: scaling to large clusters pushes engine limits; internal iteration, bug fixing, and version upgrades become essential.
Engine Governance Phase: with a mature PaaS platform, governance focuses on SLA differentiation, misuse prevention, and cost‑aware resource management.
1) Engine Selection – Evaluate community star count, contributor activity, PMC/Committer presence, production adoption, meetup frequency, issue‑response speed, documentation richness, and deployment simplicity.
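These criteria lend themselves to a simple weighted scoring matrix. The sketch below is illustrative only: the criteria names mirror the list above, but the weights and the 0–10 rating scale are assumptions, not Didi's actual rubric.

```python
# Hypothetical weighted scoring matrix for engine selection.
# Weights and the 0-10 rating scale are illustrative, not Didi's rubric.
CRITERIA_WEIGHTS = {
    "community_stars": 0.10,
    "contributor_activity": 0.15,
    "pmc_committer_presence": 0.10,
    "production_adoption": 0.20,
    "meetup_frequency": 0.05,
    "issue_response_speed": 0.15,
    "documentation": 0.15,
    "deployment_simplicity": 0.10,
}

def score_engine(ratings: dict) -> float:
    """Weighted sum of 0-10 ratings, one per criterion; missing criteria score 0."""
    return sum(w * ratings.get(c, 0) for c, w in CRITERIA_WEIGHTS.items())
```

Ranking candidate engines by `score_engine` makes the trade-offs explicit and keeps the selection debate focused on weights rather than gut feel.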
2) Hardware Sizing – Classify workloads as IOPS‑oriented or TPS‑oriented; benchmark engines (e.g., Elasticsearch) on candidate machines; aim for balanced CPU, disk, network, and I/O usage. Didi’s 2020 optimization cut Elasticsearch log‑cluster cost by 50% and kept CPU utilization around 50%.
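The sizing exercise boils down to back-of-envelope arithmetic: take the larger of the write-bound and storage-bound node counts, with headroom so steady-state utilization stays near the ~50% target mentioned above. The per-node capacities and replication factor below are hypothetical placeholders, not Didi's benchmark results.

```python
import math

def estimate_nodes(write_mb_s: float, storage_tb: float,
                   per_node_write_mb_s: float = 50.0,  # assumed benchmark result
                   per_node_disk_tb: float = 8.0,      # assumed disk per node
                   replication: int = 2,
                   headroom: float = 0.5) -> int:
    """Back-of-envelope node count: the max of the write-bound and
    storage-bound estimates, derated so steady-state utilization
    stays around the headroom target (~50% here)."""
    effective_write = per_node_write_mb_s * headroom
    effective_disk = per_node_disk_tb * headroom
    write_nodes = math.ceil(write_mb_s * replication / effective_write)
    storage_nodes = math.ceil(storage_tb * replication / effective_disk)
    return max(write_nodes, storage_nodes, 3)  # never below a 3-node HA minimum
```

Whichever dimension dominates (IOPS/throughput vs. storage) tells you whether the workload is write-bound or capacity-bound, which in turn drives the machine type to buy.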
3) Deployment Choices – For Elasticsearch, a minimal HA cluster uses three nodes covering the master, client, and data roles. Small clusters (3‑5 nodes) can run with mixed roles; larger clusters (>10 nodes or >100 MB/s of writes) require role‑based deployment, with dedicated master, client, and data nodes, to avoid resource contention.
Deployment models can be single‑tenant per‑cluster for latency‑sensitive workloads or multi‑tenant large clusters for cost‑effective batch processing.
4) PaaS Platform Construction – Didi built Logi-KafkaManager, an open‑source Kafka monitoring and control platform (https://github.com/didi/Logi-KafkaManager), to automate resource creation and schema changes and to provide FAQs and best‑practice guidance.
5) Engine Service Architecture Upgrade – Adopt proxy layers (Kafka‑Gateway, ES‑Gateway, DB‑Proxy, Redis‑Proxy) to add enterprise features (security, rate‑limiting, high‑availability) while keeping the underlying open‑source engine untouched.
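To make the proxy idea concrete, one enterprise feature such a gateway layer typically adds is per-tenant rate limiting. Below is a minimal token-bucket sketch, an assumption-laden illustration rather than the actual Kafka-Gateway implementation:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refills `rate` tokens/second,
    allows bursts up to `capacity`. A gateway would keep one
    bucket per tenant and reject requests when allow() is False."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because throttling lives in the proxy, the open-source engine behind it needs no modification, which is exactly the point of the gateway pattern described above.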
Proxy architecture also enables smooth major version upgrades, exemplified by Didi’s migration from Elasticsearch 2.3 to 6.6.1.
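A gateway enables such migrations because it can dual-write to both clusters and shift read traffic gradually. The sketch below is a hypothetical illustration of that pattern; the cluster clients are placeholder objects, and nothing here reflects Didi's actual migration tooling.

```python
import random

class UpgradeRouter:
    """Hypothetical gateway-side router for a major version upgrade:
    writes go to both old and new clusters (dual-write) so the new
    cluster stays in sync; reads shift to the new cluster as
    read_pct_new is raised from 0.0 to 1.0, then the old cluster
    is retired. Cluster clients are placeholders with index/search."""
    def __init__(self, old_cluster, new_cluster, read_pct_new: float = 0.0):
        self.old = old_cluster
        self.new = new_cluster
        self.read_pct_new = read_pct_new

    def write(self, doc):
        self.old.index(doc)
        self.new.index(doc)  # dual-write during the migration window

    def read(self, query):
        target = self.new if random.random() < self.read_pct_new else self.old
        return target.search(query)
```

Clients never see the cutover: they keep talking to the proxy while `read_pct_new` is dialed up behind the scenes.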
6) Deep Engine Mastery – Developers should set up local environments, run test suites, read official docs, and regularly share source‑code insights; community participation (Meetups, Issue tracking) is essential for staying current.
7) Internal Branch Iteration – Bug fixes are applied either by back‑porting community patches or contributing new patches; Didi contributes >150 patches annually to Apache projects (Hadoop, Spark, Hive, Flink, HBase, Kafka, Elasticsearch, etc.).
8) Governance – Align business value with technical ROI by monitoring resource utilization, categorizing services (core vs. non‑core), and applying tiered SLA guarantees. Technical innovations such as FastIndex for offline indexing and Elasticsearch TPS‑doubling techniques have saved Didi millions of RMB annually.
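Tiered SLA governance can be expressed as a simple policy: core services get more headroom (scale out early), while non-core services are packed tighter and flagged for downsizing when underused. The thresholds below are invented for illustration and are not Didi's actual policy.

```python
# Illustrative tiered-SLA governance rule; thresholds are assumptions.
def governance_action(tier: str, cpu_util: float) -> str:
    """Map a service tier and its CPU utilization to an action.
    Core services scale out earlier (more headroom); non-core
    services tolerate higher utilization before scaling."""
    thresholds = {"core": (0.3, 0.6), "non-core": (0.5, 0.8)}
    low, high = thresholds[tier]
    if cpu_util < low:
        return "downsize"   # paying for idle capacity: negative ROI
    if cpu_util > high:
        return "scale-out"  # SLA at risk for this tier
    return "ok"
```

Running a rule like this over utilization metrics turns "align business value with technical ROI" from a slogan into a concrete, auditable loop.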
Finally, the speaker invites the audience to join the DataFunTalk community, attend upcoming big‑data salons, and scan QR codes for more resources.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.