Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform
Logi‑KafkaManager is Didi's multi‑tenant Kafka cloud platform. It consolidates dozens of clusters behind a gateway that enforces security and tenant isolation, and it offers web‑based topic management, real‑time metric visualization, automated diagnostics, quota governance and safe scaling. It has earned high internal satisfaction and has been commercialized for enterprise customers.
Didi operates dozens of Kafka clusters with more than 450 nodes, over 20,000 topics and a daily traffic of more than 2 trillion messages. Hundreds of users create topics, request quotas, view metrics and perform other operations every week, which creates a heavy operational burden.
To meet these needs, Didi built Logi‑KafkaManager, a shared, multi‑tenant Kafka cloud platform focused on operational control, monitoring, alerting and resource governance. The platform achieved a 90% internal satisfaction score and has been commercialized with several enterprises.
Requirement Analysis
Data security – topic‑level access control is weak in the public clusters, leading to potential data leakage.
Service stability – traffic spikes on a single topic can affect other topics sharing the same cluster.
User friendliness – high‑frequency operations such as topic creation, quota adjustment and metric viewing must be simple.
Problem locating efficiency – operators need quick insight into message rates, backlog, partition distribution, broker load, replica sync, etc.
Operational convenience – deployment, upgrade, scaling, topic migration and leader rebalance must be safe and efficient.
Overall Design
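The design follows a “one point, three ‑izations” principle: one core concern (security and stability) plus platformization, visualization and expertization.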
One point: security and stability are the core; a Kafka gateway provides authentication, multi‑tenant isolation and traffic limiting.
Platformization: high‑frequency user and ops actions are encapsulated in a web platform to reduce manual effort.
Visualization: metrics are displayed intuitively so users can perceive cluster health at a glance.
Expertization: operational experience is encoded into the platform, offering intelligent diagnostics.
The system is divided into five layers:
Resource layer: apart from ZooKeeper, which Kafka already depends on, MySQL is the only external dependency, which simplifies deployment.
Engine layer: based on Kafka 2.5 with custom features such as disk‑overload protection, fully compatible with open‑source Kafka.
Gateway layer: provides security checks, topic rate‑limiting, service discovery and downgrade capabilities.
Service layer: offers topic management, monitoring, cluster management and other functions via the web UI.
Platform layer: a unified web portal for ordinary users and operators, exposing high‑frequency operations.
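The gateway layer's per‑topic rate limiting can be illustrated with a token bucket. This is a minimal sketch under assumptions, not the platform's actual implementation: the article does not say which algorithm the gateway uses, and the names TokenBucket and TopicLimiter, the bytes‑per‑second quota unit, and the choice to reject unknown topics are all hypothetical.

```python
class TokenBucket:
    """Token bucket: refills at `rate` units per second up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # refill speed, e.g. bytes per second
        self.capacity = capacity  # largest burst the bucket can absorb
        self.tokens = capacity    # start full
        self.last = now           # timestamp of the previous refill

    def allow(self, size, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


class TopicLimiter:
    """Hypothetical gateway component: one bucket per topic quota."""

    def __init__(self, quotas):
        # quotas maps topic name -> allowed bytes per second (assumed unit).
        self.buckets = {t: TokenBucket(r, r) for t, r in quotas.items()}

    def allow(self, topic, size, now):
        bucket = self.buckets.get(topic)
        if bucket is None:
            return False  # assumption: topics without a quota are rejected
        return bucket.allow(size, now)
```

Because each topic has its own bucket, a burst on one topic drains only that topic's tokens, which is the property the requirement on service stability asks for. Timestamps are passed in explicitly so the behaviour is easy to test; a real gateway would read a monotonic clock.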
Security Architecture
A metadata gateway is introduced to enforce authentication and isolation. Multi‑tenant isolation is achieved by grouping brokers into logical regions; a topic can only be created within a selected region, so traffic on topics in one region cannot interfere with topics in another.
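The region model can be sketched as a mapping from region names to disjoint broker sets, with replica placement restricted to the chosen region. This is an illustrative sketch only: the Cluster class, its method names, and the round‑robin placement rule are assumptions, since the article does not describe the platform's data model or assignment strategy.

```python
class Cluster:
    """Hypothetical model of a physical cluster split into logical regions."""

    def __init__(self, broker_ids):
        self.brokers = set(broker_ids)
        self.regions = {}  # region name -> set of broker ids

    def add_region(self, name, broker_ids):
        ids = set(broker_ids)
        if not ids <= self.brokers:
            raise ValueError("region contains brokers outside the cluster")
        # Regions must not share brokers, otherwise tenants are not isolated.
        for other, members in self.regions.items():
            if ids & members:
                raise ValueError(f"brokers overlap with region {other!r}")
        self.regions[name] = ids

    def assign_replicas(self, region, partitions, replication_factor):
        """Place every replica on brokers of `region` only (round-robin)."""
        brokers = sorted(self.regions[region])
        if replication_factor > len(brokers):
            raise ValueError("replication factor exceeds region size")
        n = len(brokers)
        return [
            [brokers[(p + r) % n] for r in range(replication_factor)]
            for p in range(partitions)
        ]
```

For example, with brokers 1 to 6 split into a region of brokers 1 to 3 and a region of brokers 4 to 6, a topic created in the first region only ever lands on brokers 1 to 3, so a traffic spike on it cannot load the brokers serving the second region.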
Key Features
Metadata gateway – service discovery, transparent broker address changes, and per‑topic rate limiting.
Multi‑tenant isolation model – logical clusters (regions) isolate topics and their traffic.
Gold‑metric visualization – key metrics such as messageIn and byteIn are highlighted, and a health score aggregates weighted metrics.
Expert services – automated detection of hotspot partitions, partition insufficiency, and unused topics, with guided remediation.
Partition hotspot migration – detects uneven disk usage and triggers a controlled migration that does not affect stability.
Partition expansion – monitors per‑partition traffic, alerts when a topic exceeds its capacity, and automatically expands partitions.
Topic resource governance – identifies idle topics and cleans them up to free resources.
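Three of the features above reduce to small calculations, sketched here under stated assumptions. The article gives no formulas, so the check names, the weights, the 20% tolerance and the per‑partition throughput limit are all hypothetical values chosen for illustration.

```python
import math
from statistics import mean


def health_score(checks, weights):
    """Weighted share of passing checks, scaled to 0-100.

    checks maps check name -> bool; weights maps check name -> weight.
    A check missing from `checks` counts as failed.
    """
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if checks.get(name, False))
    return 100.0 * passed / total


def hot_brokers(disk_usage, tolerance=0.2):
    """Brokers whose disk usage exceeds the mean by more than `tolerance`."""
    threshold = mean(disk_usage.values()) * (1 + tolerance)
    return sorted(b for b, used in disk_usage.items() if used > threshold)


def partitions_needed(peak_bytes_per_sec, per_partition_limit):
    """Smallest partition count that keeps each partition under the limit."""
    return max(1, math.ceil(peak_bytes_per_sec / per_partition_limit))
```

A topic peaking at 25 MB/s with an assumed limit of 10 MB/s per partition would need three partitions; if it currently has fewer, that gap is what an expansion alert would report. Brokers returned by hot_brokers are candidates for moving partitions away from.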
Future Outlook
Smooth cross‑cluster migration using MirrorMaker + KafkaGateway.
Support for newer Kafka versions (2.5+) and addition of more critical metrics.
Continuous open‑source contributions and community collaboration.
Didi Tech
Official Didi technology account