Optimizing Kafka at Meituan: Challenges and Solutions for Large‑Scale Cluster Management
This article details Meituan's Kafka deployment: the platform's current scale, the challenges that scale brings, and a series of optimizations, including read/write latency reductions at both the application and system layers, large-scale cluster management strategies, full-link monitoring, service lifecycle management, and future directions, all aimed at improving the streaming platform's performance, reliability, and scalability.
Kafka serves as the unified data cache and distribution layer in Meituan's data platform, handling over 15,000 machines and daily traffic exceeding 30 PB, which brings significant scalability and latency challenges.
1. Current Situation and Challenges
The main challenges fall into two groups. First, slow nodes push TP99 read/write latency above 300 ms, driven by load imbalance, PageCache capacity limits, and defects in the consumer thread model. Second, managing a cluster of this scale is complex: topics interfere with one another, broker-level metrics are insufficient, fault detection is delayed, and rack-level failures can affect all replicas of a partition.
2. Read/Write Latency Optimization
2.1 Overview – The latency issues are divided into application‑layer and system‑layer factors.
Application Layer
Disk imbalance causing hotspots and uneven utilization.
Inefficient partition migration (serial batch submission, night‑time execution, shared Fetcher threads).
Consumer single‑thread model causing metric distortion.
System Layer
PageCache pollution from mixed read/write workloads.
HDD random‑write performance degradation.
Resource contention between I/O‑intensive Kafka and CPU‑intensive Flink/Storm in mixed deployments.
Solutions include disk balancing with a free‑disk‑first migration plan managed by Rebalancer, pipeline acceleration for migration, migration cancellation, Fetcher isolation, and consumer asynchronous pulling.
2.2 Application‑Layer Optimizations
Disk balancing: generate and submit migration plans to evenly distribute partitions.
Pipeline acceleration: allow new partitions to be submitted while a slow partition is still processing.
Migration cancellation: abort long‑running migrations to prevent PageCache pollution and allow partition expansion.
Fetcher isolation: separate ISR followers from non‑ISR followers to protect real‑time reads.
Consumer asynchronous pulling: introduce background threads to fetch ready data and feed the CompleteQueue, limiting concurrency to avoid GC/OOM.
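The free-disk-first disk-balancing step above can be sketched as a greedy planner: repeatedly move the largest movable partition from the fullest disk to the freest one until usage converges on the mean. This is a simplified illustration in Python; the data shapes, tolerance, and function name are assumptions, not Meituan's actual Rebalancer.

```python
def plan_migrations(disks, tolerance=0.05):
    """Greedy free-disk-first balancing sketch (hypothetical plan format).

    `disks` maps disk name -> {partition: size_bytes}. Returns a list of
    (partition, src_disk, dst_disk) moves that brings each disk's usage
    within `tolerance` of the mean, when possible.
    """
    moves = []
    parts = {d: dict(p) for d, p in disks.items()}          # mutable copy
    usage = {d: sum(p.values()) for d, p in parts.items()}  # bytes per disk
    mean = sum(usage.values()) / len(usage)
    for _ in range(1000):  # safety bound on plan length
        src = max(usage, key=usage.get)   # fullest disk
        dst = min(usage, key=usage.get)   # freest disk
        if usage[src] - mean <= mean * tolerance or not parts[src]:
            break  # already balanced enough, or nothing left to move
        # Pick the largest partition that still fits under the tolerance band.
        candidates = [(p, s) for p, s in parts[src].items()
                      if usage[dst] + s <= mean * (1 + tolerance)]
        if not candidates:
            break
        part, size = max(candidates, key=lambda x: x[1])
        del parts[src][part]
        parts[dst][part] = size
        usage[src] -= size
        usage[dst] += size
        moves.append((part, src, dst))
    return moves
```

A real planner would also weigh per-partition traffic, not just size, and submit the resulting plan through the reassignment API rather than return tuples.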
2.3 System‑Layer Optimizations
RAID card acceleration: use the RAID controller's cache to merge small writes and improve random‑write performance on HDDs.
Cgroup isolation: dedicate physical cores to Kafka, keep all hyper‑threads on the same NUMA node, and prevent CPU contention with Flink.
Hybrid SSD cache architecture: store recent segments on SSD, sync to HDD asynchronously, avoid PageCache pollution, and implement space‑aware eviction.
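The space-aware eviction in the hybrid SSD cache can be illustrated with a watermark policy: once SSD usage crosses a high watermark, drop the oldest segments that have already been synced to HDD until usage falls below a low watermark. This is a minimal sketch; the segment tuple shape, watermarks, and function name are assumptions, not the actual implementation.

```python
def evict_segments(segments, ssd_capacity, high_watermark=0.9, low_watermark=0.8):
    """Space-aware eviction sketch for an SSD cache layer (hypothetical).

    `segments` is a list of (name, size_bytes, synced_to_hdd) ordered
    oldest-first. Returns (evicted_names, remaining_segments). Segments
    not yet synced to HDD are never evicted, so no data is lost.
    """
    used = sum(size for _, size, _ in segments)
    if used <= ssd_capacity * high_watermark:
        return [], segments  # below the high watermark: nothing to do
    evicted, remaining = [], []
    for name, size, synced in segments:
        # Evict oldest synced segments first, until the low watermark is met.
        if used > ssd_capacity * low_watermark and synced:
            evicted.append(name)
            used -= size
        else:
            remaining.append((name, size, synced))
    return evicted, remaining
```

Keeping eviction keyed to the HDD sync state is what lets the cache absorb write bursts without ever serving a read from data that exists only on the evicted SSD copy.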
3. Large‑Scale Cluster Management Optimization
Isolation strategy: business‑level isolation (separate clusters per business), role‑level isolation (dedicated brokers, controllers, Zookeeper), and priority isolation (VIP clusters for high‑availability topics).
Full‑link monitoring: collect metrics from all Kafka components, enabling rapid pinpointing of bottlenecks such as RemoteTime dominance.
Service lifecycle management: integrate service and machine state, automate status changes, and prohibit manual overrides.
TOR disaster recovery: ensure replicas of a partition are placed in different racks to survive rack failures.
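The TOR disaster-recovery constraint above amounts to a simple invariant: no partition's replica set may live entirely within one rack. A checker for that invariant could look like the following sketch (the data shapes and function name are illustrative assumptions, not Meituan's tooling; Kafka itself supports rack-aware assignment via the `broker.rack` setting).

```python
def violates_rack_safety(assignment, broker_rack):
    """Return the partitions whose replicas all sit in a single rack.

    `assignment` maps partition -> list of broker ids;
    `broker_rack` maps broker id -> rack id. A multi-replica partition
    confined to one rack would be lost entirely if that TOR switch fails.
    """
    at_risk = []
    for partition, replicas in assignment.items():
        racks = {broker_rack[b] for b in replicas}
        if len(replicas) > 1 and len(racks) < 2:
            at_risk.append(partition)
    return at_risk
```

Running such a check continuously (and triggering reassignment for flagged partitions) turns rack placement from a creation-time guarantee into an enforced invariant.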
4. Future Outlook
Future work will focus on improving robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap of services, network back‑pressure, and exploring cloud‑native deployment of Kafka while maintaining current cost and performance targets.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.