
Meituan's Kafka Architecture: Challenges and Optimizations at Massive Scale

This article describes how Meituan's Kafka platform, which runs on more than 15,000 machines and handles petabytes of traffic per day, deals with read/write latency, slow nodes, and the difficulty of managing very large clusters. It covers a series of application-layer, system-layer, and operational optimizations that improve performance and reliability: disk balancing, migration pipelines, Fetcher isolation, asynchronous consumers, SSD caching, isolation strategies, full-link monitoring, lifecycle management, and TOR disaster recovery.

Top Architect

Kafka serves as the unified data cache and distribution layer in Meituan's data platform. The deployment spans more than 15,000 machines, with clusters of up to 2,000 nodes, daily traffic exceeding 30 PB, and peaks of 4 billion messages per second.

Current Situation and Challenges

Slow nodes (TP99 > 300 ms), caused by load imbalance, PageCache capacity limits, and defects in the Consumer threading model.

The complexity of managing very large clusters, including interference between topics, insufficient broker metrics, delayed fault detection, and rack-level failures.
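
To make the slow-node criterion concrete, here is a minimal sketch of how a TP99 check against the 300 ms threshold could look. The function names, the nearest-rank percentile method, and the input shape (a dict of latency samples per broker) are assumptions for illustration, not Meituan's actual tooling.

```python
SLOW_NODE_TP99_MS = 300.0  # threshold quoted in the article


def tp99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def slow_brokers(samples_by_broker, threshold_ms=SLOW_NODE_TP99_MS):
    """Return the brokers whose TP99 exceeds the threshold, sorted by name."""
    return sorted(
        broker
        for broker, samples in samples_by_broker.items()
        if samples and tp99(samples) > threshold_ms
    )
```

Note that a single outlier in 100 samples does not move TP99 under this definition; at least 2% of requests must be slow before a broker is flagged.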

Read/Write Latency Optimizations

Application Layer

Disk balancing using an idle‑disk‑first partition migration plan managed by Rebalancer.
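
The article does not spell out how Rebalancer builds its plan, so the following is only a rough sketch of one way an idle-disk-first plan could be generated: repeatedly move a partition from the fullest disk to the emptiest one until the gap is small. The function name, the tolerance parameter, and the rule for picking which partition to move are all hypothetical.

```python
def plan_migrations(disk_partitions, tolerance=0.1, max_moves=100):
    """Greedy idle-disk-first plan (illustrative only).

    disk_partitions: {disk: {partition: size}}
    Returns a list of (partition, source_disk, destination_disk) moves.
    """
    layout = {disk: dict(parts) for disk, parts in disk_partitions.items()}
    usage = {disk: sum(parts.values()) for disk, parts in layout.items()}
    total = sum(usage.values())
    if not layout or total == 0:
        return []
    mean = total / len(layout)

    moves = []
    for _ in range(max_moves):
        dst = min(usage, key=usage.get)  # the idlest disk is filled first
        src = max(usage, key=usage.get)  # taken from the fullest disk
        gap = usage[src] - usage[dst]
        if gap <= tolerance * mean:
            break  # balanced enough (assumed stopping rule)
        # Only partitions smaller than the gap strictly shrink the imbalance;
        # among them, prefer the one that leaves the two disks closest.
        candidates = [
            (abs(gap - 2 * size), part, size)
            for part, size in layout[src].items()
            if 0 < size < gap
        ]
        if not candidates:
            break
        _, part, size = min(candidates)
        del layout[src][part]
        layout[dst][part] = size
        usage[src] -= size
        usage[dst] += size
        moves.append((part, src, dst))
    return moves
```

A real planner would also have to respect replica placement and per-broker limits, which this sketch ignores.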

Migration pipeline acceleration, migration cancellation, and Fetcher isolation to separate real‑time and delayed reads.

Asynchronous Consumer redesign that adds a dedicated fetch thread to avoid single-thread bottlenecks.
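
The idea behind the asynchronous fetch thread can be sketched as follows: a background thread keeps pulling batches into a bounded buffer, so the application's poll call only reads from memory. This is a simplified illustration, not the Kafka client API; the class name, the `fetch_batch` callable, and the buffer size are assumptions.

```python
import queue
import threading


class AsyncFetchConsumer:
    """Toy consumer whose network fetches run on a separate thread."""

    def __init__(self, fetch_batch, max_buffered_batches=8):
        # fetch_batch is a hypothetical callable that returns a list of
        # records, or None when there is nothing more to fetch.
        self._fetch_batch = fetch_batch
        self._buffer = queue.Queue(maxsize=max_buffered_batches)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._fetch_loop, daemon=True)

    def start(self):
        self._thread.start()

    def _fetch_loop(self):
        while not self._stop.is_set():
            batch = self._fetch_batch()
            if batch is None:
                break
            # Retry the put so a full buffer applies back-pressure to the
            # fetch thread without blocking shutdown forever.
            while not self._stop.is_set():
                try:
                    self._buffer.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue

    def poll(self, timeout=1.0):
        """Return the next buffered batch, or an empty list on timeout."""
        try:
            return self._buffer.get(timeout=timeout)
        except queue.Empty:
            return []

    def close(self):
        self._stop.set()
        self._thread.join(timeout=1.0)
```

The bounded queue is a deliberate choice here: it keeps the fetch thread from running arbitrarily far ahead of a slow application thread.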

System Layer

RAID card acceleration to improve HDD random write performance.

Cgroup isolation to prevent resource contention between IO‑intensive Kafka and CPU‑intensive Flink/Storm workloads.

Hybrid SSD cache architecture that stores recent segments on SSD, synchronizes to HDD, and avoids PageCache pollution.
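
As a rough model of the hybrid SSD cache, the sketch below keeps the newest segments on an SSD tier, copies them to an HDD tier, and serves each read from whichever tier holds the segment. The class, its method names, and the eviction rule (drop the oldest segment from SSD only after it has been synchronized) are assumptions made for illustration; the article does not describe the implementation at this level.

```python
from collections import OrderedDict


class HybridSegmentStore:
    """In-memory model of an SSD tier in front of an HDD tier."""

    def __init__(self, ssd_capacity_segments):
        self.ssd_capacity = ssd_capacity_segments
        self.ssd = OrderedDict()  # segment_id -> data, oldest first
        self.hdd = {}             # segment_id -> data

    def append_segment(self, segment_id, data):
        """New segments always land on SSD first."""
        self.ssd[segment_id] = data
        self._evict()

    def sync_to_hdd(self):
        """Copy every SSD segment that is not yet on HDD, then evict."""
        for segment_id, data in self.ssd.items():
            self.hdd.setdefault(segment_id, data)
        self._evict()

    def _evict(self):
        # Remove the oldest SSD segments while over capacity, but never
        # drop one that has not been synchronized to HDD yet.
        while len(self.ssd) > self.ssd_capacity:
            oldest = next(iter(self.ssd))
            if oldest not in self.hdd:
                break
            del self.ssd[oldest]

    def read(self, segment_id):
        """Return (tier, data); recent segments are served from SSD."""
        if segment_id in self.ssd:
            return "ssd", self.ssd[segment_id]
        if segment_id in self.hdd:
            return "hdd", self.hdd[segment_id]
        raise KeyError(segment_id)
```

In this model, consumers reading recent data hit only the SSD tier, while lagging consumers reading old segments go to HDD and leave the SSD contents untouched.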

Large‑Scale Cluster Management Optimizations

Business, role, and priority isolation (separate clusters per business, dedicated broker/controller nodes, VIP clusters).

Full‑link monitoring of Kafka components to quickly locate latency sources and detect faults.

Service lifecycle management linking service and machine states with automated state transitions.

TOR disaster recovery ensuring replicas of a partition are placed on different racks.

Future Outlook

Plans include further improving robustness through finer-grained isolation, client-side fault avoidance, hot-drain support, network back-pressure, and exploring cloud-native deployments of Kafka.
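
The rack-level placement rule described under TOR disaster recovery (each replica of a partition on a different rack) can be illustrated with a small sketch. The function name, the broker-to-rack mapping, and the round-robin selection are hypothetical and do not reflect Meituan's actual assignment algorithm.

```python
def place_replicas(broker_racks, partition, replication_factor):
    """Pick one broker per rack so no two replicas share a rack.

    broker_racks: {broker_id: rack_name}
    Returns a list of broker ids, one per replica.
    """
    racks = {}
    for broker in sorted(broker_racks):
        racks.setdefault(broker_racks[broker], []).append(broker)
    rack_names = sorted(racks)
    if replication_factor > len(rack_names):
        raise ValueError("not enough racks for the requested replication factor")

    replicas = []
    for i in range(replication_factor):
        # Rotate the starting rack by partition number to spread load.
        rack = rack_names[(partition + i) % len(rack_names)]
        brokers = racks[rack]
        # Rotate within the rack so its brokers take turns.
        replicas.append(brokers[(partition // len(rack_names)) % len(brokers)])
    return replicas
```

With this layout, losing a single rack or its top-of-rack switch removes at most one replica of any partition.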
