Optimizing Kafka at Meituan: Challenges and Solutions for Large‑Scale Cluster Management
This article details Meituan's Kafka deployment: the platform's current scale, the challenges that scale brings, and a series of optimizations, including read/write latency reductions at both the application and system layers, large-scale cluster management strategies, full-link monitoring, service lifecycle management, and future directions, all aimed at improving the streaming platform's performance, reliability, and scalability.
Kafka serves as the unified data cache and distribution layer in Meituan's data platform, handling over 15,000 machines and daily traffic exceeding 30 PB, which brings significant scalability and latency challenges.
1. Current Situation and Challenges
The main challenges fall into two groups. First, slow nodes push TP99 read/write latency above 300 ms, driven by load imbalance, PageCache capacity limits, and defects in the consumer thread model. Second, managing a cluster of this scale is complex: topics interfere with one another, broker-level metrics are insufficient, fault detection is delayed, and rack-level failures can affect all replicas of a partition.
2. Read/Write Latency Optimization
2.1 Overview – The latency issues are divided into application‑layer and system‑layer factors.
Application Layer
Disk imbalance causing hotspots and uneven utilization.
Inefficient partition migration (serial batch submission, night‑time execution, shared Fetcher threads).
Consumer single‑thread model causing metric distortion.
System Layer
PageCache pollution from mixed read/write workloads.
HDD random‑write performance degradation.
Resource contention between I/O‑intensive Kafka and CPU‑intensive Flink/Storm in mixed deployments.
Solutions include disk balancing with a free‑disk‑first migration plan managed by Rebalancer, pipeline acceleration for migration, migration cancellation, Fetcher isolation, and consumer asynchronous pulling.
2.2 Application‑Layer Optimizations
Disk balancing: generate and submit migration plans to evenly distribute partitions.
Pipeline acceleration: allow new partitions to be submitted while a slow partition is still processing.
Migration cancellation: abort long‑running migrations to prevent PageCache pollution and allow partition expansion.
Fetcher isolation: separate ISR followers from non‑ISR followers to protect real‑time reads.
Consumer asynchronous pulling: introduce background threads to fetch ready data and feed the CompleteQueue, limiting concurrency to avoid GC/OOM.
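The free-disk-first disk-balancing step above can be sketched as a greedy planner: repeatedly move the largest movable partition from the fullest disk to the freest one until usage converges on the mean. This is a simplified illustration in Python; the data shapes, tolerance, and function name are assumptions, not Meituan's actual Rebalancer.

```python
def plan_migrations(disks, tolerance=0.05):
    """Greedy free-disk-first balancing sketch (hypothetical plan format).

    `disks` maps disk name -> {partition: size_bytes}. Returns a list of
    (partition, src_disk, dst_disk) moves that brings each disk's usage
    within `tolerance` of the mean, when possible.
    """
    moves = []
    parts = {d: dict(p) for d, p in disks.items()}          # mutable copy
    usage = {d: sum(p.values()) for d, p in parts.items()}  # bytes per disk
    mean = sum(usage.values()) / len(usage)
    for _ in range(1000):  # safety bound on plan length
        src = max(usage, key=usage.get)   # fullest disk
        dst = min(usage, key=usage.get)   # freest disk
        if usage[src] - mean <= mean * tolerance or not parts[src]:
            break  # already balanced enough, or nothing left to move
        # Pick the largest partition that still fits under the tolerance band.
        candidates = [(p, s) for p, s in parts[src].items()
                      if usage[dst] + s <= mean * (1 + tolerance)]
        if not candidates:
            break
        part, size = max(candidates, key=lambda x: x[1])
        del parts[src][part]
        parts[dst][part] = size
        usage[src] -= size
        usage[dst] += size
        moves.append((part, src, dst))
    return moves
```

A real planner would also weigh per-partition traffic, not just size, and submit the resulting plan through the reassignment API rather than return tuples.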
2.3 System‑Layer Optimizations
RAID card acceleration: use the RAID controller's cache to merge small writes and improve random‑write performance on HDDs.
Cgroup isolation: dedicate physical cores to Kafka, keep all hyper‑threads on the same NUMA node, and prevent CPU contention with Flink.
Hybrid SSD cache architecture: store recent segments on SSD, sync to HDD asynchronously, avoid PageCache pollution, and implement space‑aware eviction.
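The space-aware eviction in the hybrid SSD cache can be illustrated with a watermark policy: once SSD usage crosses a high watermark, drop the oldest segments that have already been synced to HDD until usage falls below a low watermark. This is a minimal sketch; the segment tuple shape, watermarks, and function name are assumptions, not the actual implementation.

```python
def evict_segments(segments, ssd_capacity, high_watermark=0.9, low_watermark=0.8):
    """Space-aware eviction sketch for an SSD cache layer (hypothetical).

    `segments` is a list of (name, size_bytes, synced_to_hdd) ordered
    oldest-first. Returns (evicted_names, remaining_segments). Segments
    not yet synced to HDD are never evicted, so no data is lost.
    """
    used = sum(size for _, size, _ in segments)
    if used <= ssd_capacity * high_watermark:
        return [], segments  # below the high watermark: nothing to do
    evicted, remaining = [], []
    for name, size, synced in segments:
        # Evict oldest synced segments first, until the low watermark is met.
        if used > ssd_capacity * low_watermark and synced:
            evicted.append(name)
            used -= size
        else:
            remaining.append((name, size, synced))
    return evicted, remaining
```

Keeping eviction keyed to the HDD sync state is what lets the cache absorb write bursts without ever serving a read from data that exists only on the evicted SSD copy.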
3. Large‑Scale Cluster Management Optimization
Isolation strategy: business‑level isolation (separate clusters per business), role‑level isolation (dedicated brokers, controllers, Zookeeper), and priority isolation (VIP clusters for high‑availability topics).
Full‑link monitoring: collect metrics from all Kafka components, enabling rapid pinpointing of bottlenecks such as RemoteTime dominance.
Service lifecycle management: integrate service and machine state, automate status changes, and prohibit manual overrides.
TOR disaster recovery: ensure replicas of a partition are placed in different racks to survive rack failures.
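The TOR disaster-recovery constraint above amounts to a simple invariant: no partition's replica set may live entirely within one rack. A checker for that invariant could look like the following sketch (the data shapes and function name are illustrative assumptions, not Meituan's tooling; Kafka itself supports rack-aware assignment via the `broker.rack` setting).

```python
def violates_rack_safety(assignment, broker_rack):
    """Return the partitions whose replicas all sit in a single rack.

    `assignment` maps partition -> list of broker ids;
    `broker_rack` maps broker id -> rack id. A multi-replica partition
    confined to one rack would be lost entirely if that TOR switch fails.
    """
    at_risk = []
    for partition, replicas in assignment.items():
        racks = {broker_rack[b] for b in replicas}
        if len(replicas) > 1 and len(racks) < 2:
            at_risk.append(partition)
    return at_risk
```

Running such a check continuously (and triggering reassignment for flagged partitions) turns rack placement from a creation-time guarantee into an enforced invariant.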
4. Future Outlook
Future work will focus on improving robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap of services, network back‑pressure, and exploring cloud‑native deployment of Kafka while maintaining current cost and performance targets.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.