
Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

DataFunTalk

The talk, presented by Meituan's storage engineer Zhao Haiyuan and edited by Liu Ming, introduces the practical use of Kafka within Meituan's data platform.

Current State and Challenges – Meituan operates over 7,500 Kafka nodes, with single clusters of up to 1,500 machines, handling daily traffic exceeding 21 PB and 11.3 trillion messages. The main challenges are sustaining low read/write latency under this load and operating clusters of this size and complexity.

Latency Optimization – Issues stem from slow nodes (tp99 > 300 ms), disk‑I/O bottlenecks, and consumer thread model flaws. Solutions are split into application‑layer and system‑layer improvements: disk balancing via partition migration, pipeline acceleration, migration cancellation, fetcher isolation, and asynchronous consumer processing.
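The talk does not include code, but the asynchronous-consumer idea can be illustrated with a hypothetical sketch: one fetch loop only enqueues records into a bounded queue, while a pool of worker threads does the actual processing, so a slow handler never stalls fetching. Function and variable names here are illustrative, not Meituan's implementation.

```python
import queue
import threading

def run_async_pipeline(messages, num_workers=4):
    """Decouple fetching from processing: the fetch loop enqueues records,
    a worker pool processes them, so slow handlers don't block the fetcher."""
    work_q = queue.Queue(maxsize=100)   # bounded queue applies back-pressure
    results, lock = [], threading.Lock()

    def worker():
        while True:
            msg = work_q.get()
            if msg is None:             # poison pill: shut this worker down
                return
            processed = msg.upper()     # stand-in for real message handling
            with lock:
                results.append(processed)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for msg in messages:                # the "fetch" loop: enqueue only
        work_q.put(msg)
    for _ in workers:                   # one poison pill per worker
        work_q.put(None)
    for w in workers:
        w.join()
    return results
```

The bounded queue is the key design choice: when workers fall behind, `put` blocks and naturally throttles the fetch loop instead of letting memory grow unbounded.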

Application‑Layer Optimizations – Implemented disk‑balancing with a rebalancer that generates migration plans based on broker disk usage, introduced pipeline acceleration to avoid long‑tail partition stalls, added migration cancellation to prevent page‑cache pollution, and isolated fetchers for ISR and non‑ISR followers.
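A rebalancer that generates migration plans from broker disk usage can be sketched, under stated assumptions, as a greedy loop: repeatedly move a large partition from the fullest broker to the emptiest one, as long as the move shrinks the gap between them. This is a simplified stand-in for whatever plan generator Meituan actually uses; all names are hypothetical.

```python
def plan_migrations(broker_partitions, max_moves=10):
    """Greedy disk rebalance sketch.

    broker_partitions: {broker_id: {partition_name: size_bytes}}
    Returns a list of (partition, src_broker, dst_broker) moves.
    """
    moves = []
    for _ in range(max_moves):
        usage = {b: sum(p.values()) for b, p in broker_partitions.items()}
        src = max(usage, key=usage.get)   # fullest broker
        dst = min(usage, key=usage.get)   # emptiest broker
        gap = usage[src] - usage[dst]
        # Any partition smaller than the gap narrows it; take the largest such.
        candidates = {p: s for p, s in broker_partitions[src].items() if s < gap}
        if not candidates:
            break                          # no move can improve the balance
        part = max(candidates, key=candidates.get)
        size = broker_partitions[src].pop(part)
        broker_partitions[dst][part] = size
        moves.append((part, src, dst))
    return moves
```

Moving a partition of size s across a usage gap g changes the pairwise imbalance to |g − 2s|, which is smaller than g whenever 0 < s < g; that inequality is the loop's stopping condition.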

System‑Layer Optimizations – Added RAID‑0 cards for faster sequential writes, employed cgroup and NUMA‑aware CPU isolation to eliminate interference from CPU‑intensive workloads, and designed a hybrid SSD‑HDD caching architecture that stores recent segments on SSD while syncing older data to HDD, with careful eviction and write‑rate limiting.
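The write-rate limiting mentioned for the SSD-HDD sync can be illustrated with a standard token bucket: flushing an aged segment to HDD consumes tokens, capping background write bandwidth so it cannot starve foreground I/O. This is a generic sketch of the technique, not Meituan's code; the class and parameter names are assumptions.

```python
import time

class TokenBucket:
    """Token-bucket limiter: background SSD-to-HDD flushes consume tokens,
    capping their write bandwidth."""
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def throttle(self, nbytes):
        """Block until nbytes of write budget is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

def flush_segments(segments, limiter, write_fn):
    """Copy aged segments to HDD under the limiter's bandwidth cap."""
    for name, data in segments:
        limiter.throttle(len(data))
        write_fn(name, data)
```

The burst size bounds how much the flusher can write in one go after an idle period; the steady-state rate is what actually protects foreground latency.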

Large‑Scale Cluster Management – Adopted business, role, and priority isolation (separate clusters per business, distinct broker/controller nodes, VIP clusters), full‑link monitoring to trace latency across request stages, lifecycle management to synchronize service and machine states, and TOR disaster‑recovery to ensure rack‑level fault tolerance.
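The essence of full-link monitoring is attributing end-to-end latency to individual request stages. As a minimal sketch (the stage names below are illustrative, not Meituan's actual trace schema), given the timestamp at which a request entered each stage, per-stage durations fall out of adjacent differences:

```python
def stage_latencies(timestamps):
    """Break a request's end-to-end latency into per-stage durations.

    timestamps: ordered list of (stage_name, epoch_ms) pairs recording when
    the request entered each stage.
    Returns ({stage: duration_ms}, total_ms), exposing the slowest stage.
    """
    stages = {}
    for (name, t0), (_, t1) in zip(timestamps, timestamps[1:]):
        stages[name] = t1 - t0
    total = timestamps[-1][1] - timestamps[0][1]
    return stages, total

# Hypothetical trace of one produce request through a broker:
trace = [("network-receive", 0), ("request-queue", 2), ("local-io", 5),
         ("response-queue", 45), ("network-send", 47)]
```

In this made-up trace, `local-io` accounts for 40 of the 47 ms total, which is exactly the kind of attribution that lets operators distinguish a slow disk from a saturated request queue.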

Future Outlook – Plans include proactive client‑side fault avoidance, multi‑queue request isolation, quorum‑write enhancements, and exploring unified batch‑stream storage such as Kafka on HDFS.

The presentation concludes with acknowledgments and a call for audience interaction.

Tags: monitoring, Kafka, cluster management, Meituan, large-scale data, read/write latency, SSD caching
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
