Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters
This article describes how Meituan's data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment: it details current bottlenecks, latency‑reduction techniques at the application and system layers, large‑scale cluster‑management strategies, and future directions for robustness and cloud‑native migration.
Kafka serves as the unified data cache and distribution layer in Meituan’s data platform; with over 15,000 machines and daily traffic exceeding 30 PB, the cluster faces severe scalability and latency challenges.
Current State and Challenges fall into two categories: slow nodes (brokers whose TP99 read/write latency exceeds 300 ms), caused by load imbalance, insufficient PageCache, and defects in the consumer threading model; and the operational complexity of managing a massive multi‑tenant cluster.
Read/Write Latency Optimizations are divided into application‑layer and system‑layer improvements. Application‑layer measures address disk hotspot balancing, migration pipeline acceleration, migration cancellation, fetcher isolation, and consumer asynchronous processing. System‑layer measures introduce RAID‑card acceleration, cgroup resource isolation, and a hybrid SSD‑HDD cache architecture that separates hot data onto SSD while preventing PageCache pollution.
Application‑Layer Details include a three‑step partition‑migration plan (generate, submit, verify), pipeline‑accelerated migration to avoid blocking, cancelable migrations to stop long‑tail transfers, fetcher isolation to keep ISR and non‑ISR followers separate, and an async fetch thread that pre‑pulls data to keep response times low.
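The async fetch idea above can be sketched as a background thread that pre-pulls batches into a bounded buffer, so the serving path hands out already-fetched data instead of blocking on a slow read. This is an illustrative stand-in, not Kafka's actual consumer code; `fetch_fn`, `AsyncFetcher`, and the buffer size are hypothetical names and parameters.

```python
import queue
import threading

class AsyncFetcher:
    """Background thread pre-pulls batches into a bounded queue so the
    serving thread never waits on a slow upstream fetch (illustrative sketch)."""

    def __init__(self, fetch_fn, max_buffered=8):
        self._fetch_fn = fetch_fn            # pulls the next batch; may be slow
        self._buffer = queue.Queue(maxsize=max_buffered)
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stopped.is_set():
            batch = self._fetch_fn()         # slow network / disk read happens here
            if batch is None:                # upstream exhausted
                break
            self._buffer.put(batch)          # blocks when full -> natural back-pressure

    def poll(self, timeout=1.0):
        """Fast path: return an already-fetched batch, or None on timeout."""
        try:
            return self._buffer.get(timeout=timeout)
        except queue.Empty:
            return None

    def stop(self):
        self._stopped.set()

# Example: the serving thread polls cheap, pre-buffered batches.
batches = iter([[1, 2], [3, 4], None])
fetcher = AsyncFetcher(lambda: next(batches))
fetcher.start()
first = fetcher.poll()
second = fetcher.poll()
fetcher.stop()
```

The bounded queue is the key design point: it decouples fetch latency from response latency while still capping memory use.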
System‑Layer Details cover using RAID cards to improve random‑write performance on HDDs, cgroup isolation to avoid CPU‑cache and NUMA contention between Kafka and CPU‑intensive workloads, and a new SSD‑based cache that stores recent segments on SSD, synchronizes them to HDD, and evicts old data without polluting the operating‑system PageCache.
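The SSD‑HDD split can be illustrated with a tiny in-memory model: new segments land on the SSD tier, a sync step copies them to HDD, and the oldest SSD copies are evicted once the SSD budget is exceeded, while cold reads are served from HDD without promoting data back into the hot tier. This is a hypothetical sketch of the described layout, with dictionaries standing in for real disks.

```python
from collections import OrderedDict

class TieredSegmentStore:
    """Minimal model of the hybrid SSD/HDD layout described in the article.
    In-memory stand-in: self.ssd / self.hdd are dicts, not real devices."""

    def __init__(self, ssd_capacity_segments=3):
        self.ssd = OrderedDict()   # segment_id -> data; insertion order = age
        self.hdd = {}
        self.capacity = ssd_capacity_segments

    def append_segment(self, seg_id, data):
        self.ssd[seg_id] = data              # hot writes go to SSD first
        self._evict_if_needed()

    def sync_to_hdd(self):
        for seg_id, data in self.ssd.items():
            self.hdd.setdefault(seg_id, data)   # durable copy on HDD

    def _evict_if_needed(self):
        while len(self.ssd) > self.capacity:
            seg_id, data = self.ssd.popitem(last=False)  # evict oldest first
            self.hdd.setdefault(seg_id, data)            # never drop an un-synced segment

    def read(self, seg_id):
        # Hot reads hit SSD; lagging consumers fall back to HDD and do NOT
        # promote data back, so cold reads cannot pollute the hot tier.
        if seg_id in self.ssd:
            return self.ssd[seg_id], "ssd"
        return self.hdd[seg_id], "hdd"

# Example: with room for 2 SSD segments, the oldest segment ages out to HDD.
store = TieredSegmentStore(ssd_capacity_segments=2)
store.append_segment(1, "a")
store.append_segment(2, "b")
store.append_segment(3, "c")
```

The no-promotion rule on cold reads is what keeps a lagging consumer from evicting data that real-time consumers still need.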
Large‑Scale Cluster Management Optimizations involve isolation strategies (business‑level clusters, role separation, priority‑based VIP clusters), full‑link monitoring that captures end‑to‑end request latency across broker, processor, request handler, delayed purgatory, and response stages, and a service lifecycle management system that ties service and machine states together.
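The value of full-link monitoring is that a single end-to-end number (e.g. TP99 &gt; 300 ms) can be decomposed into per-stage durations, immediately pointing at the dominant stage. A minimal sketch of that decomposition, using the article's stage names with hypothetical timestamps:

```python
def stage_durations(checkpoints):
    """Given ordered (stage_name, timestamp_ms) checkpoints for one request,
    return per-stage durations and the stage that dominated total latency.
    Stage names mirror the article's pipeline; timestamps are made up."""
    durations = {}
    for (_, prev_ts), (name, ts) in zip(checkpoints, checkpoints[1:]):
        durations[name] = ts - prev_ts
    worst = max(durations, key=durations.get)
    return durations, worst

# Example: a 310 ms request where nearly all the time sits in the purgatory stage.
checkpoints = [
    ("received", 0),
    ("processor", 2),
    ("request-handler", 5),
    ("delayed-purgatory", 305),
    ("response", 310),
]
durations, worst = stage_durations(checkpoints)
```

Aggregating `durations` across many requests (per stage, per broker) is what turns a vague "this broker is slow" into "this broker's purgatory wait dominates".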
TOR Disaster Recovery ensures replicas of a partition are placed on different racks, guaranteeing availability even if an entire rack fails.
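The rack-aware placement above can be sketched as a round-robin assignment that draws each partition's replicas from distinct racks, rotating the starting rack so leaders spread evenly. This is a simplified illustration of the constraint, not Kafka's actual assignment algorithm; the function and parameter names are hypothetical.

```python
from itertools import cycle

def assign_replicas(partitions, brokers_by_rack, replication_factor):
    """Place each partition's replicas on distinct racks so that losing one
    top-of-rack switch leaves every partition available (simplified sketch)."""
    racks = sorted(brokers_by_rack)
    if replication_factor > len(racks):
        raise ValueError("need at least as many racks as replicas")
    rack_iters = {r: cycle(brokers_by_rack[r]) for r in racks}
    assignment = {}
    for p in range(partitions):
        # Rotate the starting rack per partition so leaders spread across racks.
        chosen = [racks[(p + i) % len(racks)] for i in range(replication_factor)]
        assignment[p] = [next(rack_iters[r]) for r in chosen]
    return assignment

# Example: 4 partitions, 2 replicas each, across 3 racks of 2 brokers.
brokers = {"rackA": [1, 2], "rackB": [3, 4], "rackC": [5, 6]}
assignment = assign_replicas(4, brokers, replication_factor=2)
rack_of = {b: r for r, bs in brokers.items() for b in bs}
```

Because `(p + i) % len(racks)` yields distinct rack indices whenever the replication factor does not exceed the rack count, no partition ever has two replicas behind the same switch.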
Future Outlook focuses on improving robustness through finer‑grained isolation, client‑side fault avoidance, hot‑swap server upgrades, network back‑pressure, and exploring cloud‑native deployments of the streaming storage service.