
Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform

This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.


Introduction

Kafka serves as the unified data cache and distribution component in Meituan's data platform. With rapid data growth and expanding cluster size, Kafka faces increasingly severe challenges.

1. Current Status and Challenges

Meituan's Kafka cluster exceeds 15,000 machines, with single clusters reaching over 2,000 nodes. Daily message volume surpasses 30 PB, peaking at over 400 million messages per second. The main challenges are slow nodes affecting read/write latency and the complexity of managing a massive cluster.

2. Read/Write Latency Optimizations

2.1 Overview

Latency issues are addressed at both the application and system layers. Solutions include pipeline acceleration, fetcher isolation, migration cancellation, and consumer asynchronous processing.

2.2 Application Layer

Disk imbalance: a Rebalancer component generates and submits rebalancing plans that evenly redistribute partitions across disks.
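The core of such a plan can be sketched as a greedy bin-packing pass: place each partition, largest first, on the currently least-loaded disk. This is a minimal illustration of the idea, not Meituan's Rebalancer; the function name and input shapes are assumptions.

```python
def plan_rebalance(partition_sizes, disks):
    """Greedy sketch: assign each partition (largest first) to the
    currently least-loaded disk, yielding a roughly balanced placement.

    partition_sizes: dict of partition name -> size (e.g. GB)
    disks: list of disk identifiers
    Returns (plan, load): partition -> disk, and disk -> total size.
    """
    load = {d: 0 for d in disks}
    plan = {}
    # Largest-first placement keeps the final imbalance small.
    for partition, size in sorted(partition_sizes.items(),
                                  key=lambda kv: kv[1], reverse=True):
        target = min(load, key=load.get)   # least-loaded disk so far
        plan[partition] = target
        load[target] += size
    return plan, load
```

A real rebalancer would additionally weigh migration cost (moving a partition is itself expensive I/O), which is why the article pairs this with migration acceleration and cancellation.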

Migration optimization: pipeline acceleration, migration cancellation, and fetcher isolation to prevent migration from blocking real‑time reads.

Consumer asynchronous model: introduce an async fetch thread to pull data continuously, avoiding the single‑thread bottleneck of the native Kafka consumer.
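The asynchronous model can be sketched as a dedicated fetch thread feeding a bounded queue, so that slow message processing no longer stalls fetching. This is an illustrative sketch, not Meituan's consumer; `fetch_fn` stands in for a real broker fetch and is an assumption.

```python
import queue
import threading

class AsyncFetchConsumer:
    """Sketch: a background thread keeps pulling batches into a bounded
    buffer; the processing thread drains it via poll(). The bounded queue
    provides backpressure when processing falls behind."""

    def __init__(self, fetch_fn, max_buffered=100):
        self._fetch_fn = fetch_fn
        self._buffer = queue.Queue(maxsize=max_buffered)
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._fetch_loop, daemon=True)

    def start(self):
        self._thread.start()

    def _fetch_loop(self):
        while not self._stopped.is_set():
            batch = self._fetch_fn()
            if batch is None:            # fetch source exhausted
                self._stopped.set()
                break
            self._buffer.put(batch)      # blocks when buffer is full

    def poll(self, timeout=1.0):
        """Return the next buffered batch, or None on timeout."""
        try:
            return self._buffer.get(timeout=timeout)
        except queue.Empty:
            return None
```

In the native single-threaded consumer, fetch and processing alternate on one thread; decoupling them as above lets fetching proceed while earlier batches are still being processed.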

2.3 System Layer

PageCache pollution mitigation.

HDD random‑read performance improvement using RAID cards.

Cgroup isolation to avoid resource contention between I/O‑intensive Kafka and CPU‑intensive Flink/Storm workloads.

SSD‑based cache architecture to separate hot data on SSD from cold data on HDD, preventing cache pollution.
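The hot/cold split can be pictured as a bounded LRU standing in for the SSD tier, with misses falling through to the HDD and promoting the segment on read. This is a toy sketch under assumed interfaces (`read_hdd` is a placeholder loader), not the production cache design.

```python
from collections import OrderedDict

class TieredReadCache:
    """Sketch of an SSD/HDD read path: recently read segments live in a
    bounded LRU (the 'SSD' tier); cold reads go to HDD and are promoted."""

    def __init__(self, read_hdd, ssd_capacity=4):
        self._read_hdd = read_hdd
        self._ssd = OrderedDict()            # segment_id -> data, LRU order
        self._capacity = ssd_capacity

    def read(self, segment_id):
        if segment_id in self._ssd:          # SSD hit: refresh recency
            self._ssd.move_to_end(segment_id)
            return self._ssd[segment_id]
        data = self._read_hdd(segment_id)    # cold read from HDD
        self._ssd[segment_id] = data         # promote into SSD tier
        if len(self._ssd) > self._capacity:
            self._ssd.popitem(last=False)    # evict least-recently-used
        return data
```

Serving hot tail reads from the SSD tier keeps large cold catch-up reads from churning the OS PageCache, which is the pollution problem the section describes.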

3. Large‑Scale Cluster Management Optimizations

3.1 Isolation Strategy

Business isolation: separate Kafka clusters per major business (e.g., delivery, in‑store, selection).

Role isolation: deploy brokers, controllers, and Zookeeper on distinct machines.

Priority isolation: place high‑availability topics on dedicated VIP clusters.

3.2 Full‑Link Monitoring

Collect and monitor metrics from all Kafka components. When a client request slows, the full‑link view pinpoints the bottleneck (e.g., RemoteTime, RequestQueueTime) and enables rapid fault detection.
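Kafka's request metrics already decompose total request time into stages (RequestQueueTime, LocalTime, RemoteTime, ThrottleTime, ResponseQueueTime, ResponseSendTime), so bottleneck attribution reduces to finding the dominant stage. A minimal sketch, assuming per-stage latencies have already been collected:

```python
def slowest_stage(stage_times_ms):
    """Given per-stage latencies for one request (stage names follow
    Kafka's request-time breakdown), return the bottleneck stage and
    its share of the total request time."""
    total = sum(stage_times_ms.values())
    stage = max(stage_times_ms, key=stage_times_ms.get)
    share = stage_times_ms[stage] / total if total else 0.0
    return stage, share
```

For example, a fetch request dominated by RemoteTime points at slow follower replication or a purgatory wait, whereas a large RequestQueueTime points at an overloaded request handler pool.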

4. Service Lifecycle Management

Introduce a lifecycle management mechanism that ties service state to machine state, automating status changes and preventing manual errors.

5. TOR Disaster Recovery

Ensure replicas of the same partition reside on different racks, guaranteeing availability even if an entire rack fails.
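Kafka supports this via the `broker.rack` configuration and rack-aware replica assignment; the safety property itself is easy to audit. A small sketch with illustrative input shapes (partition -> broker list, broker -> rack):

```python
def violates_rack_safety(replica_assignment, broker_rack):
    """Return the partitions whose replicas all sit in one rack, i.e.
    partitions that a single TOR/rack failure could take fully offline."""
    unsafe = []
    for partition, replicas in replica_assignment.items():
        racks = {broker_rack[b] for b in replicas}
        if len(replicas) > 1 and len(racks) < 2:
            unsafe.append(partition)
    return unsafe
```

Running such a check continuously catches assignments that drift back into a single rack after migrations or broker replacements.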

Future Outlook

Future work focuses on improving robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap capabilities, and cloud‑native deployment of Kafka.

Tags: distributed systems · performance optimization · Big Data · Kafka · Meituan
Written by Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.