Kafka Core Concepts, Architecture, Performance Tuning, and Cluster Capacity Planning
This article provides a comprehensive overview of Kafka. It covers the core value of message queues for decoupling and asynchronous processing; fundamental concepts such as producers, consumers, topics, partitions, and replication; high‑performance mechanisms like zero‑copy and the OS page cache; detailed resource estimation for CPU, memory, disk, and network; operational tools; consumer‑group rebalance strategies; LEO/HW offsets; controller management; and delayed‑task scheduling.
The article begins by explaining the core value of a message queue: decoupling services, handling asynchronous workloads (e.g., flash‑sale scenarios), and controlling traffic by queuing requests before processing.
Kafka Core Concepts are introduced, describing producers, consumers, topics, partitions, replication factors, leader/follower roles, and the in‑sync replica (ISR) list.
It details the storage layout: each partition corresponds to a directory on a broker, log files (.log) are written sequentially (append‑only), and default log segment size is 1 GB.
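The segment size and log directory described above map directly onto broker configuration. A minimal fragment (values shown are Kafka's actual defaults for segment size; the log directory path is illustrative):

```properties
# Roll a new .log segment once the current one reaches 1 GiB (the default).
log.segment.bytes=1073741824
# Each partition becomes a subdirectory here, e.g. my-topic-0, my-topic-1.
log.dirs=/data/kafka-logs
```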
High‑Performance Mechanisms include zero‑copy data transfer via the Linux sendfile system call, OS page‑cache usage, and sequential disk writes, which together enable high throughput and low latency.
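The zero‑copy path can be demonstrated from Python, where `os.sendfile` exposes the same Linux system call Kafka's broker uses: the file's bytes go to the socket inside the kernel, never passing through user space. A minimal sketch (Linux‑specific; the file stands in for a Kafka .log segment, the socket pair for a broker‑to‑consumer connection):

```python
import os
import socket
import tempfile

# Write some bytes to a temp file standing in for a Kafka .log segment.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"kafka-segment-bytes")
    path = f.name

# A connected socket pair stands in for the broker-to-consumer connection.
server, client = socket.socketpair()

with open(path, "rb") as segment:
    # sendfile copies file bytes to the socket entirely in the kernel:
    # no read() into a user-space buffer, no write() back out.
    sent = os.sendfile(server.fileno(), segment.fileno(), 0, 19)

received = client.recv(64)
print(received)  # the same bytes, delivered without a user-space copy

server.close()
client.close()
os.unlink(path)
```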
Resource evaluation for a large e‑commerce scenario (1 billion requests per day) is performed, estimating peak QPS (~55,000), total data volume (~276 TB), and required hardware: 5 physical machines, each with 11 × 7 TB SAS disks (≈77 TB), ~64 GB RAM (10 GB for the JVM, the rest for OS cache), and 16–32 CPU cores.
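The arithmetic behind those estimates can be reproduced with a few assumptions that the summary does not spell out but that make the numbers land where the article does: roughly 60% of the daily 1 billion requests arrive in a 3‑hour peak window, each message is about 50 KB, the replication factor is 2, and logs are retained for 3 days.

```python
# Back-of-the-envelope check of the capacity figures.
# Assumptions (mine, not stated in the summary): 60% of traffic in a
# 3-hour peak, ~50 KB per message, replication factor 2, 3-day retention.
DAILY_REQUESTS = 1_000_000_000
PEAK_SHARE, PEAK_HOURS = 0.6, 3
MSG_KB = 50
REPLICATION, RETENTION_DAYS = 2, 3

peak_qps = DAILY_REQUESTS * PEAK_SHARE / (PEAK_HOURS * 3600)
daily_tb = DAILY_REQUESTS * MSG_KB * 1024 / 1024**4   # bytes -> TiB
total_tb = daily_tb * REPLICATION * RETENTION_DAYS

print(f"peak QPS      ~ {peak_qps:,.0f}")     # ~55,556, the article's ~55k
print(f"daily volume  ~ {daily_tb:.0f} TB")
print(f"cluster total ~ {total_tb:.0f} TB")   # close to the article's ~276 TB
```

Spread over 5 machines this is ~55 TB of data per node, which is why 77 TB of raw disk per machine leaves comfortable headroom.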
Disk type selection is discussed, concluding that ordinary mechanical disks are sufficient because Kafka writes sequentially.
Memory sizing considers OS cache and JVM needs; keeping the latest 25 % of each log segment in memory leads to an estimate of ~60 GB RAM per node.
CPU requirements are derived from the number of internal threads (100+), recommending at least 16 cores per broker.
Network bandwidth calculations suggest using 10 Gbps NICs to comfortably handle peak traffic (≈1 GB/s per node).
The article outlines cluster planning steps: determining the number of physical machines, disk count, memory, CPU, and network resources.
Operational Tools such as KafkaManager and Kafka‑Offset‑Monitor are introduced, with example commands using kafka-topics.sh and kafka-reassign-partitions.sh, plus JSON files describing partition reassignments.
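The command‑line workflow looks roughly as follows (the broker address, topic name, and replica placement are illustrative; modern Kafka versions take --bootstrap-server, while older ones used --zookeeper):

```shell
# Create a topic with 5 partitions and replication factor 2.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic order-events --partitions 5 --replication-factor 2

# reassign.json describes the target replica placement per partition.
cat > reassign.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"order-events","partition":0,"replicas":[1,2]},
  {"topic":"order-events","partition":1,"replicas":[2,3]}
]}
EOF

# Kick off the reassignment, then check on it until it completes.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --verify
```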
Consumer‑group concepts are explained, including group IDs, partition assignment, broadcast behavior, and automatic rebalancing when a consumer fails.
Rebalance strategies (range, round‑robin, sticky) are compared, showing how partitions are redistributed among consumers.
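The difference between range and round‑robin assignment is easiest to see on a concrete case. A toy sketch for one topic with 7 partitions and 3 consumers (function and consumer names are illustrative, not Kafka's internal API):

```python
def range_assign(partitions, consumers):
    """Range: split the sorted partition list into contiguous chunks."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, c in enumerate(consumers):
        take = per + (1 if i < extra else 0)  # first `extra` consumers get one more
        out[c] = partitions[start:start + take]
        start += take
    return out

def round_robin_assign(partitions, consumers):
    """Round-robin: deal partitions out one at a time, like cards."""
    consumers = sorted(consumers)
    out = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        out[consumers[i % len(consumers)]].append(p)
    return out

parts = list(range(7))
print(range_assign(parts, ["c1", "c2", "c3"]))
# {'c1': [0, 1, 2], 'c2': [3, 4], 'c3': [5, 6]}
print(round_robin_assign(parts, ["c1", "c2", "c3"]))
# {'c1': [0, 3, 6], 'c2': [1, 4], 'c3': [2, 5]}
```

Sticky assignment (not sketched here) starts from a round‑robin‑like spread but, on rebalance, moves as few partitions as possible from their current owners.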
The article defines offset management: older ZK‑based storage versus the newer __consumer_offsets topic, and how offsets are compacted.
Key consumer configuration parameters such as fetch.max.bytes, max.poll.records, enable.auto.commit, and auto.offset.reset are described.
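For reference, a consumer configuration fragment touching each of these parameters (the values are examples, not recommendations; the first two happen to match Kafka's defaults):

```properties
fetch.max.bytes=52428800    # max bytes returned per fetch request (default 50 MB)
max.poll.records=500        # max records handed back by one poll() call (default 500)
enable.auto.commit=false    # commit offsets manually after processing instead
auto.offset.reset=earliest  # where to start when no committed offset exists
```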
LEO (Log End Offset) and HW (High Watermark) are explained: LEO tracks the latest offset per replica, while HW marks the highest offset replicated to all in‑sync replicas, making data visible to consumers.
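The LEO/HW relationship reduces to one line of arithmetic: the partition's HW is the minimum LEO across the in‑sync replicas. A minimal sketch with made‑up replica names and offsets:

```python
# Each replica's LEO is the offset of the *next* message it will append.
leo = {"leader": 105, "follower-1": 103, "follower-2": 101}
isr = ["leader", "follower-1", "follower-2"]

# The high watermark is the minimum LEO across the in-sync replicas;
# consumers can only read offsets below it.
high_watermark = min(leo[r] for r in isr)
visible_offsets = range(high_watermark)  # offsets 0 .. HW-1

print(high_watermark)  # 101
```

If follower-2 dropped out of the ISR, the HW would advance to 103, which is why a shrinking ISR makes more data visible but weakens the durability guarantee behind it.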
The controller’s role in managing the cluster (broker registration, topic creation, partition reassignment) is illustrated.
Finally, the article discusses Kafka’s delayed‑task mechanism, implemented via a time‑wheel (O(1) insertion/removal) to handle producer acknowledgments, follower fetch delays, and other timeout‑based operations.
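The O(1) insertion claim comes from hashing each task's expiry tick into a fixed array of buckets. Kafka's real scheduler is a hierarchical timing wheel implemented in Scala; the single‑level sketch below only illustrates the core idea, and all names are mine:

```python
import collections

class TimeWheel:
    """Single-level time wheel: O(1) add, buckets drained as time advances."""

    def __init__(self, tick_ms=1, wheel_size=20, start_ms=0):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.current_ms = start_ms
        self.buckets = [collections.deque() for _ in range(wheel_size)]

    def add(self, delay_ms, task):
        # O(1): hash the expiry tick into its bucket.
        expiry = self.current_ms + delay_ms
        slot = (expiry // self.tick_ms) % self.wheel_size
        self.buckets[slot].append((expiry, task))

    def advance(self, now_ms):
        # Tick forward, firing tasks whose expiry has been reached.
        fired = []
        while self.current_ms < now_ms:
            self.current_ms += self.tick_ms
            bucket = self.buckets[(self.current_ms // self.tick_ms) % self.wheel_size]
            for _ in range(len(bucket)):
                expiry, task = bucket.popleft()
                if expiry <= self.current_ms:
                    fired.append(task)
                else:
                    # Expiry is a full wheel turn ahead; put it back for later.
                    bucket.append((expiry, task))
        return fired

wheel = TimeWheel()
wheel.add(3, "ack-timeout")   # e.g. a delayed producer-acknowledgment check
wheel.add(7, "fetch-delay")   # e.g. a delayed follower fetch
print(wheel.advance(5))   # ['ack-timeout']
print(wheel.advance(10))  # ['fetch-delay']
```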
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.