Master Kafka Basics: Topics, Partitions, Producers, and Cluster Architecture
This article explains Kafka's role as a messaging system, covering core concepts such as topics, partitions, producers, consumers, messages, cluster architecture, replicas, consumer groups, controller coordination with Zookeeper, and performance optimizations like sequential writes and zero‑copy networking.
Kafka Basics
Kafka is a distributed messaging system that acts as a buffer and decouples producers from consumers, storing data on disk rather than in memory.
Message System Role
It functions like a warehouse, providing caching and decoupling capabilities for large‑scale log processing scenarios.
1. Topic
A topic is analogous to a table in a relational database; each topic holds a stream of messages.
To consume data from a specific source, you simply listen to the corresponding topic (e.g., TopicA for China Mobile).
2. Partition
Each topic is divided into multiple partitions, which are stored as directories on different brokers. Partitions improve performance by allowing parallel processing across multiple threads.
Partitions are similar to HBase regions: the topic is a logical concept, while partitions are the physical storage units distributed across servers.
A partition stored on only one broker is a single point of failure, so partitions are configured with replicas.
Partition numbering starts at 0.
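How a producer chooses a partition can be sketched as a hash of the message key modulo the partition count, which keeps all messages for the same key on the same partition. Kafka's real default partitioner murmur2-hashes the serialized key bytes; the `String.hashCode`-based version below is a simplified illustration, not Kafka's actual algorithm.

```java
public class PartitionerSketch {
    // Simplified stand-in for Kafka's default partitioner. Kafka actually
    // murmur2-hashes the serialized key; String.hashCode() is used here
    // purely for illustration.
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, which is what
        // preserves per-key ordering.
        System.out.println(partitionFor("user-42", 3) == partitionFor("user-42", 3));
    }
}
```

Because assignment depends only on the key and the partition count, adding partitions later reshuffles keys, which is why partition counts are usually chosen up front.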
3. Producer
Producers send messages to Kafka.
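A minimal producer configuration might look like the fragment below; the broker address is an illustrative assumption, while the property names are standard Kafka producer settings.

```properties
# Illustrative producer settings; localhost:9092 is an assumed broker address.
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# acks=all waits for all in-sync replicas to acknowledge, trading latency for durability.
acks=all
```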
4. Consumer
Consumers read messages from Kafka.
5. Message
The data processed within Kafka is called a message.
Kafka Cluster Architecture
A topic can have multiple partitions distributed across different brokers. Early Kafka versions (<0.8) lacked replication, leading to data loss on broker failures.
Replica
Each partition can have multiple replicas for fault tolerance. One replica acts as the leader, while others are followers that synchronize from the leader.
Consumer Group
Consumers belong to a consumer group identified by group.id. Within a group, each partition is consumed by only one consumer, preventing duplicate processing.
<code>conf.setProperty("group.id", "tellYourDream")</code>
Different groups can consume the same topic independently.
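The one-consumer-per-partition rule within a group can be sketched as a simple round-robin spread of partitions over the group's members. This is a simplification: Kafka's real assignment strategies are pluggable (range, round-robin, sticky), and the consumer names here are illustrative.

```java
import java.util.*;

public class RoundRobinAssignmentSketch {
    // Spread partitions over consumers round-robin; each partition is owned
    // by exactly one consumer in the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        consumers.forEach(c -> out.put(c, new ArrayList<>()));
        for (int p = 0; p < numPartitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two consumers in the same group share six partitions; no partition
        // is consumed twice within the group.
        System.out.println(assign(Arrays.asList("consumerA", "consumerB"), 6));
    }
}
```

If the group has more consumers than partitions, the extras sit idle, which is why partition count bounds a group's parallelism.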
<code>consumerA:
group.id = a
consumerB:
group.id = a
consumerC:
group.id = b
consumerD:
group.id = b</code>
Controller
The controller is the master node that coordinates the cluster together with Zookeeper.
Kafka and Zookeeper Coordination
All brokers register themselves in Zookeeper at startup, which elects a controller. The controller watches Zookeeper directories (e.g., /brokers/) to track broker registrations and manage metadata.
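The relevant Zookeeper layout looks roughly like the tree below (the broker id and topic name are illustrative; the znode paths follow Kafka's conventions):

```
/controller             # ephemeral znode holding the current controller's broker id
/brokers/ids/0          # one ephemeral znode per live broker
/brokers/topics/TopicA  # partition and replica assignment for TopicA
```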
Performance Highlights
Sequential Writes
Kafka writes data sequentially to disk, achieving near‑memory speeds because disk seeks are minimized.
Zero‑Copy
Kafka uses the Linux sendfile system call to transfer data directly from disk to the network socket, eliminating extra memory copies and context switches.
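In Java, sendfile is exposed through FileChannel.transferTo, which Kafka's broker uses when serving fetch requests. The sketch below transfers bytes channel-to-channel without pulling them through user-space buffers; it copies to a temporary file rather than a real network socket, purely for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("segment", ".log");
        Path dst = Files.createTempFile("out", ".bin");
        Files.write(src, "hello kafka".getBytes(StandardCharsets.UTF_8));

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8));
    }
}
```

With a real SocketChannel as the target on Linux, the JVM can delegate the transfer to sendfile, so the file bytes never enter the application's heap.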
Log Segmentation
Each partition’s log is split into segments, with each segment file capped at 1 GB by default, so individual files stay small enough to load and search efficiently.
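Segment files are named after the base offset of their first message, zero-padded to 20 digits, so finding the segment that holds a given offset is a floor lookup over the base offsets. A sketch using the base offsets from the listing below:

```java
import java.util.TreeMap;

public class SegmentLookup {
    public static void main(String[] args) {
        // Base offsets taken from the segment files listed in the article.
        TreeMap<Long, String> segments = new TreeMap<>();
        for (long base : new long[]{0L, 5367851L, 9936472L}) {
            // Segment file names are the base offset, zero-padded to 20 digits.
            segments.put(base, String.format("%020d.log", base));
        }
        // Offset 7000000 lives in the segment whose base offset is the
        // greatest one less than or equal to it.
        System.out.println(segments.floorEntry(7000000L).getValue());
    }
}
```

The paired .index files then map relative offsets to byte positions inside that segment, so a read never scans the whole log.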
<code>00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex
00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex
00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex</code>
Network Design
Clients connect to an Acceptor thread, which hands new connections to a pool of Processor threads. Processors perform the network reads and writes, and a separate request-handler thread pool executes the requests and queues responses, forming a three-layer reactor model.
Conclusion
This article introduced Kafka’s core concepts, roles, and design considerations. Future updates will cover cluster deployment and deeper performance tuning.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you through your operations career.