Mastering Kafka: From Basics to Advanced Operations and Performance Tuning
This article provides a comprehensive overview of Apache Kafka, covering its architecture, core concepts such as topics, partitions, and replicas, common operational commands, and practical performance‑tuning tips for high‑throughput, low‑latency streaming workloads.
1. Kafka Overview
Kafka was originally developed by LinkedIn in Scala as a multi‑partition, multi‑replica distributed messaging system coordinated by ZooKeeper, and later donated to the Apache Foundation. It is now positioned as a distributed streaming platform written in Scala and Java, offering high throughput, persistence, horizontal scalability, and stream processing capabilities.
Kafka works as a publish‑subscribe system: producers write messages to a specific topic, and consumers subscribe to that topic to receive messages with low latency and high throughput. Within a consumer group, each partition can be consumed by only one consumer at a time.
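The "one partition per consumer within a group" rule can be sketched as a simple assignment function (an illustrative model with hypothetical names, not the Kafka client API; the real client supports range, round-robin, and sticky assignors):

```python
# Sketch: assign partitions to consumers in a group, round-robin style.
# Within one group, each partition goes to exactly one consumer; a single
# consumer may own several partitions.
def assign_partitions(partitions, consumers):
    """Map each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions spread across 2 consumers of the same group:
result = assign_partitions(["p0", "p1", "p2", "p3"], ["c0", "c1"])
# -> {"c0": ["p0", "p2"], "c1": ["p1", "p3"]}
```

If there are more consumers than partitions, the extras sit idle, which is why partition count caps the useful parallelism of a group.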
Compared with traditional AMQ-style brokers, Kafka is lightweight: it has few dependencies, low resource consumption, and a simple deployment model, and it is easy to use.
Many open‑source distributed processing systems such as Cloudera, Storm, Spark, and Flink integrate with Kafka. Kafka plays three major roles:
Message System : Provides decoupling, redundancy, traffic shaping, buffering, asynchronous communication, scalability, and recoverability, plus strong ordering guarantees and replay capability.
Storage System : Persists messages to disk, reducing data‑loss risk; with replication, Kafka can serve as a long‑term storage system.
Streaming Platform : Supplies reliable data sources for streaming frameworks and offers a rich library of window, join, transform, and aggregation operations.
2. Problems Solved by Kafka
Kafka addresses typical message‑queue concerns such as asynchronous processing, service decoupling, and traffic control.
3. Technical Features
High Throughput & Low Latency : Handles hundreds of thousands of messages per second with millisecond‑level latency; topics can be split into many partitions, and consumer groups consume partitions in parallel.
Scalability : Supports hot‑scale‑out of clusters.
Durability & Reliability : Messages are persisted to disk and replicated; they are not deleted immediately after consumption.
Fault Tolerance : Allows n‑1 node failures when the replication factor is n.
High Concurrency : Supports thousands of simultaneous clients.
Queue Mode : All consumers join the same consumer group, so each message is processed by exactly one of them; partitions are consumed in parallel.
Publish‑Subscribe Mode : Consumers belong to different consumer groups, so every group receives its own copy of each message published to the topic.
4. Kafka Working Principle
4.1 Architecture Diagram
Producer : Sends messages to Kafka.
Consumer : Receives messages from Kafka and processes them.
Consumer Group (CG) : A set of consumers where each consumer handles different partitions; groups operate independently.
Broker : A Kafka server instance; one or more brokers form a cluster.
Controller : One broker elected to manage partition and replica state, handling leader election and metadata updates.
When a leader replica fails, the controller elects a new leader.
When ISR (in‑sync replica) set changes, the controller notifies brokers.
When a topic’s partition count increases, the controller reassigns partitions.
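The controller's leader election can be sketched as "preferred replica" selection, a simplified model of the real logic (hypothetical function names):

```python
def elect_leader(replicas, isr, live_brokers):
    """Pick the first replica in the assigned replica list that is both
    in the ISR and on a live broker; None means the partition goes offline
    (unless unclean leader election is enabled, which risks data loss)."""
    for r in replicas:
        if r in isr and r in live_brokers:
            return r
    return None

# Leader 1001 fails: the controller falls back to the next in-sync replica.
leader = elect_leader([1001, 1002, 1003], isr={1002, 1003},
                      live_brokers={1002, 1003})
# -> 1002
```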
Topic : Logical queue; producers write to a topic, consumers read from it.
Partition : Ordered log within a topic; a topic can have multiple partitions to increase concurrency.
Each partition is an append‑only log file with a unique offset. Offsets guarantee ordering within a partition but not across partitions.
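The append-only log with per-partition offsets can be modeled in a few lines (an illustrative sketch, not broker code):

```python
class Partition:
    """Minimal model of a partition: an append-only log where each record
    receives the next sequential offset. Ordering holds within this
    partition only, not across partitions of the same topic."""
    def __init__(self):
        self.log = []

    def append(self, value):
        self.log.append(value)
        return len(self.log) - 1  # offset assigned to the new record

    def read(self, offset):
        return self.log[offset]

p = Partition()
p.append("a")  # offset 0
p.append("b")  # offset 1: offsets are strictly increasing
```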
4.2 Write Flow
Connect to ZooKeeper (Kafka 2.8+ can run without ZooKeeper in KRaft mode) to obtain partition and leader info.
Send the message to the appropriate broker.
Specify topic, value, and optionally partition and key.
Serialize the message.
If a partition is specified, the partitioner does nothing; otherwise it selects a partition based on the key.
Batch messages locally; a background thread sends batches to the broker.
Broker acknowledges the write, returning topic, partition, and offset information on success, or an error on failure.
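The partition-selection step above can be sketched as follows. This is a hedged illustration, not the real implementation: the Java client hashes keys with murmur2 (and uses sticky batching for keyless records), while md5 and a plain round-robin counter stand in here:

```python
import hashlib
import itertools

_round_robin = itertools.count()  # stand-in for keyless record rotation

def choose_partition(key, num_partitions, explicit=None):
    """Explicit partition wins; otherwise hash the key; otherwise rotate."""
    if explicit is not None:
        return explicit  # "the partitioner does nothing"
    if key is not None:
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return digest % num_partitions  # same key -> same partition
    return next(_round_robin) % num_partitions  # keyless: spread evenly

# The same key always lands on the same partition, preserving per-key order:
choose_partition("user-42", 4) == choose_partition("user-42", 4)  # True
```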
4.3 Read Flow
Connect to Zookeeper to discover partition and leader info.
Connect to the leader broker.
Consumer requests the desired topic, partition, and offset.
Leader locates the segment (index and log files) based on the offset.
Using the index, the leader reads the log at the offset and returns the data to the consumer.
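Locating the right segment for a requested offset is a search over segment base offsets, which can be sketched like this (illustrative only; the broker additionally consults the sparse `.index` file inside the chosen segment):

```python
import bisect

def find_segment(base_offsets, offset):
    """Return the base offset of the segment that contains `offset`: the
    largest base offset that is <= the requested offset. `base_offsets`
    must be sorted ascending, mirroring segment file names on disk."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset below the earliest retained segment")
    return base_offsets[i]

segments = [0, 100000, 200000]  # base offsets of three segment files
find_segment(segments, 150123)  # -> 100000
```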
5. Kafka Operations
5.1 Topic Management Commands
<code># Create a topic
kafka-topics.sh --create --zookeeper localhost:2181 \
--replication-factor 3 --partitions 3 --topic test
# Expand partitions
kafka-topics.sh --zookeeper localhost:2181 --alter \
--topic test --partitions 4
# Delete a topic
kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
# Describe a topic
kafka-topics.sh --topic event_topic --zookeeper localhost:2181 --describe</code>
5.2 Rebalancing After Adding/Removing Nodes
<code># Generate reassignment plan
./kafka-reassign-partitions.sh --zookeeper localhost:2181 \
--topics-to-move-json-file mv.json --broker-list "1001,1002" --generate
# Execute reassignment
./kafka-reassign-partitions.sh --zookeeper localhost:2181 \
--reassignment-json-file reassignment.json --execute</code>
5.3 Consumer Group Commands
<code># Show consumption status
./kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 \
--describe --group test-group
# Delete a consumer group
./kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 \
--delete --group test-group
# Reset offsets (earliest, latest, specific, etc.)
bin/kafka-consumer-groups.sh --bootstrap-server kafka-host:port \
--group test-group --reset-offsets --all-topics \
--to-earliest --execute</code>
5.4 Set Topic Retention Time
<code># Set retention to 1 hour (3600000 ms)
./bin/kafka-configs.sh --zookeeper 127.0.0.1:2181 \
--alter --entity-name my_topic --entity-type topics \
--add-config retention.ms=3600000
# View topic configuration
./bin/kafka-configs.sh --zookeeper 127.0.0.1:2181 \
--describe --entity-name my_topic --entity-type topics</code>
6. Common Performance Tuning
6.1 Disk Directory Optimization
Distribute partitions across different disks to avoid contention; placing many partitions on the same disk reduces sequential I/O performance.
6.2 JVM Parameter Configuration
Prefer the G1 garbage collector over CMS; use at least JDK 1.7u51. G1 offers better throughput‑latency balance, region‑based memory management, and configurable pause targets.
G1 is suitable for large heap sizes (≥4 GB) and applications with frequent allocation/deallocation.
It unifies young and old generation collection, reducing fragmentation.
Example JVM options (illustrated in the original diagram):
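The original diagram is not reproduced here; as a rough substitute, typical G1 settings for a Kafka broker look like the following. Treat the exact values as illustrative assumptions to tune per heap size and version; `KAFKA_HEAP_OPTS` and `KAFKA_JVM_PERFORMANCE_OPTS` are the environment variables read by Kafka's startup scripts:

```shell
# Illustrative G1 configuration for a broker with a 6 GB heap.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:MetaspaceSize=96m"
```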
6.3 Log Flush Strategy
log.flush.interval.messages = 100000: Flush to disk after every 100 000 messages.
log.flush.interval.ms = 1000: Flush every second.
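The two settings combine into an either/or trigger, which a short sketch makes explicit (hypothetical function, mirroring the semantics of the two configs above):

```python
def should_flush(msgs_since_flush, ms_since_flush,
                 interval_messages=100_000, interval_ms=1_000):
    """fsync when either the message-count threshold
    (log.flush.interval.messages) or the time threshold
    (log.flush.interval.ms) is reached, whichever comes first."""
    return (msgs_since_flush >= interval_messages
            or ms_since_flush >= interval_ms)

should_flush(100_000, 10)  # True: count threshold hit
should_flush(5, 1_000)     # True: time threshold hit
should_flush(5, 10)        # False: neither yet
```

Lower thresholds improve durability at the cost of throughput; Kafka's default is to leave flushing to the OS page cache and rely on replication for durability.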
6.4 Log Retention Time
Kafka retains data for 7 days by default (log.retention.hours = 168). Adjust this setting to prevent disk exhaustion when handling massive message volumes.
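Retention works at the granularity of closed log segments, which the following sketch illustrates (a simplified model with hypothetical names; real brokers compare against segment timestamps and also support size-based retention via log.retention.bytes):

```python
def expired_segments(segment_mtimes_ms, now_ms, retention_hours=168):
    """Return the segments whose age exceeds the retention window.
    log.retention.hours defaults to 168, i.e. 7 days."""
    retention_ms = retention_hours * 3600 * 1000
    return [t for t in segment_mtimes_ms if now_ms - t > retention_ms]

day_ms = 24 * 3600 * 1000
now = 10 * day_ms
# Segments last written 8 and 9 days ago are deleted; 3 days ago survives.
expired_segments([2 * day_ms, 1 * day_ms, 7 * day_ms], now)
```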
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.