Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss
This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.
1. Background
Integrating Kafka has become standard for high‑concurrency systems, but moving from a basic "it works" setup to a stable, efficient deployment involves many pitfalls. The article draws on real‑world project experience to discuss message loss prevention, duplicate consumption control, performance bottleneck optimization, cluster operation strategies, and design considerations for topics, partitions, and replicas.
2. How to Increase Producer Throughput?
The producer sends messages to the broker using two threads: the main thread writes records into a RecordAccumulator (a set of double-ended queues, one per partition), and a Sender thread pulls completed batches from the accumulator and transmits them to the broker.
Four producer configuration parameters can be tuned:
batch.size: maximum size of a batch sent to the broker (default 16 KB). Increasing it raises throughput but may add latency.
linger.ms: time the sender waits for a batch to fill before sending (default 0 ms). In production a value of 5‑100 ms is recommended.
buffer.memory: total memory allocated to the RecordAccumulator (default 32 MB). Raising it lets more data be buffered when the Sender thread cannot keep up with production.
compression.type: compression algorithm for producer data (none, gzip, snappy, lz4, zstd). Compression reduces network and disk I/O.
These settings work like a bus service: batch.size and linger.ms decide when the bus departs, buffer.memory determines how many passengers can wait at the station, and compression.type lets more passengers share a seat.
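As a concrete starting point, the four settings can be collected into producer properties. The values below are illustrative, not universal recommendations; tune them against your own workload:

```java
import java.util.Properties;

public class ProducerTuningConfig {
    // Throughput-oriented example values; larger batches trade latency for throughput.
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put("batch.size", "65536");        // 64 KB batches (default 16 KB)
        props.put("linger.ms", "20");            // wait up to 20 ms for a batch to fill
        props.put("buffer.memory", "67108864");  // 64 MB accumulator (default 32 MB)
        props.put("compression.type", "lz4");    // modest CPU cost, large network/disk savings
        return props;
    }

    public static void main(String[] args) {
        tunedProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be passed to the KafkaProducer constructor; measure both throughput and end-to-end latency before and after changing them.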
3. How to Ensure No Message Loss?
3.1 Producer‑Side Loss
Because the Kafka producer sends asynchronously, calling producer.send(msg) returns immediately without guaranteeing delivery. Use the callback-enabled API producer.send(msg, callback) to learn whether each send succeeded.
Enable retries (retries) so that transient network glitches trigger automatic resend attempts.
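The behavioral difference between the two send styles can be sketched in plain Java, with a CompletableFuture standing in for the broker acknowledgment. The names below are illustrative stand-ins, not the real Kafka API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

public class SendSketch {
    // Stand-in for producer.send(msg): returns a future the caller often ignores,
    // so a failed send can go completely unnoticed.
    static CompletableFuture<Long> send(String msg) {
        if (msg.isEmpty()) {
            return CompletableFuture.failedFuture(new IllegalStateException("broker rejected message"));
        }
        return CompletableFuture.completedFuture((long) msg.length()); // pretend offset
    }

    // Stand-in for producer.send(msg, callback): every outcome reaches the caller,
    // which can log the failure or trigger an application-level retry.
    static void send(String msg, BiConsumer<Long, Throwable> callback) {
        send(msg).whenComplete(callback);
    }

    public static void main(String[] args) {
        send("");                                // fire-and-forget: the error vanishes
        send("order-42", (offset, err) -> {      // callback: the error is observable
            if (err != null) System.out.println("send failed: " + err.getMessage());
            else System.out.println("acked at offset " + offset);
        });
    }
}
```

With the real KafkaProducer the same principle applies: inspect the exception argument of the callback, and alert or retry when it is non-null.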
3.2 Broker‑Side Loss
Set acks=all so that the leader and all in-sync replicas must acknowledge the write before the producer receives a response. The acks setting accepts: 0 (no ack), 1 (leader ack only), -1/all (ack from all in-sync replicas).
Configure the broker to avoid unclean leader election: unclean.leader.election.enable=false.
Use a replication factor of at least 3 and ensure replication.factor > min.insync.replicas (e.g., replication.factor = min.insync.replicas + 1) so that losing a single replica does not cause data loss.
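Putting the broker-side knobs together, a sketch of the relevant settings follows. The values are illustrative; acks and retries are producer settings, while the rest are broker or topic level:

```properties
# Producer
acks=all                                # wait for all in-sync replicas
retries=2147483647                      # retry transient failures

# Broker / topic
unclean.leader.election.enable=false    # never elect an out-of-sync replica as leader
default.replication.factor=3
min.insync.replicas=2                   # replication.factor = min.insync.replicas + 1
```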
3.3 Consumer‑Side Loss
Consumers track their position with an offset. If auto‑commit is enabled and a failure occurs before the offset is persisted, messages after the last committed offset can be lost.
Disable automatic offset commits (enable.auto.commit=false) and commit offsets manually after processing. If auto-commit is desired, auto.commit.interval.ms (default 5 s) controls how often offsets are committed.
```java
try {
    while (true) {
        ConsumerRecords<String, String> records =
            kafkaConsumer.poll(Duration.ofSeconds(1));
        process(records);                // business logic
        kafkaConsumer.commitAsync();     // non-blocking commit on the hot path
    }
} catch (Exception e) {
    handle(e);
} finally {
    try {
        kafkaConsumer.commitSync();      // blocking commit before shutdown
    } finally {
        kafkaConsumer.close();
    }
}
```

This pattern combines asynchronous commits for regular operation with a final synchronous commit before shutdown, ensuring the last offset is safely stored.
If a CommitFailedException occurs, it usually means a rebalance happened between processing and committing.
4. How to Prevent Duplicate Consumption?
Enable producer idempotence (enable.idempotence=true) to avoid duplicate sends caused by retries.
On the consumer side, ensure that the processed offset matches the committed offset, and consider using unique keys or distributed locks to achieve idempotent processing.
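Idempotent processing can be sketched in plain Java with a set of already-processed business keys. In production the deduplication store would typically be a unique database constraint or a Redis key rather than this in-memory set:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    private final Set<String> processed = new HashSet<>(); // stand-in for a dedup store

    // Runs the action only the first time a given business key is seen,
    // so redelivered messages become harmless no-ops.
    boolean handleOnce(String businessKey, Runnable action) {
        if (!processed.add(businessKey)) {
            return false;          // duplicate delivery: skip side effects
        }
        action.run();              // first delivery: perform side effects
        return true;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        handler.handleOnce("order-42", () -> System.out.println("charged order-42"));
        handler.handleOnce("order-42", () -> System.out.println("charged order-42"));
        // the charge runs only once, even if the message arrives twice
    }
}
```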
5. How to Resolve Message Backlog?
If consumption capacity is insufficient, increase the number of partitions and scale the consumer group accordingly (consumers = partitions).
If downstream processing is slow, increase the number of records fetched per poll (max.poll.records, e.g. from the default 500 to 1000) so consumption can keep pace with production.
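A consumer-side sketch of the fetch settings follows; the values are illustrative and should be validated against your message sizes and processing speed:

```java
import java.util.Properties;

public class BacklogConsumerConfig {
    // Example settings for draining a backlog faster.
    static Properties props() {
        Properties props = new Properties();
        props.put("max.poll.records", "1000");    // records returned per poll (default 500)
        props.put("fetch.min.bytes", "1048576");  // wait until at least 1 MB is available...
        props.put("fetch.max.wait.ms", "500");    // ...but no longer than 500 ms
        return props;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Raising max.poll.records only helps if each poll loop can actually process that many records within max.poll.interval.ms; otherwise add partitions and consumers instead.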
6. How to Preserve Message Order?
Producer: avoid acks=0 and use synchronous sends so each message is confirmed before the next is sent. If retries are enabled, also set max.in.flight.requests.per.connection=1 (or enable idempotence) so that a retried batch cannot overtake a later one; otherwise disable retries.
Consumer: each partition is consumed by a single consumer instance, guaranteeing order within that partition, though this may reduce overall throughput.
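Per-partition ordering is what makes keyed sends useful: all messages with the same key land in the same partition. A simplified sketch of key-based routing follows; the real Kafka default partitioner uses murmur2 hashing, not Java's hashCode:

```java
public class KeyedRouting {
    // Simplified stand-in for Kafka's default partitioner:
    // the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Every event for the same order id lands in the same partition,
        // so its events are consumed in the order they were produced.
        System.out.println(partitionFor("order-42", partitions));
        System.out.println(partitionFor("order-42", partitions));
    }
}
```

The design consequence: choosing the key (order id, user id) chooses the unit within which ordering is guaranteed, while unrelated keys still spread across partitions for parallelism.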
7. What Is a Rebalance?
A rebalance is the process where all consumers in a group agree on the assignment of partitions. During rebalance no consumer can read messages, which can impact TPS.
Increasing the number of partitions for a subscribed topic triggers a rebalance.
Changing the set of topics a group subscribes to triggers a rebalance.
Adding or removing consumer instances triggers a rebalance.
Consumers send heartbeats to the coordinator; if no heartbeat arrives within session.timeout.ms (default 45 s in Kafka 3.0+), the coordinator assumes the consumer is dead and starts a rebalance.
The max.poll.interval.ms setting limits the maximum time between successive poll() calls; exceeding it also forces the consumer to leave the group and triggers a rebalance.
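The liveness-related settings can be sketched together. The values below are illustrative; by convention heartbeat.interval.ms is kept to roughly one third of session.timeout.ms:

```java
import java.util.Properties;

public class RebalanceTimingConfig {
    static Properties props() {
        Properties props = new Properties();
        props.put("session.timeout.ms", "45000");     // declared dead after 45 s of silence
        props.put("heartbeat.interval.ms", "15000");  // heartbeat every 15 s (~1/3 of session)
        props.put("max.poll.interval.ms", "300000");  // at most 5 min between poll() calls
        return props;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If processing a batch can legitimately take longer than max.poll.interval.ms, raise the limit or move the heavy work off the polling thread, rather than letting the group rebalance repeatedly.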
8. Why Is Kafka So Fast for Messaging?
Message partitioning allows parallel processing across many brokers.
Sequential disk reads/writes (log‑structured storage) improve I/O efficiency.
Page cache keeps recent data in memory, turning disk access into memory access.
Zero‑copy transfers reduce context switches and data copying.
Batching groups many messages into a single network request.
Compression reduces the amount of data written to disk and sent over the network.
Instant Consumer Technology Team