Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss
This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.
1. Background
Integrating Kafka has become standard for high‑concurrency systems, but moving from a basic "it works" setup to a stable, efficient deployment involves many pitfalls. The article draws on real‑world project experience to discuss message loss prevention, duplicate consumption control, performance bottleneck optimization, cluster operation strategies, and design considerations for topics, partitions, and replicas.
2. How to Increase Producer Throughput?
The producer sends messages to the broker using two threads: the main thread writes records into a RecordAccumulator (a set of double-ended queues, one per partition), and a Sender thread pulls completed batches from the accumulator and transmits them to the broker.
Four producer configuration parameters can be tuned:
batch.size: maximum size of a batch sent to the broker (default 16 KB). Increasing it raises throughput but may add latency.
linger.ms: time the sender waits for a batch to fill before sending (default 0 ms). In production a value of 5‑100 ms is recommended.
buffer.memory: total memory allocated to the RecordAccumulator (default 32 MB). Raising it lets more data be buffered when the Sender thread cannot keep up with production.
compression.type: compression algorithm for producer data (none, gzip, snappy, lz4, zstd). Compression reduces network and disk I/O.
These settings work like a bus service: batch.size and linger.ms decide when the bus departs, buffer.memory determines how many passengers can wait at the station, and compression.type lets more passengers share a seat.
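As a concrete starting point, the four settings can be collected into producer properties. The values below are illustrative, not universal recommendations; tune them against your own workload:

```java
import java.util.Properties;

public class ProducerTuningConfig {
    // Throughput-oriented example values; larger batches trade latency for throughput.
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put("batch.size", "65536");        // 64 KB batches (default 16 KB)
        props.put("linger.ms", "20");            // wait up to 20 ms for a batch to fill
        props.put("buffer.memory", "67108864");  // 64 MB accumulator (default 32 MB)
        props.put("compression.type", "lz4");    // modest CPU cost, large network/disk savings
        return props;
    }

    public static void main(String[] args) {
        tunedProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be passed to the KafkaProducer constructor; measure both throughput and end-to-end latency before and after changing them.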
3. How to Ensure No Message Loss?
3.1 Producer‑Side Loss
Because the Kafka producer sends asynchronously, calling producer.send(msg) returns immediately without guaranteeing delivery. Use the callback-enabled API producer.send(msg, callback) to learn whether each send succeeded.
Enable retries (retries) so that transient network glitches trigger automatic resend attempts.
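The behavioral difference between the two send styles can be sketched in plain Java, with a CompletableFuture standing in for the broker acknowledgment. The names below are illustrative stand-ins, not the real Kafka API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

public class SendSketch {
    // Stand-in for producer.send(msg): returns a future the caller often ignores,
    // so a failed send can go completely unnoticed.
    static CompletableFuture<Long> send(String msg) {
        if (msg.isEmpty()) {
            return CompletableFuture.failedFuture(new IllegalStateException("broker rejected message"));
        }
        return CompletableFuture.completedFuture((long) msg.length()); // pretend offset
    }

    // Stand-in for producer.send(msg, callback): every outcome reaches the caller,
    // which can log the failure or trigger an application-level retry.
    static void send(String msg, BiConsumer<Long, Throwable> callback) {
        send(msg).whenComplete(callback);
    }

    public static void main(String[] args) {
        send("");                                // fire-and-forget: the error vanishes
        send("order-42", (offset, err) -> {      // callback: the error is observable
            if (err != null) System.out.println("send failed: " + err.getMessage());
            else System.out.println("acked at offset " + offset);
        });
    }
}
```

With the real KafkaProducer the same principle applies: inspect the exception argument of the callback, and alert or retry when it is non-null.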
3.2 Broker‑Side Loss
Set acks=all so that the leader and all in-sync replicas must acknowledge the write before the producer receives a response. The acks setting accepts: 0 (no ack), 1 (leader ack only), -1/all (ack from all in-sync replicas).
Configure the broker to avoid unclean leader election: unclean.leader.election.enable=false.
Use a replication factor of at least 3 and ensure replication.factor > min.insync.replicas (e.g., replication.factor = min.insync.replicas + 1) so that losing a single replica does not cause data loss.
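Putting the broker-side knobs together, a sketch of the relevant settings follows. The values are illustrative; acks and retries are producer settings, while the rest are broker or topic level:

```properties
# Producer
acks=all                                # wait for all in-sync replicas
retries=2147483647                      # retry transient failures

# Broker / topic
unclean.leader.election.enable=false    # never elect an out-of-sync replica as leader
default.replication.factor=3
min.insync.replicas=2                   # replication.factor = min.insync.replicas + 1
```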
3.3 Consumer‑Side Loss
Consumers track their position with an offset. If auto‑commit is enabled and a failure occurs before the offset is persisted, messages after the last committed offset can be lost.
Disable automatic offset commits (enable.auto.commit=false) and commit offsets manually after processing. If auto-commit is desired, auto.commit.interval.ms (default 5 s) controls how often offsets are committed.
```java
try {
    while (true) {
        ConsumerRecords<String, String> records =
            kafkaConsumer.poll(Duration.ofSeconds(1));
        process(records);                // business logic
        kafkaConsumer.commitAsync();     // non-blocking commit on the hot path
    }
} catch (Exception e) {
    handle(e);
} finally {
    try {
        kafkaConsumer.commitSync();      // blocking commit before shutdown
    } finally {
        kafkaConsumer.close();
    }
}
```

This pattern combines asynchronous commits for regular operation with a final synchronous commit before shutdown, ensuring the last offset is safely stored.
If a CommitFailedException occurs, it usually means a rebalance happened between processing and committing.
4. How to Prevent Duplicate Consumption?
Enable producer idempotence (enable.idempotence=true) to avoid duplicate sends caused by retries.
On the consumer side, ensure that the processed offset matches the committed offset, and consider using unique keys or distributed locks to achieve idempotent processing.
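Idempotent processing can be sketched in plain Java with a set of already-processed business keys. In production the deduplication store would typically be a unique database constraint or a Redis key rather than this in-memory set:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    private final Set<String> processed = new HashSet<>(); // stand-in for a dedup store

    // Runs the action only the first time a given business key is seen,
    // so redelivered messages become harmless no-ops.
    boolean handleOnce(String businessKey, Runnable action) {
        if (!processed.add(businessKey)) {
            return false;          // duplicate delivery: skip side effects
        }
        action.run();              // first delivery: perform side effects
        return true;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        handler.handleOnce("order-42", () -> System.out.println("charged order-42"));
        handler.handleOnce("order-42", () -> System.out.println("charged order-42"));
        // the charge runs only once, even if the message arrives twice
    }
}
```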
5. How to Resolve Message Backlog?
If consumption capacity is insufficient, increase the number of partitions and scale the consumer group accordingly (consumers = partitions).
If downstream processing is slow, increase the number of records fetched per poll (max.poll.records, e.g. from the default 500 to 1000) so consumption can keep pace with production.
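A consumer-side sketch of the fetch settings follows; the values are illustrative and should be validated against your message sizes and processing speed:

```java
import java.util.Properties;

public class BacklogConsumerConfig {
    // Example settings for draining a backlog faster.
    static Properties props() {
        Properties props = new Properties();
        props.put("max.poll.records", "1000");    // records returned per poll (default 500)
        props.put("fetch.min.bytes", "1048576");  // wait until at least 1 MB is available...
        props.put("fetch.max.wait.ms", "500");    // ...but no longer than 500 ms
        return props;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Raising max.poll.records only helps if each poll loop can actually process that many records within max.poll.interval.ms; otherwise add partitions and consumers instead.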
6. How to Preserve Message Order?
Producer: avoid acks=0 and use synchronous sends so each message is confirmed before the next is sent. If retries are enabled, also set max.in.flight.requests.per.connection=1 (or enable idempotence) so that a retried batch cannot overtake a later one; otherwise disable retries.
Consumer: each partition is consumed by a single consumer instance, guaranteeing order within that partition, though this may reduce overall throughput.
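Per-partition ordering is what makes keyed sends useful: all messages with the same key land in the same partition. A simplified sketch of key-based routing follows; the real Kafka default partitioner uses murmur2 hashing, not Java's hashCode:

```java
public class KeyedRouting {
    // Simplified stand-in for Kafka's default partitioner:
    // the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Every event for the same order id lands in the same partition,
        // so its events are consumed in the order they were produced.
        System.out.println(partitionFor("order-42", partitions));
        System.out.println(partitionFor("order-42", partitions));
    }
}
```

The design consequence: choosing the key (order id, user id) chooses the unit within which ordering is guaranteed, while unrelated keys still spread across partitions for parallelism.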
7. What Is a Rebalance?
A rebalance is the process where all consumers in a group agree on the assignment of partitions. During rebalance no consumer can read messages, which can impact TPS.
Increasing the number of partitions for a subscribed topic triggers a rebalance.
Changing the set of topics a group subscribes to triggers a rebalance.
Adding or removing consumer instances triggers a rebalance.
Consumers send heartbeats to the coordinator; if no heartbeat arrives within session.timeout.ms (default 45 s in Kafka 3.0+), the coordinator assumes the consumer is dead and starts a rebalance.
The max.poll.interval.ms setting limits the maximum time between successive poll() calls; exceeding it also forces the consumer to leave the group and triggers a rebalance.
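The liveness-related settings can be sketched together. The values below are illustrative; by convention heartbeat.interval.ms is kept to roughly one third of session.timeout.ms:

```java
import java.util.Properties;

public class RebalanceTimingConfig {
    static Properties props() {
        Properties props = new Properties();
        props.put("session.timeout.ms", "45000");     // declared dead after 45 s of silence
        props.put("heartbeat.interval.ms", "15000");  // heartbeat every 15 s (~1/3 of session)
        props.put("max.poll.interval.ms", "300000");  // at most 5 min between poll() calls
        return props;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If processing a batch can legitimately take longer than max.poll.interval.ms, raise the limit or move the heavy work off the polling thread, rather than letting the group rebalance repeatedly.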
8. Why Is Kafka So Fast for Messaging?
Message partitioning allows parallel processing across many brokers.
Sequential disk reads/writes (log‑structured storage) improve I/O efficiency.
Page cache keeps recent data in memory, turning disk access into memory access.
Zero‑copy transfers reduce context switches and data copying.
Batching groups many messages into a single network request.
Compression reduces the amount of data written to disk and sent over the network.
Instant Consumer Technology Team