Common Causes of Kafka Message Loss and Mitigation Strategies

This article examines the typical reasons Kafka messages are lost across producers, brokers, and consumers, and provides detailed configuration recommendations and best‑practice solutions to significantly reduce the risk of data loss in distributed streaming systems.

Java Captain

Source: https://www.cnblogs.com/fhey/p/18258981

Preface

Kafka message loss usually involves multiple aspects, including the configuration and behavior of the producer, consumer, and Kafka broker. The following sections analyze common causes and provide solutions and best practices.

1. Scenarios Where the Producer Causes Message Loss

Scenario 1: Message Payload Too Large

The message size exceeds the broker's message.max.bytes value, causing the broker to return an error.

Solution:

1. Reduce the size of the producer's message payload

Compress the payload or remove unnecessary fields to shrink the message.

2. Adjust the max.request.size parameter

max.request.size caps the size of a single request the producer will send (default 1 MB, i.e. 1048576 bytes). Keep it no larger than the broker's message.max.bytes.
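As a minimal sketch (the values shown are the defaults; the broker-side limit is quoted for the comparison only), the producer-side cap can be set and sanity-checked against the broker-side limit:

```java
import java.util.Properties;

public class ProducerSizeConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Producer-side cap on a single request (default 1048576 bytes = 1 MB).
        props.setProperty("max.request.size", "1048576");
        // Compression shrinks batches before they reach the broker's size check.
        props.setProperty("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // Broker-side limit message.max.bytes defaults to 1048588 bytes;
        // the producer cap must not exceed it.
        long brokerMessageMaxBytes = 1048588L;
        long producerMaxRequestSize = Long.parseLong(props.getProperty("max.request.size"));
        System.out.println(producerMaxRequestSize <= brokerMessageMaxBytes);  // true
    }
}
```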

Scenario 2: Asynchronous Send Mechanism

The Kafka producer uses asynchronous sending by default; if the send result is not handled correctly, messages may be lost.

Solution:

1. Use a send method with a callback

Instead of producer.send(msg), use producer.send(msg, callback). The callback is invoked with the result of each send, so failed sends can be logged and retried instead of being dropped silently.
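A minimal sketch of the callback pattern (assumes kafka-clients on the classpath, a broker at localhost:9092, and a hypothetical orders topic; the retry/re-enqueue logic is left as a comment):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CallbackSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "payload"),
                (metadata, exception) -> {
                    if (exception != null) {
                        // Send failed even after client-side retries:
                        // log it and re-enqueue or dead-letter instead of dropping.
                        System.err.println("Send failed: " + exception.getMessage());
                    } else {
                        System.out.printf("Delivered to %s-%d@offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}
```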

Scenario 3: Network Issues and Misconfiguration

Network jitter or interruptions can prevent messages from reaching the broker. If the producer does not configure appropriate retry (retries) and acknowledgment (acks) parameters, messages may be lost during unstable network conditions.

Solution:

1. Set acks to "all"

The acks parameter determines how many replica acknowledgments are required before the producer considers a write successful. Options:

0: The producer does not wait for any acknowledgment.

1: Only the leader replica must acknowledge.

all/-1: All in-sync replicas must acknowledge.

Using synchronous sends (or handling the send callback) together with acks=all ensures that every in-sync replica has received the message before success is reported.

2. Configure retry parameters

Key retry parameters are retries and retry.backoff.ms.

(1) retries specifies the number of retry attempts (default 0 in older producer clients; since Kafka 2.1 the default is Integer.MAX_VALUE, with the total retry time bounded by delivery.timeout.ms). A typical robust setting is:

retries = Integer.MAX_VALUE
max.in.flight.requests.per.connection = 1

(2) retry.backoff.ms defines the interval between retries (default 100 ms). Adjust it based on expected recovery time.

3. Set min.insync.replicas

Configure min.insync.replicas to at least 2 (default 1) so the producer receives an error when fewer than two replicas hold the message, preventing silent loss. Note that this check takes effect only when acks is set to all/-1.
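The producer-side settings above can be collected into one configuration sketch (min.insync.replicas is a broker/topic setting, so it appears only as a comment):

```java
import java.util.Properties;

public class ReliableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Wait for every in-sync replica before treating a send as successful.
        props.setProperty("acks", "all");
        // Retry (practically) forever; delivery.timeout.ms still bounds total time.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Pause between retry attempts (default 100 ms).
        props.setProperty("retry.backoff.ms", "100");
        // One in-flight request per connection so retries cannot reorder messages.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        // Broker/topic side (not a producer property): min.insync.replicas=2
        return props;
    }
}
```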

2. Scenarios Where the Broker Causes Message Loss

Scenario 1: Broker Crash

Kafka writes messages to the page cache first and flushes them to disk asynchronously. If the leader broker crashes before flushing and no follower can take over, unflushed messages are lost.

Solution:

1. Increase replica count

Set replication.factor ≥ 3 and configure min.insync.replicas ≥ 2 so that at least two brokers store each message.

Scenario 2: Leader Failure with Unsynced Followers

If the leader broker fails and a follower that has not fully synchronized becomes the new leader, some messages may be missing.

Solution:

1. Control leader election eligibility

Set unclean.leader.election.enable to false (the default since Kafka 0.11) so that replicas outside the ISR can never be elected leader.

2. Increase replica count

Same as Scenario 1.

Scenario 3: Persistence Errors

The broker persists data to the page cache first and flushes it to disk based on message-count or time thresholds. If the OS crashes before a flush completes, unflushed data is lost.

Solution:

1. Adjust flush parameters

Kafka provides the following flush settings (officially not recommended to force flush, as replication should ensure reliability):

log.flush.interval.messages: Number of messages accumulated on a partition before a flush is forced (default Long.MAX_VALUE).

log.flush.interval.ms: Maximum time a message may sit in the log before a flush (default null, which falls back to log.flush.scheduler.interval.ms).

log.flush.scheduler.interval.ms: Interval at which the log flusher checks whether any log needs flushing (default Long.MAX_VALUE).

2. Increase replica count

Same as Scenario 1.

3. Scenarios Where the Consumer Causes Message Loss

Scenario 1: Offset Commit After Processing Failure

The enable.auto.commit parameter defaults to true, committing offsets automatically in the background (every auto.commit.interval.ms, default 5 s). If processing fails after an offset has already been committed, that message is effectively lost.

Solution:

Set enable.auto.commit to false and commit offsets manually after successful processing.
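A minimal consume-then-commit sketch (assumes kafka-clients on the classpath, a broker at localhost:9092, and a hypothetical orders topic; handleRecord stands in for real business logic):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-workers");
        props.put("enable.auto.commit", "false");  // commit only after processing succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handleRecord(record);  // if this throws, the offset is never committed
                }
                consumer.commitSync();  // commit only after the whole batch is processed
            }
        }
    }

    private static void handleRecord(ConsumerRecord<String, String> record) {
        System.out.printf("Processing %s@%d: %s%n",
            record.topic(), record.offset(), record.value());
    }
}
```

On failure the offsets stay uncommitted, so the records are redelivered after restart; this trades possible duplicates for no loss, which is why processing should be idempotent.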

Scenario 2: Concurrent Consumption

Multi‑threaded consumption can cause offset update conflicts, leading to lost messages.

Solution:

Prefer single‑threaded consumption for loss‑sensitive workloads, or disable auto‑commit and manage offsets manually when using multiple threads.

Scenario 3: Message Backlog

If the consumer cannot keep up with the production rate, a backlog builds. Kafka itself does not drop backlogged messages, but if the backlog outlives the topic's retention limits (retention.ms / retention.bytes), the broker deletes the unconsumed segments and those messages are lost.

Solution:

Improve consumer processing speed, e.g., offload heavy logic to separate threads.

Scenario 4: Consumer Group Rebalance

A rebalance can cause loss in two ways: a consumer that exceeds its timeout is ejected from the group while still holding unprocessed messages, or offsets for not-yet-processed messages are auto-committed before partitions are reassigned.

Solution:

1. Increase consumption speed

Process each message faster; use asynchronous handling or multithreading where appropriate.

2. Tune parameters to avoid unnecessary rebalance

Key parameters:

session.timeout.ms: Heartbeat timeout used to detect failed consumers.

max.poll.interval.ms: Maximum interval between poll() calls (default 5 min). Increase it if batch processing can legitimately take longer.

heartbeat.interval.ms: Must be well below session.timeout.ms (typically about one third) so the session stays alive.

max.poll.records: Maximum records returned per poll (default 500). Lower it if per-record processing is slow.
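As a sketch, the four settings above might be tuned like this (the concrete values are illustrative, not recommendations for every workload):

```java
import java.util.Properties;

public class RebalanceTuning {
    public static Properties build() {
        Properties props = new Properties();
        // How long the coordinator waits for heartbeats before ejecting a consumer.
        props.setProperty("session.timeout.ms", "45000");
        // Heartbeats must arrive well within the session timeout (~1/3 of it).
        props.setProperty("heartbeat.interval.ms", "15000");
        // Allow up to 10 minutes between poll() calls for slow batch processing.
        props.setProperty("max.poll.interval.ms", "600000");
        // Fetch smaller batches so each poll() finishes well within the limit.
        props.setProperty("max.poll.records", "200");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        long session = Long.parseLong(props.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(props.getProperty("heartbeat.interval.ms"));
        System.out.println(heartbeat < session);  // true
    }
}
```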

Scenarios Where Messages May Still Be Lost

Scenario 1

If the page cache holds a large amount of data and acknowledges success to the producer before flushing to disk, a system crash can cause loss.

Scenario 2

If the disks of all replica servers fail, data will be lost.

Summary

In summary, Kafka message loss is a multi‑layer issue involving producers, brokers, and consumers. By applying proper configurations, monitoring, and mitigation strategies, the risk of loss can be greatly reduced, ensuring reliable data transmission in distributed systems.

Written by

Java Captain

Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full-stack Java development.