Common Causes of Kafka Message Loss and Mitigation Strategies
This article examines the typical reasons Kafka messages are lost across producers, brokers, and consumers, and provides detailed configuration recommendations and best‑practice solutions to significantly reduce the risk of data loss in distributed streaming systems.
Source
https://www.cnblogs.com/fhey/p/18258981
Preface
Kafka message loss usually involves multiple aspects, including the configuration and behavior of the producer, consumer, and Kafka broker. The following sections analyze common causes and provide solutions and best practices.
1. Scenarios Where the Producer Causes Message Loss
Scenario 1: Message Payload Too Large
The message size exceeds the broker's message.max.bytes value, causing the broker to return an error.
Solution:
1. Reduce the size of the producer's message payload
Compress the payload or remove unnecessary fields to shrink the message.
2. Adjust the max.request.size parameter
max.request.size defines the maximum size of a single request the producer will send (default 1 MB), which also caps the largest record it can produce. This value must not exceed the broker's message.max.bytes .
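A minimal sketch of a producer configuration addressing this scenario, using plain java.util.Properties (the broker address is a placeholder, and the compression choice is an illustrative assumption, not prescribed by the article):

```java
import java.util.Properties;

public class ProducerSizeConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        // Cap each request at 1 MB; keep this at or below the broker's
        // message.max.bytes so oversized sends fail on the client, not the broker.
        props.put("max.request.size", "1048576");
        // Compress batches to shrink payloads on the wire (lz4 is one option).
        props.put("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("max.request.size"));
    }
}
```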
Scenario 2: Asynchronous Send Mechanism
The Kafka producer uses asynchronous sending by default; if the send result is not handled correctly, messages may be lost.
Solution:
1. Use a send method with a callback
Instead of producer.send(msg) , use producer.send(msg, callback) . The callback allows retry handling for failed sends.
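The retry-in-callback pattern can be sketched without a running broker. The class below is a stdlib-only simulation: SimulatedSender stands in for KafkaProducer.send(record, callback) (it invokes the callback synchronously for simplicity, whereas the real client invokes it asynchronously), and the first two sends are made to fail to exercise the retry path:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

public class CallbackRetryDemo {
    // Stand-in for KafkaProducer.send(record, callback): fails the first
    // `failures` attempts, then succeeds; either way the callback fires.
    static class SimulatedSender {
        private int remainingFailures;
        SimulatedSender(int failures) { this.remainingFailures = failures; }
        void send(String msg, BiConsumer<String, Exception> callback) {
            if (remainingFailures-- > 0) {
                callback.accept(null, new RuntimeException("transient send failure"));
            } else {
                callback.accept("offset@" + msg, null); // metadata on success
            }
        }
    }

    // Resend until the callback reports success or attempts run out.
    static boolean sendWithRetry(SimulatedSender sender, String msg,
                                 int maxAttempts, AtomicInteger attempts) {
        final boolean[] ok = {false};
        for (int i = 0; i < maxAttempts && !ok[0]; i++) {
            attempts.incrementAndGet();
            sender.send(msg, (metadata, exception) -> {
                if (exception == null) ok[0] = true; // success: stop retrying
                // on exception: fall through and let the loop resend
            });
        }
        return ok[0];
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        boolean ok = sendWithRetry(new SimulatedSender(2), "order-1", 5, attempts);
        System.out.println(ok + " after " + attempts.get() + " attempts");
    }
}
```

With two simulated failures, the send succeeds on the third attempt. In a real application the callback would typically log the exception and either resend or route the record to a dead-letter store.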
Scenario 3: Network Issues and Misconfiguration
Network jitter or interruption can prevent messages from reaching the broker. If the producer does not configure appropriate retry ( retries ) and acknowledgment ( acks ) parameters, messages may be lost during unstable network conditions.
Solution:
1. Set acks to "all"
The acks parameter determines how many replica acknowledgments are required before the producer considers the write successful. Options:
all/-1: All in‑sync replicas must acknowledge.
0: The client does not wait for any acknowledgment.
1: The leader replica must acknowledge.
Using synchronous sends, or ensuring acks is set to "all", guarantees that every in‑sync replica has received the message before the send is considered successful.
2. Configure retry parameters
Key retry parameters are retries and retry.backoff.ms .
(1) retries specifies the number of retry attempts (default 0 in older clients; since Kafka 2.1 the default is Integer.MAX_VALUE). A typical robust setting is:

retries = Integer.MAX_VALUE
max.in.flight.requests.per.connection = 1

Limiting in‑flight requests to 1 preserves message ordering when a retry occurs.

(2) retry.backoff.ms defines the interval between retries (default 100 ms). Adjust it based on the expected recovery time of transient failures.
3. Set min.insync.replicas
Configure min.insync.replicas (a broker/topic‑level setting) to at least 2 (default 1) so that, with acks=all, the producer receives an error when fewer than two replicas are in sync, preventing silent loss.
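The producer-side settings from this section can be collected into one configuration sketch (the broker address is a placeholder; min.insync.replicas is set on the broker or topic, not here):

```java
import java.util.Properties;

public class DurableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("acks", "all");                           // wait for all in-sync replicas
        props.put("retries", String.valueOf(Integer.MAX_VALUE)); // retry transient failures indefinitely
        props.put("retry.backoff.ms", "100");               // pause between retries
        // One in-flight request per connection keeps ordering intact across retries.
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("acks"));
    }
}
```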
2. Scenarios Where the Broker Causes Message Loss
Scenario 1: Broker Crash
Kafka writes messages to the page cache first and flushes them to disk asynchronously. If the leader broker crashes before flushing and no follower can take over, unflushed messages are lost.
Solution:
1. Increase replica count
Set replication.factor ≥ 3 and configure min.insync.replicas ≥ 2 so that at least two brokers store each message.
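One way to apply both settings is at topic creation time with the stock CLI (topic name, partition count, and broker address are placeholders):

```shell
# Create a topic with 3 replicas and require at least 2 in-sync
# replicas to acknowledge each write.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```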
Scenario 2: Leader Failure with Unsynced Followers
If the leader broker fails and a follower that has not fully synchronized becomes the new leader, some messages may be missing.
Solution:
1. Control leader election eligibility
Set unclean.leader.election.enable to false to prevent out‑of‑ISR replicas from becoming leader.
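As a broker-side fragment in server.properties (it can also be overridden per topic):

```properties
# Never elect a replica that has fallen out of the ISR as leader;
# prefer unavailability over silent data loss.
unclean.leader.election.enable=false
```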
2. Increase replica count
Same as Scenario 1.
Scenario 3: Persistence Errors
The broker persists data to the page cache and flushes it to disk based on message‑count or time thresholds. If the OS crashes before a flush completes, the unflushed data is lost.
Solution:
1. Adjust flush parameters
Kafka provides the following flush settings (officially not recommended to force flush, as replication should ensure reliability):
log.flush.interval.messages : Number of messages per flush (default Long.MAX_VALUE ).
log.flush.interval.ms : Time interval between flushes (default null ).
log.flush.scheduler.interval.ms : Scheduler interval for periodic flushes (default Long.MAX_VALUE ).
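Expressed as a server.properties fragment (the numeric values are illustrative assumptions, and, as noted above, forcing flushes is officially discouraged in favor of replication):

```properties
# Flush after this many messages (default: Long.MAX_VALUE, i.e. effectively never)
log.flush.interval.messages=10000
# Flush after this many milliseconds (default: unset, falls back to the scheduler)
log.flush.interval.ms=1000
# How often the background flusher checks whether a flush is due
log.flush.scheduler.interval.ms=1000
```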
2. Increase replica count
Same as Scenario 1.
3. Scenarios Where the Consumer Causes Message Loss
Scenario 1: Offset Commit After Processing Failure
The enable.auto.commit parameter defaults to true , automatically committing offsets. If processing fails after the commit, the message is lost.
Solution:
Set enable.auto.commit to false and commit offsets manually after successful processing.
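A minimal consumer configuration for this pattern, again as plain Properties (group id and broker address are placeholders); with kafka-clients, the poll loop would then call consumer.commitSync() only after a batch has been fully processed:

```java
import java.util.Properties;

public class ManualCommitConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "order-processors");        // placeholder group
        // Disable auto-commit; the application commits offsets explicitly
        // after each record (or batch) has been processed successfully.
        props.put("enable.auto.commit", "false");
        // Where to start when the group has no committed offset yet.
        props.put("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("enable.auto.commit"));
    }
}
```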
Scenario 2: Concurrent Consumption
Multi‑threaded consumption can cause offset update conflicts, leading to lost messages.
Solution:
Prefer single‑threaded consumption for loss‑sensitive workloads, or disable auto‑commit and manage offsets manually when using multiple threads.
Scenario 3: Message Backlog
If the consumer cannot keep up with the production rate, unconsumed messages accumulate; once the backlog outlives the topic's retention limits ( retention.ms / retention.bytes ), Kafka deletes the old segments and those messages are lost before they are ever read.
Solution:
Improve consumer processing speed, e.g., offload heavy logic to separate threads.
Scenario 4: Consumer Group Rebalance
Rebalance can cause loss in two ways: a consumer times out (for example, a poll gap exceeds max.poll.interval.ms ) and is evicted from the group, or offsets are not committed correctly while partitions change hands during the rebalance.
Solution:
1. Increase consumption speed
Process each message faster; use asynchronous handling or multithreading where appropriate.
2. Tune parameters to avoid unnecessary rebalance
Key parameters:
session.timeout.ms : Heartbeat timeout for detecting failed consumers.
max.poll.interval.ms : Maximum interval between polls (default 5 min). Increase to reduce rebalance frequency.
heartbeat.interval.ms : Must be smaller than session.timeout.ms to keep the session alive.
max.poll.records : Maximum records returned per poll (default 500). Adjust based on message rate.
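The tuning parameters above can be sketched as one consumer configuration; the specific values here are illustrative assumptions and should be sized to your processing latency:

```java
import java.util.Properties;

public class RebalanceTuningConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Time before a consumer that stops heartbeating is evicted.
        props.put("session.timeout.ms", "45000");
        // Keep well below session.timeout.ms (roughly one third is common).
        props.put("heartbeat.interval.ms", "15000");
        // Allow up to 10 minutes between polls for slow per-batch processing.
        props.put("max.poll.interval.ms", "600000");
        // Smaller batches are more likely to finish before the poll deadline.
        props.put("max.poll.records", "200");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("max.poll.records"));
    }
}
```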
Scenarios Where Messages May Still Be Lost
Scenario 1
If the page cache holds a large amount of data and acknowledges success to the producer before flushing to disk, a system crash can cause loss.
Scenario 2
If the disks of all replica servers fail, data will be lost.
Summary
In summary, Kafka message loss is a multi‑layer issue involving producers, brokers, and consumers. By applying proper configurations, monitoring, and mitigation strategies, the risk of loss can be greatly reduced, ensuring reliable data transmission in distributed systems.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.