
Analysis and Fix of a Redis Replication Offset Bug Caused by Full‑Sync Interference

This article details a deep‑rooted Redis replication bug where slave offsets become larger than the master after a network interruption, explains the underlying cause in the full‑sync process, describes the code changes made to fix it, and offers testing and prevention guidance.

Baidu Intelligent Testing

Background

On July 31, 2015, a top‑of‑rack (ToR) switch failure caused a one‑hour network outage; after recovery, every Redis slave except slave0 showed an offset larger than the master's.

Redis uses offsets for incremental synchronization: a slave sends its current offset to the master, and the master sends only the data after that offset. A slave offset larger than the master's therefore indicates an error in the synchronization logic.

Impact

Redis replication has three sync modes:

Full sync (initial connection or incremental sync failure) – master dumps all data to the slave.

Incremental sync (brief connection loss) – master sends data after the slave’s reported offset.

Long‑connection sync (normal operation) – master streams all write requests to the slave.

The bug manifests during incremental sync, in two scenarios:

If the slave's offset is still larger than the master's when an incremental sync is attempted, the sync fails and falls back to a full sync; the slave's data is unaffected.

If the master's offset has since grown past the slave's inflated offset, the incremental sync skips the writes in between, leaving the slave with fewer records than the master.

Bug Localization

The anomaly originates from the replica_offset handling in the slave. The initialization and update logic are illustrated in the following diagrams:

When a slave receives a "+FULLRESYNC" response, it updates its offset accordingly.

During normal long‑connection sync, the offset increments with each request:

Combining the initialization and update logic shows that the offset anomaly is triggered by full syncs, because:

During the outage, the master‑slave connection is broken, so normal sync cannot cause the anomaly.

During normal sync, the slave's offset can only lag behind the master's, never exceed it.

Source Code Analysis

The master’s incremental‑sync handling is simplified in the diagram below:

The master first sets psync_offset to its current offset, then checks for an ongoing BGSAVE. If another slave is already performing BGSAVE, the master copies that slave’s output buffer to the new slave, which leads to the bug:

When the first slave performs a full sync, the master returns its current offset o1, starts BGSAVE, and writes subsequent writes to repl_buffer. If a second slave requests a full sync before that BGSAVE finishes, the master returns its now larger current offset o2 (o2 > o1) yet reuses the first slave's buffer, which accumulates writes starting from o1. The second slave thus counts those buffered writes on top of o2, and its offset ends up larger than the master's.

Fix Implementation

After identifying the root cause, the developers changed the code so that during a full sync the master no longer returns its current offset; instead it returns the offset saved when BGSAVE started, stored in server.fullsync_repl_offset. This ensures that every slave attached to the same BGSAVE receives the same offset.

Testing

Post‑fix regression testing verifies the master‑slave sync flow. The diagram below shows the updated logic:

If a slave requests a full sync while the master is already performing BGSAVE, the master marks the slave as WAIT_BGSAVE_START and later resumes sync after BGSAVE completes, ensuring the slave’s offset is not larger than the master’s.

New Issue

A related problem occurs when the master is already in BGSAVE:

The slave requests a full sync; the master returns offset o1 and marks the slave WAIT_BGSAVE_START.

After the ongoing BGSAVE finishes, the master starts a new BGSAVE for the waiting slave, now at offset o2, and writes subsequent incremental data to the slave's buffer.

Once the dump is transferred, the master's offset is o2 + inc, while the slave's is o1 + inc, with o1 < o2.

This leaves the slave's offset smaller than the master's, so after a brief network loss the incremental sync resends data the slave already holds, producing duplicate writes and data inconsistency.

Resolution from Redis Community

The Redis maintainers provided a fix (see commit) that changes the full‑sync response: the master no longer returns its current offset immediately but buffers it as incremental data.

Conclusion

This hidden, large‑scale bug demonstrates how difficult it is to detect low‑level service issues. Even with extensive automated test suites (over 200 cases), such bugs can persist. Effective code review (CR) and thorough testing are essential; reviewers should map out execution flows, examine variable initialization, memory management, and branch logic to uncover subtle defects.

Testing Tips

Increase CR effort and understand the code deeply to write comprehensive test cases.

During CR, draw flowcharts (at least mentally) to analyze logic paths.

Check variable initialization, possible dirty values, dynamic memory allocation/release, and branch correctness.

Bug Localization Tips

Identify the exact variable or function causing the issue.

Read the code to locate initialization and call sites.

Analyze why a variable or function may fail and verify the corresponding logic.
