
Root Cause Analysis and Resolution of Data Inconsistency in Transactional MQ Processing

This article walks through a real‑world investigation of intermittent refund order failures caused by incorrect use of transactions together with message queues. It explains the step‑by‑step debugging process, identifies a large‑transaction timing issue, and presents a concrete fix: defer MQ sending until after the transaction commits.

Zhuanzhuan Tech

1 Conclusion

Conclusion first: the usage pattern of transactions combined with MQ must be correct; a slight mistake can easily lead to data inconsistency.

2 Problem Background and Symptoms

In a commercial refund system, several refund orders failed each week because the update of the cost field failed. The workflow is roughly: System A processes an order and sends an MQ message; System B consumes the message, pulls the order information from System A, and updates its own table.

3 Investigation Process

3.1 Initial Analysis

A code review of System B found that MQ consumption was not protected by a lock, so two messages from System A (tags create and update) for the same order could be processed concurrently. The cost field is written in the update branch; without an ordering guarantee, the create branch could overwrite that value afterwards, so the update appeared to fail.
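
The missing guard can be sketched as a per‑order lock around the consumer logic. This is a minimal single‑process illustration, not System B's actual code: the class `RefundMessageConsumer`, its methods, and the in‑memory map standing in for the table are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical System B consumer; names are illustrative, not from the original code.
class RefundMessageConsumer {
    // In-memory stand-in for System B's table: order id -> cost.
    private final Map<String, Long> costByOrder = new ConcurrentHashMap<>();
    // One lock object per order id, so unrelated orders do not block each other.
    private final Map<String, Object> locks = new ConcurrentHashMap<>();

    void onMessage(String tag, String orderId, long cost) {
        Object lock = locks.computeIfAbsent(orderId, id -> new Object());
        synchronized (lock) {
            if ("create".equals(tag)) {
                // The check-then-write is atomic only because the per-order lock is held.
                // A late create must not reset a cost already written by update.
                if (!costByOrder.containsKey(orderId)) {
                    costByOrder.put(orderId, 0L);
                }
            } else if ("update".equals(tag)) {
                costByOrder.put(orderId, cost);
            }
        }
    }

    Long costOf(String orderId) {
        return costByOrder.get(orderId);
    }
}
```

With this guard, an update followed by a late create leaves the cost intact. A service running on several instances would need a distributed lock or a database‑level guard instead of an in‑process lock; that is an assumption about the deployment, not something the article states.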

3.2 Problem Reappears

Two weeks later the same failure recurred. Logs showed that System B, while consuming an update message, read stale data (cost=0) from System A: the message had been sent before System A's transaction committed, so System B applied an outdated value.

3.2.1 Analysis Idea 1

We first considered master‑slave replication lag, but System A’s cluster has no slave, so this was ruled out.

3.2.2 Analysis Idea 2

We then checked the transaction isolation level (repeatable‑read) and confirmed it was not the cause.

3.2.3 Analysis Idea 3

The architecture lead pointed out a large‑transaction problem: the updateCharge(CpsOrderDO cpsOrderDO) method sends the MQ message inside several nested layers of transaction, so the message can go out before the outermost transaction commits. The updateCostSuccess interface also performs three steps within the same large transaction, which delays the commit further.
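
The timing problem can be reproduced in a toy model. The sketch below is a deterministic simulation under stated assumptions, not the real system: `ToyDatabase` is a hypothetical stand‑in in which writes become visible to other readers only on commit, and the consumer is called synchronously to represent a message that is delivered before the commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model: "committed" is what other systems can read; "pending" holds uncommitted writes.
class ToyDatabase {
    private final Map<String, Long> committed = new HashMap<>();
    private final Map<String, Long> pending = new HashMap<>();

    void write(String orderId, long cost) { pending.put(orderId, cost); }

    void commit() { committed.putAll(pending); pending.clear(); }

    // Readers outside the transaction see committed data only.
    long readCommitted(String orderId) { return committed.getOrDefault(orderId, 0L); }
}

class SendBeforeCommitDemo {
    // Returns the cost System B observes when the message is sent inside the transaction.
    static long costSeenByConsumer() {
        ToyDatabase db = new ToyDatabase();
        List<Long> observed = new ArrayList<>();
        // Hypothetical System B handler: on message, pull the cost from System A.
        Consumer<String> systemB = orderId -> observed.add(db.readCommitted(orderId));

        db.write("order-1", 500L);  // step 1: update cost inside the large transaction
        systemB.accept("order-1");  // step 2: message sent and consumed before commit
        db.commit();                // step 3: outer transaction commits too late
        return observed.get(0);
    }

    public static void main(String[] args) {
        System.out.println("cost seen by System B: " + costSeenByConsumer());
    }
}
```

Running `main` prints `cost seen by System B: 0`, matching the stale cost=0 seen in the logs. In production the outcome depends on timing: the longer the outer transaction keeps working after the send, the more likely the consumer wins the race.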

Further diagrams illustrate the flow.

3.3 Solution

Testing confirmed the diagnosis: deliberately delaying the transaction commit after the MQ message had been sent reproduced the issue. The fix is to send the MQ message only after the transaction has committed successfully. No further alerts were observed after the fix was deployed.
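
One way to express "send only after commit" is to register the send as a callback that the transaction runs once the commit has succeeded. The sketch below uses a hypothetical `ToyTransaction` class to show the shape of the idea; it is not the code that was deployed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal transaction that runs registered callbacks only after a successful commit.
class ToyTransaction {
    private final List<Runnable> afterCommit = new ArrayList<>();
    private boolean finished = false;

    void registerAfterCommit(Runnable action) {
        if (finished) {
            throw new IllegalStateException("transaction already finished");
        }
        afterCommit.add(action);
    }

    void commit() {
        finished = true;
        // The data is visible to other readers now, so it is safe to notify them.
        for (Runnable action : afterCommit) {
            action.run();
        }
    }

    void rollback() {
        finished = true;
        afterCommit.clear(); // nothing was committed, so nothing is sent
    }
}

class AfterCommitDemo {
    // Returns the messages that were actually sent for a committed or rolled-back transaction.
    static List<String> run(boolean succeed) {
        List<String> sent = new ArrayList<>();
        ToyTransaction tx = new ToyTransaction();
        // The business write (for example, updating the cost) would happen here.
        tx.registerAfterCommit(() -> sent.add("update:order-1"));
        if (succeed) {
            tx.commit();
        } else {
            tx.rollback();
        }
        return sent;
    }
}
```

If System A is built on Spring (an assumption; the article does not say), the same effect is usually achieved with `TransactionSynchronizationManager.registerSynchronization` and an `afterCommit()` callback, or with `@TransactionalEventListener`. Note the trade‑off: if the send fails after the commit, the message is lost unless it is retried or a transactional MQ is used.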

The complete process diagram is shown below.

4 Summary and Reflection

Experienced engineers may spot such issues quickly, but less experienced ones can miss subtle concurrency problems. Data inconsistency typically stems from data redundancy or poor concurrency control. Recommendations: evaluate whether redundant data is really needed; add monitoring for critical consistency checks; thoroughly review any code that combines large transactions with MQ; use transactional MQ so that the business logic and the message either both succeed or both fail; and avoid wrapping whole methods in a single large transaction.
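
The consistency‑check recommendation can be as simple as a periodic reconciliation that compares the cost held by both systems. The sketch below assumes each side can export an order id to cost map; `CostReconciler` and its inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

class CostReconciler {
    // Returns, in sorted order, the ids whose cost differs between the two systems
    // or that exist in only one of them.
    static List<String> findMismatches(Map<String, Long> systemA, Map<String, Long> systemB) {
        TreeSet<String> ids = new TreeSet<>(systemA.keySet());
        ids.addAll(systemB.keySet());
        List<String> mismatched = new ArrayList<>();
        for (String id : ids) {
            if (!Objects.equals(systemA.get(id), systemB.get(id))) {
                mismatched.add(id);
            }
        }
        return mismatched;
    }
}
```

A scheduled job could run this over recently changed orders and raise an alert whenever the returned list is non‑empty.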

Author Introduction

Yang Ying, backend engineer at Zhuanzhuan, responsible for commercial B‑side systems such as lead management, customer operations, sales operation management, and ad publishing.
