Root Cause Analysis and Resolution of Data Inconsistency in Transactional MQ Processing
This article walks through a real-world investigation of intermittent refund-order failures caused by incorrect use of transactions together with a message queue (MQ). It covers the step-by-step debugging process, shows how a large transaction allowed the MQ message to be sent before the data was committed, and presents the concrete fix: sending the MQ message only after the transaction commits.
1 Conclusion
Conclusion first: transactions and MQ must be combined correctly. A small mistake, such as sending a message before the transaction commits, can easily lead to data inconsistency.
2 Problem Background and Symptoms
In a commercial refund system, several refund orders failed each week because the update to the cost field did not take effect. The workflow is roughly as follows: System A processes an order and sends an MQ message; System B consumes the message, pulls the order information from System A, and updates its own table.
3 Investigation Process
3.1 Initial Analysis
A code review of System B found that MQ consumption was not protected by a lock, so two messages from System A (tags create and update) could be processed concurrently. The cost field is written in the update branch; without ordering guarantees, the create branch could run afterwards and overwrite the value, so the update appeared to fail.
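The locking gap described above can be sketched as follows. This is a minimal illustration rather than System B's actual code: the class RefundConsumerSketch, its method names, and the in-memory map standing in for System B's table are all hypothetical, and a JVM-local lock is used only to keep the example self-contained (a multi-instance consumer would need a distributed lock or per-order message ordering instead).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for System B's consumer; all names are illustrative.
class RefundConsumerSketch {
    // One lock object per order ID so that messages for the same order
    // are handled one at a time, while different orders stay parallel.
    private final Map<Long, Object> locks = new ConcurrentHashMap<>();
    // In-memory stand-in for System B's table: orderId -> cost.
    private final Map<Long, Long> costByOrder = new ConcurrentHashMap<>();

    void onMessage(String tag, long orderId, long cost) {
        Object lock = locks.computeIfAbsent(orderId, id -> new Object());
        synchronized (lock) {
            if ("create".equals(tag)) {
                // Insert only if absent, so a late "create" cannot
                // overwrite a cost already written by "update".
                costByOrder.putIfAbsent(orderId, 0L);
            } else if ("update".equals(tag)) {
                costByOrder.put(orderId, cost);
            }
        }
    }

    Long costOf(long orderId) {
        return costByOrder.get(orderId);
    }
}
```

With this shape, an update that is processed before its create still keeps its cost, because the create branch no longer overwrites an existing row.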
3.2 Problem Reappears
Two weeks later the same failure recurred. Logs showed that System B, while consuming an update message, read stale data (cost=0) from System A. The eventual explanation was that System A had sent the message before its transaction committed, so System B queried the old value and applied it.
3.2.1 Analysis Idea 1
The first suspect was master-slave replication lag, but System A's cluster has no slave, so this was ruled out.
3.2.2 Analysis Idea 2
The transaction isolation level (repeatable-read) was checked and confirmed not to be the cause.
3.2.3 Analysis Idea 3
The architecture lead pointed out a large-transaction problem: the updateCharge(CpsOrderDO cpsOrderDO) method sends the MQ message inside a multi-layer transaction, so the message can go out before the outer transaction commits. The updateCostSuccess interface also performs three steps within the same large transaction, which delays the commit further.
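The timing problem can be reproduced in a small, deterministic simulation. The sketch below is not System A's code: StaleReadDemo, sendMq, and the map standing in for committed database state are hypothetical, and the consumer is invoked synchronously to make the ordering explicit. It only shows why a message sent before commit leads the consumer to read the old value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of "send MQ inside the transaction".
class StaleReadDemo {
    // Committed state visible to other systems: orderId -> cost.
    static final Map<Long, Long> committed = new HashMap<>();
    // What System B observed each time it consumed a message.
    static final List<Long> observedBySystemB = new ArrayList<>();

    // Stand-in for System B consuming the message and querying System A.
    static void sendMq(long orderId) {
        observedBySystemB.add(committed.getOrDefault(orderId, 0L));
    }

    // Problematic shape: write, send MQ, do more work, then commit.
    static void updateChargeInsideTransaction(long orderId, long cost) {
        Map<Long, Long> uncommitted = new HashMap<>(); // transaction-local writes
        uncommitted.put(orderId, cost);
        sendMq(orderId);               // message leaves before the commit
        // ... remaining steps of the large transaction would run here ...
        committed.putAll(uncommitted); // the commit happens last
    }

    public static void main(String[] args) {
        updateChargeInsideTransaction(42L, 300L);
        System.out.println("System B saw cost=" + observedBySystemB.get(0));
        System.out.println("System A committed cost=" + committed.get(42L));
    }
}
```

In production the consumer runs asynchronously, so the stale read happens only when consumption wins the race against the commit. The longer the transaction, the wider that window, which matches the intermittent failures described above.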
Further diagrams illustrate the flow.
3.3 Solution
Testing confirmed the diagnosis: artificially delaying the transaction commit after the MQ message is sent reproduces the issue. The fix is to send the MQ message only after the transaction has committed successfully. Since the fix was deployed, no further alerts have been observed.
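One way to picture the fix is an after-commit callback. The sketch below uses a hand-rolled TxSketch class as a stand-in for a real transaction manager; TxSketch, AfterCommitDemo, and sendMq are hypothetical names, and the article does not say which framework System A uses. In a Spring application the same effect is commonly achieved by registering a TransactionSynchronization whose afterCommit() sends the message, but whether that applies here is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for a transaction that supports after-commit callbacks.
class TxSketch {
    private final Map<Long, Long> store;                   // committed data
    private final Map<Long, Long> pending = new HashMap<>(); // uncommitted writes
    private final List<Runnable> afterCommit = new ArrayList<>();

    TxSketch(Map<Long, Long> store) {
        this.store = store;
    }

    void write(long orderId, long cost) {
        pending.put(orderId, cost);
    }

    void runAfterCommit(Runnable action) {
        afterCommit.add(action);
    }

    void commit() {
        store.putAll(pending);              // make the writes visible first
        afterCommit.forEach(Runnable::run); // then run the deferred actions
    }

    void rollback() {
        pending.clear();
        afterCommit.clear();                // deferred actions never run
    }
}

class AfterCommitDemo {
    static final Map<Long, Long> committed = new HashMap<>();
    static final List<Long> observedBySystemB = new ArrayList<>();

    // Stand-in for System B consuming the message and querying System A.
    static void sendMq(long orderId) {
        observedBySystemB.add(committed.getOrDefault(orderId, 0L));
    }

    static void updateCharge(long orderId, long cost, boolean fail) {
        TxSketch tx = new TxSketch(committed);
        tx.write(orderId, cost);
        tx.runAfterCommit(() -> sendMq(orderId)); // deferred, not sent yet
        if (fail) {
            tx.rollback();
            return;
        }
        tx.commit();
    }
}
```

With this ordering the consumer can only ever see committed data, and a rolled-back transaction sends nothing. The remaining gap is a send that fails after the commit has succeeded, which is where transactional MQ or a retry mechanism comes in.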
The complete process diagram is shown below.
4 Summary and Reflection
Experienced engineers may spot such issues quickly, but subtle concurrency problems are easy to miss with less experience. Data inconsistency typically stems from data redundancy or inadequate concurrency control. Recommendations: evaluate whether redundant data is really needed; add monitoring for critical consistency checks; thoroughly review code that combines large transactions with MQ; use transactional MQ so that the business logic and the message outcome stay consistent; and avoid wrapping whole methods in a single large transaction.
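As an illustration of the monitoring recommendation, a periodic reconciliation job could compare the cost recorded in both systems and alert on any difference. The sketch below is hypothetical: CostReconciliation and findMismatches are invented names, and the two maps stand in for whatever queries would fetch each system's orderId-to-cost data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical consistency check between System A and System B.
class CostReconciliation {
    // Returns the order IDs whose cost in System B differs from System A,
    // including orders that are missing from System B altogether.
    static List<Long> findMismatches(Map<Long, Long> costInA, Map<Long, Long> costInB) {
        List<Long> mismatched = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : costInA.entrySet()) {
            Long inB = costInB.get(entry.getKey());
            if (!entry.getValue().equals(inB)) {
                mismatched.add(entry.getKey());
            }
        }
        return mismatched;
    }
}
```

A non-empty result would trigger an alert, so an inconsistency like the one in this article surfaces within one check interval instead of being noticed through failed refunds.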
Author Introduction
Yang Ying, backend engineer at Zhuanzhuan, responsible for commercial B-side systems such as lead management, customer operations, sales operation management, and ad publishing.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.