OceanBase Timeout During Merge: Diagnosis, Emergency Handling, and Optimization
This article details a timeout incident in an OceanBase cluster during a merge operation, explains the emergency suspension and resumption steps, analyzes log and metric data to identify queue backlog and disk I/O saturation as root causes, and offers practical optimization recommendations.
1 Problem Background
At around 04:25, the business application began reporting java.sql.SQLException: Timeout errors against the OceanBase cluster. OCP raised a large number of easy_connection_on_timeout_conn alerts. Batch SQL tasks were scheduled for this window, and the cluster was running a merge at the same time.
2 Emergency Plan
Because the batch tasks had higher priority, the merge was suspended at around 05:50, after which the batch jobs ran normally.
-- sys tenant execution
ALTER SYSTEM SUSPEND MERGE;
After the batch completed, the merge was resumed.
-- sys tenant execution
ALTER SYSTEM RESUME MERGE;
3 Problem Investigation
After the emergency actions, the root cause was investigated.
1. Check observer.log
Filter the observer log for the relevant time window:
grep -i "sending error packet" observer.log
The log showed transaction timeouts and rollbacks, reported as error codes err=-4012 (timeout) and err=-6224 (transaction rollback).
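To see which error codes dominate the window, the matches can be tallied per code. This is a minimal sketch, assuming each matching line carries its code in the form err=-NNNN, as in the codes quoted above. count_error_packets is a hypothetical helper name, not an OceanBase tool.

```shell
# Tally "sending error packet" entries per error code, most frequent first.
# Assumption: the code appears on each matching line as err=-NNNN.
count_error_packets() {
  # $1: path to an observer log file
  grep -i "sending error packet" "$1" \
    | grep -oE 'err=-[0-9]+' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Calling count_error_packets observer.log prints one line per error code with its count in the first column.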
2. Confirm Queue Backlog
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645
# optional clearer view
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep req_queue
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep multi_level_queue
The key metrics examined were req_queue:total_size, multi_level_queue:total_size, and the queue_size reported for each group_id = * entry; non-zero values indicate a backlog.
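The check for non-zero queue sizes can also be scripted so that only the backlogged samples are printed. This is a rough sketch under the assumption that the fields appear in the log as req_queue:total_size=N and multi_level_queue:total_size=N, and that each line begins with a bracketed timestamp. show_queue_backlog is a hypothetical helper name.

```shell
# Print timestamp and queue totals for "dump tenant info" lines whose queues are non-empty.
# Assumptions: fields look like "req_queue:total_size=N" and
# "multi_level_queue:total_size=N"; the first two whitespace-separated
# tokens of each line are the "[date time]" timestamp.
show_queue_backlog() {
  # $1: observer log file, $2: tenant id (for example 1001)
  grep -F " dump tenant info(tenant={id:$2" "$1" \
    | awk '
        # Return the integer following "name=" on the line, or 0 if absent.
        function field(line, name,    s) {
          if (match(line, name "=[0-9]+")) {
            s = substr(line, RSTART, RLENGTH)
            sub(/^.*=/, "", s)
            return s + 0
          }
          return 0
        }
        {
          req = field($0, "req_queue:total_size")
          mlq = field($0, "multi_level_queue:total_size")
          if (req > 0 || mlq > 0)
            print $1, $2, "req_queue=" req, "multi_level_queue=" mlq
        }'
}
```

For the log above, the call would be show_queue_backlog observer.log.20241010042645 1001. Empty output means no backlog was recorded for that tenant in the file.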
Conclusion: The direct cause of the SQL timeout was tenant queue backlog.
3. Check tsar Logs
tsar -d 20241010 -i 1
The network retransmission rate on the alert node exceeded 0.2, which contributed to the large number of easy_connection_on_timeout_conn alerts.
Utilization of disk sdb (the OB data disk) reached 100% between 04:20 and 04:30, causing I/O saturation and queue buildup.
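To narrow the tsar output to the two signals discussed here, the same day can be queried per module. This is a sketch only: it assumes tsar's io and tcp modules are enabled on the node and that the tcp module exposes a retran column. show_io_and_retrans is a hypothetical wrapper name.

```shell
# Show per-minute I/O stats for one disk and the TCP retransmission column for one day.
# Assumptions: tsar is installed, its io and tcp modules are enabled,
# and the tcp module has a "retran" field.
show_io_and_retrans() {
  # $1: date as YYYYMMDD, $2: disk name (for example sdb)
  if ! command -v tsar >/dev/null 2>&1; then
    echo "tsar not found in PATH" >&2
    return 1
  fi
  tsar --io -I "$2" -d "$1" -i 1        # I/O module, restricted to one device
  tsar --tcp -s retran -d "$1" -i 1     # TCP module, retransmission field only
}
```

For this incident the call would be show_io_and_retrans 20241010 sdb, reading the rows around 04:20 to 04:30.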
Conclusion
During the merge window, disk I/O was fully occupied. Concurrent batch jobs added further pressure, leading to queue accumulation. OceanBase’s RPC ack_timeout is set to 10 seconds; connections exceeding this are dropped, manifesting as SQL response timeouts.
4 Optimization Suggestions
Adjust daily merge schedule to avoid overlapping with batch jobs.
Merges increase disk I/O; batch tasks also consume resources, causing performance bottlenecks.
Recommend separating merge and batch operations.
Reduce batch concurrency; run tasks sequentially to lower system load.
Consider business segmentation to isolate heavy workloads such as batch, merge, and backup.
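Moving the daily merge away from the batch window is typically done by changing the merge trigger time. The sketch below only prints the statement for review rather than running it. It assumes the major_freeze_duty_time parameter governs the daily merge in the deployed version; the scope of that parameter (cluster or tenant) differs between OceanBase versions, so check it before applying. merge_schedule_sql is a hypothetical helper name.

```shell
# Print (not execute) the statement that would move the daily merge to a given time.
# Assumption: major_freeze_duty_time controls the daily merge trigger time.
merge_schedule_sql() {
  # $1: target time as HH:MM in 24-hour format
  case "$1" in
    [01][0-9]:[0-5][0-9] | 2[0-3]:[0-5][0-9])
      printf "ALTER SYSTEM SET major_freeze_duty_time = '%s';\n" "$1"
      ;;
    *)
      echo "expected HH:MM (24-hour), got: $1" >&2
      return 1
      ;;
  esac
}
```

For example, merge_schedule_sql 23:30 prints a statement that can be reviewed and then run from the sys tenant, provided the batch jobs do not run at that hour.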
References
[1] Queue field information: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000819396
Aikesheng Open Source Community
