
OceanBase Timeout During Merge: Diagnosis, Emergency Handling, and Optimization

This article details a timeout incident in an OceanBase cluster during a merge operation, explains the emergency suspension and resumption steps, analyzes log and metric data to identify queue backlog and disk I/O saturation as root causes, and offers practical optimization recommendations.

Aikesheng Open Source Community

1 Problem Background

At around 04:25, the OceanBase cluster reported a java.sql.SQLException: Timeout error on the business application side. OCP alerts showed a large number of easy_connection_on_timeout_conn warnings. The batch SQL tasks were scheduled during this period, but the cluster was performing a merge operation.

2 Emergency Plan

Because the batch tasks had higher business priority, the merge was suspended at around 05:50, after which the batch jobs resumed normally.

-- sys tenant execution
ALTER SYSTEM SUSPEND MERGE;
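Before handing the cluster back to the batch jobs, it can help to confirm the merge is actually suspended. A minimal sketch, assuming OceanBase 4.x, where the oceanbase.CDB_OB_MAJOR_COMPACTION view exposes STATUS and IS_SUSPENDED (3.x reports merge state differently); host, port, and credentials are placeholders:

```shell
# Query per-tenant major compaction state; IS_SUSPENDED=YES confirms the
# suspend took effect. View name assumes OceanBase 4.x; connection flags
# are placeholders for your cluster.
mysql -h127.0.0.1 -P2881 -uroot@sys -N -e \
  "SELECT TENANT_ID, STATUS, IS_SUSPENDED FROM oceanbase.CDB_OB_MAJOR_COMPACTION;" \
  | awk '{ print "tenant " $1 ": " $2 " suspended=" $3 }'
```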

After the batch completed, the merge was resumed.

-- sys tenant execution
ALTER SYSTEM RESUME MERGE;

3 Problem Investigation

After the emergency actions, the root cause was investigated.

1. Check observer.log

Filtered the observer log for the relevant time window:

grep -i "sending error packet" observer.log

The log showed entries indicating transaction timeout and rollback; error codes -4012 and -6224 correspond to these two conditions.
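To narrow the incident window, the matching lines can be bucketed by minute. A sketch assuming the standard observer.log timestamp prefix of the form [YYYY-MM-DD HH:MM:SS.ffffff]:

```shell
# Count "sending error packet" lines per minute to locate the error burst.
# Assumes each log line starts with "[YYYY-MM-DD HH:MM:SS.ffffff]".
grep 'sending error packet' observer.log \
  | sed 's/^\[\([0-9-]* [0-9][0-9]:[0-9][0-9]\).*/\1/' \
  | sort | uniq -c | sort -rn | head
```

The busiest minutes rise to the top, which should line up with the 04:20-04:30 window seen in the disk metrics.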

2. Confirm Queue Backlog

grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645
# Optional clearer view: break each record onto one field per line
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep req_queue
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep multi_level_queue

Key metrics such as req_queue:total_size, multi_level_queue:total_size, group_id = *, and queue_size were examined (see the queue field reference [1]); non-zero values indicated a request backlog.
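The queue-size fields can also be extracted directly. A sketch assuming the dump format documented by OceanBase (the exact field layout may vary across versions):

```shell
# Pull queue-size fields out of the tenant dump lines; sustained non-zero
# total_size values mean requests arrive faster than workers drain them.
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 \
  | grep -oE '(req_queue|multi_level_queue):total_size=[0-9]+' \
  | sort | uniq -c
```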

Conclusion: The direct cause of the SQL timeout was tenant queue backlog.

3. Check tsar Logs

tsar -d 20241010 -i 1

The network retransmission rate on the alerting node exceeded 0.2, which explains the large number of easy_connection_on_timeout_conn alerts.

Utilization of disk sdb (the OB data disk) reached 100% between 04:20 and 04:30, causing I/O saturation and queue buildup.
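Since tsar reads historical samples, a live cross-check with iostat can confirm saturation as it happens. A sketch assuming a common sysstat build, where %util is the last column of "iostat -x" output:

```shell
# Sample extended disk stats twice at 1s intervals and print utilization of
# the OB data disk (sdb here); values pinned near 100% confirm saturation.
iostat -dx 1 2 | awk '$1 == "sdb" { print $1, "util=" $NF "%" }'
```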

Conclusion

During the merge window, disk I/O was fully occupied, and the concurrent batch jobs added further pressure, leading to queue accumulation. OceanBase's RPC ack_timeout is set to 10 seconds; connections that exceed this threshold are dropped, which manifests as SQL response timeouts.

4 Optimization Suggestions

1. Adjust the daily merge schedule so it does not overlap with batch jobs. Merges increase disk I/O, and batch tasks consume the same resources, so running them together creates a bottleneck; separate the merge and batch windows.

2. Reduce batch concurrency, running tasks sequentially where possible to lower system load.

3. Consider business segmentation to isolate heavy workloads such as batch, merge, and backup from one another.
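As a sketch of the first suggestion, assuming the major_freeze_duty_time parameter controls the daily merge start time (verify the parameter's scope on your OceanBase version; it is tenant-level in some releases) and with placeholder connection details:

```shell
# Move the daily merge start away from the batch window. The target time
# '01:00' and the connection flags are illustrative placeholders; confirm
# major_freeze_duty_time's scope on your OceanBase version before applying.
mysql -h127.0.0.1 -P2881 -uroot@sys -e \
  "ALTER SYSTEM SET major_freeze_duty_time = '01:00';"
```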

References

[1] Queue field information: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000819396

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise-grade MySQL open-source tools and services, releases a premium open-source component each year (1024), and continuously operates and maintains them.