
OceanBase Timeout During Merge: Diagnosis, Emergency Handling, and Optimization

This article details a timeout incident in an OceanBase cluster during a merge operation, explains the emergency suspension and resumption steps, analyzes log and metric data to identify queue backlog and disk I/O saturation as root causes, and offers practical optimization recommendations.

Aikesheng Open Source Community

1 Problem Background

At around 04:25, the OceanBase cluster reported a java.sql.SQLException: Timeout error on the business application side. OCP alerts showed a large number of easy_connection_on_timeout_conn warnings. The batch SQL tasks were scheduled during this period, but the cluster was performing a merge operation.

2 Emergency Plan

Because the batch tasks had higher business priority, the merge was suspended at around 05:50, after which the batch jobs resumed normally.

-- sys tenant execution
ALTER SYSTEM SUSPEND MERGE;
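Before handing the cluster back to the batch jobs, it can help to confirm the merge is actually suspended. A minimal sketch, assuming OceanBase 4.x, where the oceanbase.CDB_OB_MAJOR_COMPACTION view exposes STATUS and IS_SUSPENDED (3.x reports merge state differently); host, port, and credentials are placeholders:

```shell
# Query per-tenant major compaction state; IS_SUSPENDED=YES confirms the
# suspend took effect. View name assumes OceanBase 4.x; connection flags
# are placeholders for your cluster.
mysql -h127.0.0.1 -P2881 -uroot@sys -N -e \
  "SELECT TENANT_ID, STATUS, IS_SUSPENDED FROM oceanbase.CDB_OB_MAJOR_COMPACTION;" \
  | awk '{ print "tenant " $1 ": " $2 " suspended=" $3 }'
```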

After the batch completed, the merge was resumed.

-- sys tenant execution
ALTER SYSTEM RESUME MERGE;

3 Problem Investigation

After the emergency actions, the root cause was investigated.

1. Check observer.log

Filtered the observer log for the relevant time window:

grep -i "sending error packet" observer.log

The log showed entries indicating transaction timeout and rollback; error codes -4012 and -6224 correspond to these two conditions.
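To narrow the incident window, the matching lines can be bucketed by minute. A sketch assuming the standard observer.log timestamp prefix of the form [YYYY-MM-DD HH:MM:SS.ffffff]:

```shell
# Count "sending error packet" lines per minute to locate the error burst.
# Assumes each log line starts with "[YYYY-MM-DD HH:MM:SS.ffffff]".
grep 'sending error packet' observer.log \
  | sed 's/^\[\([0-9-]* [0-9][0-9]:[0-9][0-9]\).*/\1/' \
  | sort | uniq -c | sort -rn | head
```

The busiest minutes rise to the top, which should line up with the 04:20-04:30 window seen in the disk metrics.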

2. Confirm Queue Backlog

grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645
# Optional clearer view: break each record onto one field per line
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep req_queue
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep multi_level_queue

Key metrics such as req_queue:total_size, multi_level_queue:total_size, group_id = *, and queue_size were examined (see the queue field reference [1]); non-zero values indicated a request backlog.
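The queue-size fields can also be extracted directly. A sketch assuming the dump format documented by OceanBase (the exact field layout may vary across versions):

```shell
# Pull queue-size fields out of the tenant dump lines; sustained non-zero
# total_size values mean requests arrive faster than workers drain them.
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 \
  | grep -oE '(req_queue|multi_level_queue):total_size=[0-9]+' \
  | sort | uniq -c
```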

Conclusion: The direct cause of the SQL timeout was tenant queue backlog.

3. Check tsar Logs

tsar -d 20241010 -i 1

The network retransmission rate on the alerting node exceeded 0.2, which explains the large number of easy_connection_on_timeout_conn alerts.

Utilization of disk sdb (the OB data disk) reached 100% between 04:20 and 04:30, causing I/O saturation and queue buildup.
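Since tsar reads historical samples, a live cross-check with iostat can confirm saturation as it happens. A sketch assuming a common sysstat build, where %util is the last column of "iostat -x" output:

```shell
# Sample extended disk stats twice at 1s intervals and print utilization of
# the OB data disk (sdb here); values pinned near 100% confirm saturation.
iostat -dx 1 2 | awk '$1 == "sdb" { print $1, "util=" $NF "%" }'
```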

Conclusion

During the merge window, disk I/O was fully occupied, and the concurrent batch jobs added further pressure, leading to queue accumulation. OceanBase's RPC ack_timeout is set to 10 seconds; connections that exceed this threshold are dropped, which manifests as SQL response timeouts.

4 Optimization Suggestions

1. Adjust the daily merge schedule so it does not overlap with batch jobs. Merges increase disk I/O, and batch tasks consume the same resources, so running them together creates a bottleneck; separate the merge and batch windows.

2. Reduce batch concurrency, running tasks sequentially where possible to lower system load.

3. Consider business segmentation to isolate heavy workloads such as batch, merge, and backup from one another.
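As a sketch of the first suggestion, assuming the major_freeze_duty_time parameter controls the daily merge start time (verify the parameter's scope on your OceanBase version; it is tenant-level in some releases) and with placeholder connection details:

```shell
# Move the daily merge start away from the batch window. The target time
# '01:00' and the connection flags are illustrative placeholders; confirm
# major_freeze_duty_time's scope on your OceanBase version before applying.
mysql -h127.0.0.1 -P2881 -uroot@sys -e \
  "ALTER SYSTEM SET major_freeze_duty_time = '01:00';"
```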

References

[1] Queue field information: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000819396

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise-grade MySQL open-source tools and services, releases a premium open-source component each year (1024), and continuously operates and maintains them.