Troubleshooting OceanBase No-Leader Alerts Caused by Network Bandwidth Saturation
This article walks through a step‑by‑step investigation of daily OceanBase no‑leader alerts caused by network bandwidth saturation. It covers log analysis, clock synchronization warnings, and RPC backlog, then presents practical fixes, namely bandwidth expansion and backup throttling, to restore cluster stability.
1 Problem Description
A production OceanBase cluster generates a "no leader" alarm around 07:00 each day, accompanied by brief service timeouts. The OCP alarm keyword is "eg leader lease is expired".
2 Analysis Process
2.1 Common Checks for No‑Leader Situation
Refer to the official documentation and first confirm that replicas are indeed in a no‑leader state, then investigate the following possible causes:
observer.log contains obvious error messages.
Clock drift.
Deleted tenant, table, or partition.
Majority of replicas down.
Network issues.
Clog module fails to recover logs.
High load.
Clog disk full.
2.2 Check RS Logs
Search the Root Service log for the keyword "clock between rs and server not sync":
grep "clock between rs and server not sync" rootservice.log.20240613072655
The result shows a warning indicating a clock mismatch between the RS node and the server.
2.3 Check observer.log
Verify clock desynchronization in observer.log:
grep -i "clock diff time is too large" observer.log.20240613070304
The warning confirms a large clock difference, prompting a check of network bandwidth pressure during the alarm period.
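To see whether the clock warnings cluster around the alarm window, the matches can be bucketed per minute. The following is a minimal sketch, not part of the original investigation. It assumes each observer.log line starts with a timestamp of the form [YYYY-MM-DD HH:MM:SS.ffffff]; the function name clock_warnings_per_minute is hypothetical.

```shell
# Count "clock diff time is too large" warnings per minute.
# Reads log text on stdin and prints "YYYY-MM-DD HH:MM <count>" lines.
# Assumption: every log line begins with "[YYYY-MM-DD HH:MM:SS.ffffff]".
clock_warnings_per_minute() {
  awk 'tolower($0) ~ /clock diff time is too large/ {
         minute = substr($0, 2, 16)   # skip "[" and keep "YYYY-MM-DD HH:MM"
         count[minute]++
       }
       END { for (m in count) print m, count[m] }' | sort
}

# Usage (file name taken from the article):
#   clock_warnings_per_minute < observer.log.20240613070304
```

If the warnings only appear in the minutes around 07:00, that points to a transient cause such as congestion rather than a persistent clock drift.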
2.4 Examine tsar Logs
Run:
tsar -d 20240613 -i 1
The output shows outbound network traffic roughly ten times higher than normal.
2.5 Check RPC Message Backlog
Search for large "request doing" values in observer.log to detect RPC backlog:
grep 'RPC EASY STAT' observer.log.20240613070304 | awk -F 'request doing=' '{print $2}'
Some values reach the thousands, indicating significant RPC message accumulation.
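Scanning the raw output by eye is tedious, so the same pipeline can be extended to print only values above a threshold. This is a sketch under the assumption that each RPC EASY STAT line contains a single "request doing=<number>" field; the function name rpc_backlog and the default threshold of 1000 are illustrative choices, not values from OceanBase.

```shell
# Print "request doing" values that exceed a threshold.
# Reads observer.log text on stdin; $1 is the threshold (default 1000).
# Assumption: the field appears as "request doing=<number>" on each line.
rpc_backlog() {
  threshold="${1:-1000}"
  grep 'RPC EASY STAT' |
    awk -F 'request doing=' -v t="$threshold" '
      NF > 1 {
        v = $2 + 0            # numeric prefix of the text after the key
        if (v > t) print v
      }'
}

# Usage (file name taken from the article):
#   rpc_backlog 1000 < observer.log.20240613070304
```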
2.6 Verify Network Interfaces
Use ip link or ifconfig to list interfaces and confirm that bond0 and bond1 are independent NICs.
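On Linux, the bonding driver also exposes each bond's member NICs under /proc/net/bonding/, which can help confirm which physical interfaces sit behind bond0 and bond1. The helper below is a small sketch; the function name bond_slaves is hypothetical, and it only parses the "Slave Interface:" lines of that status file.

```shell
# List the member (slave) interfaces of a bond.
# $1: path to a bonding status file, e.g. /proc/net/bonding/bond0
bond_slaves() {
  awk -F': ' '/^Slave Interface:/ { print $2 }' "$1"
}

# Usage:
#   bond_slaves /proc/net/bonding/bond0
#   bond_slaves /proc/net/bonding/bond1
```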
2.7 Check NIC Speed
Run:
ethtool bond0
The speed is reported as 10000Mb/s, confirming a 10 Gbps NIC.
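With the link speed known, the outbound rate seen during the alarm window can be turned into a utilization percentage. The sketch below assumes the traffic figure is available in bytes per second; the function names link_speed_mbps and link_util_pct are hypothetical helpers, not part of ethtool or tsar.

```shell
# Extract the link speed in Mb/s from `ethtool <iface>` output on stdin.
link_speed_mbps() {
  awk '/Speed:/ { sub(/Mb\/s/, "", $2); print $2 }'
}

# Compute link utilization as a percentage.
# $1: traffic in bytes per second, $2: link speed in Mb/s
link_util_pct() {
  awk -v bps="$1" -v mbps="$2" \
    'BEGIN { printf "%.1f\n", bps * 8 / (mbps * 1000000) * 100 }'
}

# Usage:
#   speed=$(ethtool bond0 | link_speed_mbps)
#   link_util_pct 1250000000 "$speed"
```

On a 10000 Mb/s link, 1,250,000,000 bytes per second corresponds to 100% utilization, so sustained values close to that figure mean the link is saturated.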
2.8 Verify Routing
Execute:
ip routeThe routing table is correct.
Conclusion: The daily backup traffic saturates the network bandwidth, which delays RPC messages and causes the clock synchronization checks to fail. This in turn leads to lease expiration and the no‑leader condition.
3 Solution
3.1 Expand Bandwidth
Increase network bandwidth to alleviate backup‑time pressure.
3.2 Backup Rate Limiting
Reduce the backup's network footprint by adjusting backup_net_limit (0 means no limit) and backup_concurrency (default 10). Setting backup_concurrency to 1 makes the backup take longer but mitigates the no‑leader issue.
Do not modify sys_bkgd_net_percentage as it throttles all observer traffic.
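As a rough illustration, the two parameters could be applied with ALTER SYSTEM SET statements sent through a MySQL-compatible client. This is a sketch under assumptions: the helper name backup_throttle_sql is hypothetical, the '500M' value and its format are examples only, and the connection variables in the usage comment are placeholders. Check the parameter reference for your OceanBase version before applying anything.

```shell
# Print the SQL statements that throttle backup traffic.
# $1: backup_concurrency value, $2: backup_net_limit value
# Assumption: backup_net_limit accepts a capacity string such as '500M'.
backup_throttle_sql() {
  printf "ALTER SYSTEM SET backup_concurrency = %s;\n" "$1"
  printf "ALTER SYSTEM SET backup_net_limit = '%s';\n" "$2"
}

# Usage (OB_HOST, OB_PORT and OB_USER are placeholders for your environment):
#   backup_throttle_sql 1 500M | mysql -h "$OB_HOST" -P "$OB_PORT" -u "$OB_USER" -p
```

Printing the statements first, rather than executing them directly, makes it easy to review the change before it reaches the cluster.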
Reference
[1] No‑leader troubleshooting: https://www.oceanbase.com/docs/enterprise-oceanbase-database-cn-10000000000360700
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.