Investigation and Resolution of Elasticsearch node_concurrent_recoveries Performance Issue
The team traced read‑request timeouts to a single overloaded Elasticsearch node where an excessively high node_concurrent_recoveries setting caused many simultaneous shard recoveries and disk‑watermark‑driven relocations, and resolved the issue by lowering concurrent recoveries, enabling adaptive replica selection, and adjusting allocation settings.
This article documents a complete troubleshooting process for a performance problem in an Elasticsearch cluster caused by the node_concurrent_recoveries setting.
Fault description
Business read requests started timing out around 19:30. No recent releases or traffic spikes were observed, and the cluster had more than 30 data nodes.
Environment
Elasticsearch version 6.x
Cluster size: >30 data nodes
Initial investigation
Because Elasticsearch fans a query out across shard copies and merges the results, overall query latency is governed by the slowest shard. The team therefore first checked whether the issue was cluster-wide or confined to a single instance.
Monitoring graphs showed that one node (referred to as instance A) exhibited abnormal metrics starting at 19:30:
es.node.threadpool.search.queue reached ~1000 (queue full)
es.node.threadpool.search.rejected peaked over 100
es.node.threadpool.search.completed grew due to client retries
Instance A’s es.node.threadpool.search.completed was >50% higher than on other nodes, indicating that it hosted hotspot indices.
CPU usage ( es.node.threadpool.cpu.percent ) increased by >50%.
Search query time metrics ( es.node.indices.search.querytime , es.node.indices.search.querytimeinmillis ) also rose.
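Per-node search thread-pool pressure of this kind can be inspected directly with the `_cat/thread_pool` API (the host address and column selection here are illustrative, not from the original investigation):

```shell
# Compare search thread-pool activity across nodes; a single node
# with a full queue and many rejections points to a hotspot.
curl -s -XGET 'localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed&s=rejected:desc'
```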
Further analysis pointed to machine Z (hosting instance A) where both CPU and disk I/O surged during the incident.
CPU usage on machine Z exceeded 2000% (out of a theoretical maximum of 3200% on 32 cores), confirming severe CPU saturation.
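One way to see what a saturated node is actually spending CPU on is the hot threads API (the node name below is a placeholder, not the real instance identifier):

```shell
# Sample the busiest threads on the suspect node; recovery and
# relocation work shows up here alongside search threads.
curl -s -XGET 'localhost:9200/_nodes/instance-A/hot_threads?threads=5'
```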
Root‑cause analysis
Two possibilities were considered: excessive concurrency or long‑running tasks.
Concurrency was ruled out because business traffic was stable and search.completed growth was only due to retries.
Long‑running tasks were investigated using the _cat/tasks API:
curl -XGET '/_cat/tasks?v&s=store' -s | grep A

Most of the long-running tasks were shard relocation operations. The cluster settings showed:

```json
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_recoveries": "5",
          "enable": "all"
        }
      }
    }
  }
}
```

A high node_concurrent_recoveries value allowed many shard recoveries to run simultaneously, placing heavy CPU load on the recovering node.
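The effective value of this setting, including unset defaults, can be confirmed with, for example:

```shell
# Show allocation settings (including defaults) as flat keys
# so the effective recovery concurrency is easy to grep for.
curl -s -XGET 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep concurrent_recoveries
```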
Disk usage had also exceeded the high watermark (90%), triggering automatic shard relocation:

```
[xxxx-xx-xxT19:43:28,389][WARN][o.e.c.r.a.DiskThresholdMonitor] [master] high disk watermark [90%] exceeded on [ZcphiDnnStCYQXqnc_3Exg][A][/xxxx/data/nodes/0] free: xxxgb[9.9%], shards will be relocated away from this node
```

The combination of aggressive shard recovery and disk-watermark-driven relocations thus overloaded instance A.
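Per-node disk headroom, which drives the watermark decisions, can be checked with the `_cat/allocation` API (host address illustrative):

```shell
# Disk used/available per data node; nodes above the 90% high
# watermark will have shards relocated away automatically.
curl -s -XGET 'localhost:9200/_cat/allocation?v&s=disk.percent:desc'
```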
Solution
1. Verify the hypothesis by excluding the problematic node from shard allocation:

```shell
curl -XPUT /_cluster/settings?pretty -H 'Content-Type:application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "xx.xx.xx.xx"
  }
}'
```

After exclusion, request timeouts dropped dramatically.
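Once the node has been remediated, the exclusion can be lifted by resetting the filter to null (standard Elasticsearch behavior; the host address is illustrative):

```shell
# Re-include the node in shard allocation after the fix is in place.
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}'
```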
2. Apply permanent fixes:

Reduce cluster.routing.allocation.node_concurrent_recoveries (the default is 2). Adjust cautiously and monitor CPU, I/O, and network.

Enable adaptive replica selection:

```shell
curl -XPUT /_cluster/settings?pretty -H 'Content-Type:application/json' -d '{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.use_adaptive_replica_selection": true
  }
}'
```

3. Consider scaling out or migrating instances to relieve pressure.
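While tuning these settings, in-flight shard recoveries and their progress can be watched with the `_cat/recovery` API (column selection illustrative):

```shell
# Show only shard recoveries currently in flight, with progress,
# to confirm the lowered concurrency limit is taking effect.
curl -s -XGET 'localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,bytes_percent'
```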
Summary
The incident was caused by mis‑configured shard allocation parameters that led to excessive concurrent recoveries, overwhelming CPU on a single node and triggering disk‑watermark‑driven shard migrations. Proper monitoring, limiting concurrent recoveries, and enabling adaptive replica selection resolved the issue and restored stable service.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.