
Investigation and Resolution of Elasticsearch node_concurrent_recoveries Performance Issue

The team traced read‑request timeouts to a single overloaded Elasticsearch node where an excessively high node_concurrent_recoveries setting caused many simultaneous shard recoveries and disk‑watermark‑driven relocations, and resolved the issue by lowering concurrent recoveries, enabling adaptive replica selection, and adjusting allocation settings.

vivo Internet Technology

This article documents a complete troubleshooting process for a performance problem in an Elasticsearch cluster caused by the node_concurrent_recoveries setting.

Fault description

Business read requests started timing out around 19:30. No recent releases or traffic spikes were observed, and the cluster had more than 30 data nodes.

Environment

Elasticsearch version 6.x

Cluster size: >30 data nodes

Initial investigation

Because Elasticsearch routes queries to shard instances, the overall query latency is determined by the slowest shard. The team first checked whether the issue was cluster‑wide or limited to a single instance.
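A quick way to make that cluster-wide-vs-single-node distinction is to compare search thread-pool pressure across nodes with the `_cat/thread_pool` API. This is a sketch; `localhost:9200` is a placeholder for any coordinating node, not an address from the original incident:

```shell
# Compare search thread-pool load per node; one node with a full queue
# and non-zero rejections points to a node-local problem rather than a
# cluster-wide one. Sort by rejections, worst first.
curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected&s=rejected:desc'
```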

Monitoring graphs showed that one node (referred to as instance A) exhibited abnormal metrics starting at 19:30:

es.node.threadpool.search.queue reached ~1000 (queue full)

es.node.threadpool.search.rejected peaked over 100

es.node.threadpool.search.completed grew due to client retries

Instance A’s es.node.threadpool.search.completed was >50% higher than on other nodes, suggesting hotspot indices concentrated on that node.

CPU usage (es.node.threadpool.cpu.percent) increased by >50%.

Search query time metrics (es.node.indices.search.querytime, es.node.indices.search.querytimeinmillis) also rose.
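The per-node search timing figures behind those metrics can be pulled directly from the node stats API. A minimal sketch (the host is a placeholder): dividing `query_time_in_millis` by `query_total` for each node gives a rough mean query latency to compare against the suspect node:

```shell
# Per-node search stats: query_total, query_time_in_millis, and current
# in-flight queries, one block per node under the "nodes" key.
curl -s 'localhost:9200/_nodes/stats/indices/search?pretty'
```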

Further analysis pointed to machine Z (hosting instance A) where both CPU and disk I/O surged during the incident.

CPU usage on machine Z exceeded 2000% (out of a theoretical maximum of 3200% on a 32-core machine), confirming severe CPU saturation.

Root‑cause analysis

Two possibilities were considered: excessive concurrency or long‑running tasks.

Concurrency was ruled out because business traffic was stable and search.completed growth was only due to retries.

Long‑running tasks were investigated using the _cat/tasks API:

curl -XGET '/_cat/tasks?v&s=running_time' -s | grep A

Most long tasks were shard relocation operations. The cluster settings showed:

{
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_recoveries": "5",
          "enable": "all"
        }
      }
    }
  }
}
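Settings like the one above can be inspected at any time via the cluster settings API; adding `include_defaults` also shows values that were never explicitly overridden (host placeholder, not from the original incident):

```shell
# Dump persistent, transient, and (optionally) default cluster settings;
# node_concurrent_recoveries appears under cluster.routing.allocation.
curl -s 'localhost:9200/_cluster/settings?pretty&include_defaults=true'
```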

High node_concurrent_recoveries allowed many shard recoveries to run simultaneously, causing heavy CPU load on the recovering node.

Disk usage also exceeded the high disk watermark (90%), triggering automatic shard relocation:

[xxxx-xx-xxT19:43:28,389][WARN][o.e.c.r.a.DiskThresholdMonitor] [master] high disk watermark [90%] exceeded on [ZcphiDnnStCYQXqnc_3Exg][A][/xxxx/data/nodes/0] free: xxxgb[9.9%], shards will be relocated away from this node
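Whether any node is approaching the high watermark can be checked without waiting for the warning log, using `_cat/allocation`. A sketch with a placeholder host:

```shell
# Per-node shard count and disk usage; nodes whose disk.percent nears the
# 90% high watermark will start shedding shards to other nodes.
curl -s 'localhost:9200/_cat/allocation?v&s=disk.percent:desc'
```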

Thus, the combination of aggressive shard recovery and disk‑watermark‑driven relocations overloaded instance A.

Solution

1. Verify the hypothesis by excluding the problematic node from the cluster:

curl -XPUT /_cluster/settings?pretty -H 'Content-Type:application/json' -d '{
  "transient":{
    "cluster.routing.allocation.exclude._ip": "xx.xx.xx.xx"
  }
}'

After exclusion, request timeouts dropped dramatically.
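Once the node has stabilized, the exclusion should be rolled back so shards can be allocated to it again. A sketch, assuming the same document conventions (placeholder host and IP); setting a transient setting to `null` removes it:

```shell
# Clear the temporary IP exclusion applied during the incident.
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}'
```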

2. Apply permanent fixes:

Reduce cluster.routing.allocation.node_concurrent_recoveries (default is 2). Adjust cautiously and monitor CPU, I/O, and network.

Enable adaptive replica selection:

curl -XPUT /_cluster/settings?pretty -H 'Content-Type:application/json' -d '{
  "transient":{
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.use_adaptive_replica_selection": true
  }
}'

3. Consider scaling out or migrating instances to relieve pressure.
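After lowering the recovery limit or migrating instances, the effect can be verified by watching in-flight shard recoveries (host placeholder as above):

```shell
# Show only recoveries still in progress; with node_concurrent_recoveries
# reduced, no node should list more simultaneous recoveries than the limit.
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'
```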

Summary

The incident was caused by mis‑configured shard allocation parameters that led to excessive concurrent recoveries, overwhelming CPU on a single node and triggering disk‑watermark‑driven shard migrations. Proper monitoring, limiting concurrent recoveries, and enabling adaptive replica selection resolved the issue and restored stable service.

Tags: performance, Elasticsearch, Cluster, troubleshooting, CPU, Disk Watermark, Shard Allocation
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
