Operations 28 min read

Mastering Elasticsearch Shard Management: From Fundamentals to 100k‑Shard Scale

This article explains Elasticsearch shard fundamentals, primary and replica roles, allocation rules, recovery and rebalance mechanisms, tuning parameters, best‑practice sizing, and presents real‑world production cases—including a 100,000‑shard cluster—along with concrete API commands for effective shard operations.

Qunar Tech Salon
Qunar Tech Salon
Qunar Tech Salon
Mastering Elasticsearch Shard Management: From Fundamentals to 100k‑Shard Scale

Work Background

Routine Elasticsearch operations repeatedly encountered stability problems, performance bottlenecks, and expansion failures. The root causes were unreasonable shard planning, default recovery parameters, and unfamiliarity with the rebalance mechanism. The following sections describe the underlying principles and production‑grade practices for managing more than 100 000 shards.

Core Shard Management Principles

Primary and Replica Shards

Each index is divided into primary shards; a replica shard is an exact copy of its primary. Primary shards handle all write operations and guarantee consistency, while replicas provide high‑availability failover and increase read throughput. The number of primary shards is fixed at index creation; changing it requires index recreation or reindexing.

Shard Allocation Rules

Co‑location rule: a primary shard and its replica never share the same node, preventing total data loss on a single node failure.

Load‑first allocation: Elasticsearch prefers nodes with fewer shards and lower CPU, memory, and disk usage.

Global balance: the cluster continuously evens shard counts across nodes to avoid hot spots.

Automatic allocation may need manual intervention; for example, after adding new nodes newly created indices can be heavily allocated to those nodes, causing overload.

Shard Recovery and Rebalance Mechanisms

Recovery Process

When a node fails or new nodes are added, many shards become unassigned. Elasticsearch automatically recovers these shards by copying data to healthy nodes and restoring replica copies.

Rebalance Process

Topology changes (node addition, removal, or failure) trigger automatic rebalance, moving shards to maintain even load. In small clusters this is beneficial, but in clusters with >100 k shards automatic rebalance often becomes a source of failures.

Common Shard States

STARTED: shard is active and ready for read/write.

UNASSIGNED: shard has no node; requires investigation (disk shortage, allocation rules, node failure).

INITIALIZING: shard is being set up after node restart or index creation.

RECOVERING: shard is syncing data from another node (common after scaling or restart).

RELOCATING: shard is moving to another node during rebalance.

Recovery Parameter Tuning

Default recovery settings are unsuitable for large clusters, leading to slow recovery or resource saturation. The following persistent settings are recommended for production clusters:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}

When both transient and persistent settings exist, transient overrides but is lost after a restart; verify consistency before changes.

Monitoring Recovery

# 1. Detailed recovery progress
GET /_cat/recovery?v=true&h=i,s,t,ty,shost,thost,f,fp,b,bp
# 2. Cluster health
GET /_cluster/health
# 3. Node disk and heap usage
GET /_cat/nodes?v&h=name,diskUsed,diskAvail,heapUsed,heapMax

Rebalance Parameter Optimization

Rebalance can be a double‑edged sword. The default all setting keeps shards evenly distributed, but frequent large‑scale rebalance consumes bandwidth and I/O, causing query timeouts.

Rebalance Switch

all

– default, all shards may move. primaries – only primary shards may move. replicas – only replica shards may move. none – disable all rebalance (recommended during maintenance, restart, or scaling).

Concurrent Rebalance Control

cluster.routing.allocation.cluster_concurrent_rebalance

defaults to 2; increase for high‑end clusters to accelerate balancing.

Allocation vs. Rebalance

cluster.routing.rebalance.enable

controls whether already allocated shards may be moved (rebalance). cluster.routing.allocation.enable controls whether new shards may be allocated or recovered.

Rebalance Configuration Example

# Disable rebalance during maintenance
PUT /_cluster/settings
{
  "transient": {"cluster.routing.rebalance.enable": "none"}
}
# Enable only primary shard movement
PUT /_cluster/settings
{
  "transient": {"cluster.routing.rebalance.enable": "primaries"}
}
# Restore defaults after maintenance
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}

Disk Watermark Tuning

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Low watermark stops allocating new shards to a node; high watermark triggers migration of existing shards; flood_stage makes all indices on the node read‑only.

Shard Strategy Best Practices

Shard Count Planning

Optimal primary shard size: 10‑30 GB for business indices, up to 50 GB for log indices.

Total shards per index should not exceed twice the number of data nodes.

Cluster‑wide shard count should stay below 100 k.

Shards per node: default ≤ 1 000, maximum recommended ≤ 3 000.

Zero‑Downtime Reindex with Alias Switch

# Create new index with desired shard count
PUT /my-index-v2
{
  "settings": {"number_of_shards": 3, "number_of_replicas": 1},
  "mappings": {"properties": { /* ... */ }}
}
# Asynchronously reindex data
POST /_reindex?wait_for_completion=false
{
  "source": {"index": "my-index"},
  "dest":   {"index": "my-index-v2"}
}
# Switch alias atomically
POST /_aliases
{
  "actions": [
    {"remove": {"index": "my-index", "alias": "my-index-alias"}},
    {"add":    {"index": "my-index-v2", "alias": "my-index-alias"}}
  ]
}

Replica Count Guidelines

Production clusters: 1 replica (default) for high availability.

Read‑heavy, write‑light workloads: increase replicas if enough nodes are available.

Offline log archives: 0 replicas can be used, accepting data‑loss risk.

Node‑Level Shard Limits

PUT /_cluster/settings
{
  "persistent": {"cluster.routing.allocation.shards_per_node": 2}
}

Setting shards_per_node to 2 on a 6‑node cluster with 6 primary shards ensures balanced primary and replica placement.

Production Case Studies

Case 1 – Post‑Expansion Load Imbalance

After expanding a 5‑node cluster to 10 nodes, old nodes remained overloaded while new nodes were idle, causing query latency.

# Increase recovery concurrency
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}
# Enable full rebalance with higher concurrency
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.rebalance.cluster_concurrent_rebalance": 8
  }
}

Monitoring shard distribution and setting shards_per_node for critical indices prevented further imbalance. The key lesson: perform expansion in stages, tune recovery parameters before adding nodes, and re‑enable rebalance only after the cluster stabilizes.

Case 2 – 100k‑Shard Super‑Large Cluster

The Qunar real‑time log platform runs an ELK stack with hundreds of nodes and a total shard count exceeding 100 k, ingesting several hundred TB per day. Major pain points included massive master‑node metadata load, uneven shard allocation, heavy rebalance traffic during node changes, oversized shards (> 100 GB), and heterogeneous disk capacities.

Parameter Hardening

// Persistent recovery settings for large clusters
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 20,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 10,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 10,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 20,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}

Higher Disk Watermarks

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "92%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

Disable Automatic Rebalance

// Permanently turn off rebalance for the super‑large cluster
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 0
  }
}

With rebalance disabled, node additions, restarts, or removals no longer trigger massive shard migrations, eliminating cluster “jitter”.

Extended delayed_timeout

PUT log_test-2026.21/_settings
{
  "index.unassigned.node_left.delayed_timeout": "50m"
}

Increasing the timeout from the default 1 minute gives nodes more time to recover without causing unnecessary shard relocations.

Index Template Management

Standard log indices use a common template log_default (e.g., 12 primary shards, 0 replicas). Weekly jobs adjust total_shards_per_node, shard counts, and refresh intervals. For high‑volume indices (e.g., 40 TB on 550 nodes) a specialized template sets number_of_shards to 820 and total_shards_per_node to 2.

PUT _template/log_test_template
{
  "order": 99,
  "index_patterns": ["log_test-*"],
  "settings": {
    "index.routing.allocation.total_shards_per_node": "2",
    "index.refresh_interval": "90s",
    "index.number_of_shards": "820",
    "index.number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "reqSize": {"type": "long"},
      "gid": {"type": "keyword", "ignore_above": 256}
    }
  }
}

Custom Offline Shard Scheduler

After disabling native rebalance, a nightly low‑peak scheduler performs controlled shard migrations:

Balance newly created indices by calculating average shards per node and moving excess shards to low‑load nodes.

During low‑peak windows, move hot shards from high‑load nodes to idle nodes, and relocate historical indices from low‑capacity disks (9 TB, 14 TB) to larger disks.

# Example manual move command
POST /_cluster/reroute
{
  "commands": [{
    "move": {
      "index": "target-log-index",
      "shard": 0,
      "from_node": "high_load_node",
      "to_node": "idle_node"
    }
  }]
}

Automation scripts execute these steps, achieving zero‑downtime rebalancing and eliminating instability caused by native aggressive rebalance.

Observed Benefits

Master‑node CPU usage dropped to <10 % by reducing metadata churn.

Node addition, rolling restarts, and removals no longer cause large‑scale shard migrations.

Disk utilization reached near‑maximum while staying within safe watermarks.

Shard recovery time decreased by ~60 %.

High‑Frequency Operational Commands

Cluster Monitoring

# List all shard allocations
GET /_cat/shards
# Show unassigned shards with reasons
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason

Manual Shard Migration

# Move a specific shard
POST /_cluster/reroute
{
  "commands": [{
    "move": {"index": "my-index", "shard": 0, "from_node": "node1", "to_node": "node2"}
  }]
}
# Allocate a replica shard
POST /_cluster/reroute
{
  "commands": [{
    "allocate_replica": {"index": "my-index", "shard": 0, "node": "node3"}
  }]
}

Summary

Effective Elasticsearch shard management requires a solid understanding of primary/replica roles, allocation policies, recovery and rebalance mechanisms, and the impact of configuration parameters at scale. For small‑to‑medium clusters the default settings suffice, but once shard counts approach the ten‑thousand‑level, automatic mechanisms become sources of instability. Hardening recovery settings, raising disk watermarks, disabling automatic rebalance, and employing a custom offline scheduler restore stability, improve performance, and enable predictable growth for super‑large log platforms.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

elasticsearchperformance tuningLarge ScaleCluster OperationsShard Management
Qunar Tech Salon
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.