
Practical Experience of Large‑Scale Elasticsearch Cluster Node Migration at Qunar

This article details the background, challenges, migration plan, automation, and performance‑tuning techniques used to relocate a petabyte‑scale Elasticsearch logging cluster from one data center to another, highlighting practical lessons and measurable improvements in stability and migration speed.

Qunar Tech Salon

Jun 5, 2023

In December 2020 the author joined Qunar's data platform team and became responsible for the company's ESAAS cloud service and real‑time log ELK platform, including the design of SLA rules, ES architecture upgrades, and jinkela cluster splitting.

The Qunar real‑time log platform uses an ELK architecture with the Elasticsearch (ES) cluster and Kibana in data center A and Logstash in data center B. Data center A was saturated and generated heavy cross‑site traffic, prompting a decision to migrate the entire ES cluster to data center B to improve capacity and network reliability.

Migration Architecture : Logs are collected by Filebeat/Fluent‑Bit into Kafka, then processed by Logstash and Flink into per‑appcode indices in ES. Kibana provides access via space→user→role→index‑pattern permissions.

Key Challenges :

Ensuring service availability and zero‑impact migration.

Improving migration efficiency for PB‑level data.

Initial Manual Migration (Nov) : Nodes were excluded in batches of five using the cluster setting

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data1_node1,data2_node1,..."
  }
}

This caused many relocating shards, high load during peak write periods, and increased log backlog.

Adjustments : Batch size was reduced during peak hours (2 nodes) and increased during off‑peak (5 nodes), stabilising load and backlog.
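The batch-size rule can be sketched roughly as follows. This is an illustration rather than the team's actual tooling: the peak window `PEAK_HOURS` and the helper names are assumptions, and only the setting name and the batch sizes of 2 and 5 come from the text.

```python
import json

# Assumed peak write window (hours of the day, 24h clock); the article does not state it.
PEAK_HOURS = range(9, 22)


def batch_size(hour, peak_hours=PEAK_HOURS):
    """Return how many nodes to exclude at once: 2 in peak hours, 5 otherwise."""
    return 2 if hour in peak_hours else 5


def next_batch(pending_nodes, hour):
    """Take the next batch of node names from the front of the pending list."""
    return pending_nodes[:batch_size(hour)]


def exclude_body(excluded_nodes):
    """Build the body for PUT _cluster/settings that excludes nodes by name."""
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(excluded_nodes)
        }
    }


if __name__ == "__main__":
    # Hypothetical node names, following the naming pattern used in the article.
    pending = [f"data1_node{i}" for i in range(1, 8)]
    print(json.dumps(exclude_body(next_batch(pending, hour=14)), indent=2))
```

Each PUT replaces the previous value of the setting, so the list passed to `exclude_body` should contain every node that must stay excluded, not only the newest batch.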

Automation (Nov–Jan) : Developed an automated workflow that:

Checks cluster health (green) and load thresholds.

Monitors relocating shard count.

Excludes nodes when safe.

Re‑enables nodes after shard relocation.

Automation reduced manual effort and kept migration throughput steady.
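The gating logic of such a workflow might look like the sketch below. It assumes the `status` and `relocating_shards` fields returned by GET _cluster/health; the `load` metric, its threshold, and the function names are hypothetical, since the article does not describe how load was measured.

```python
def safe_to_exclude(health, load, max_relocating=0, max_load=0.7):
    """Decide whether the next batch of nodes may be excluded.

    health: parsed JSON from GET _cluster/health
    load:   a normalised load figure from monitoring (hypothetical metric)
    """
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) <= max_relocating
        and load <= max_load
    )


def plan_step(health, load, pending_nodes, batch):
    """Return the nodes to exclude now, or an empty list meaning 'wait'."""
    if not pending_nodes or not safe_to_exclude(health, load):
        return []
    return pending_nodes[:batch]


if __name__ == "__main__":
    health = {"status": "green", "relocating_shards": 0}
    print(plan_step(health, load=0.3, pending_nodes=["data1_node1", "data2_node1"], batch=2))
```

A scheduler would call `plan_step` periodically and apply the exclusion only when the returned list is non-empty, which keeps the number of concurrently relocating shards bounded.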

Iterative Optimisations (Jan–Feb) :

Adjusted index.routing.allocation.total_shards_per_node using the formula total_shards_per_node = shard_num/(nodes_count * 0.95 * 0.5) to avoid shard skew.
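As a worked illustration of the formula, the sketch below computes the value for one index and builds the corresponding index-settings body. The article does not say what the 0.95 and 0.5 factors represent or how fractional results were rounded; the parameter names `factor_a` and `factor_b` and the choice to round up are assumptions.

```python
import math


def total_shards_per_node(shard_num, nodes_count, factor_a=0.95, factor_b=0.5):
    """Apply shard_num / (nodes_count * 0.95 * 0.5) and round up (assumed rounding)."""
    raw = shard_num / (nodes_count * factor_a * factor_b)
    # The setting takes an integer; rounding up avoids a limit of 0.
    return math.ceil(raw)


def settings_body(limit):
    """Body for PUT <index>/_settings carrying the per-node shard limit."""
    return {"index.routing.allocation.total_shards_per_node": limit}


if __name__ == "__main__":
    # Hypothetical index: 60 shards spread over 100 data nodes.
    limit = total_shards_per_node(shard_num=60, nodes_count=100)
    print(limit, settings_body(limit))
```

With 60 shards and 100 nodes the raw value is 60 / 47.5, roughly 1.26, so the rounded limit is 2 shards of that index per node.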

Set index.unassigned.node_left.delayed_timeout to 120 minutes or a random value (100–300 min) to spread recovery load.
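One way to produce such values is sketched below. It assumes the random timeout is drawn independently for each index, which the article does not state explicitly; the helper names are hypothetical.

```python
import random


def delayed_timeout(rng=random, low=100, high=300):
    """Pick a timeout between low and high minutes, formatted as an ES time value."""
    return f"{rng.randint(low, high)}m"


def delayed_timeout_body(timeout="120m"):
    """Body for PUT <index>/_settings that sets the delayed-allocation timeout."""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for index in ["log_appcode-2023.18"]:
        print(index, delayed_timeout_body(delayed_timeout(rng)))
```

Giving indices different timeouts means their replicas do not all start recovering at the same moment after a node leaves.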

Implemented single‑machine single‑node migration, moving one data node at a time, which increased batch throughput by 50‑80%.

Reduced cluster.routing.allocation.cluster_concurrent_rebalance to 0 and balanced shards manually with the reroute API:

POST _cluster/reroute
{
  "commands": [
    { "move": { "index": "log_appcode-2023.18", "shard": 59, "from_node": "data2_node1", "to_node": "data2_node10" } }
  ]
}
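A reroute body like the one above could be generated from a shard layout along these lines. The greedy planner here is an assumption for illustration only; the article does not describe how the team chose which shards to move.

```python
def plan_moves(shards_by_node, max_moves=10):
    """Greedy sketch: repeatedly move one shard from the fullest node to the emptiest.

    shards_by_node maps a node name to a list of (index, shard) tuples.
    """
    layout = {node: list(shards) for node, shards in shards_by_node.items()}
    moves = []
    while len(moves) < max_moves:
        src = max(layout, key=lambda n: len(layout[n]))
        dst = min(layout, key=lambda n: len(layout[n]))
        if len(layout[src]) - len(layout[dst]) <= 1:
            break  # already balanced to within one shard
        # A node cannot hold two copies of the same shard, so skip those.
        candidates = [s for s in layout[src] if s not in layout[dst]]
        if not candidates:
            break
        index, shard = candidates[0]
        layout[src].remove((index, shard))
        layout[dst].append((index, shard))
        moves.append((index, shard, src, dst))
    return moves


def reroute_body(moves):
    """Build the body for POST _cluster/reroute from (index, shard, from, to) tuples."""
    return {
        "commands": [
            {"move": {"index": i, "shard": s, "from_node": f, "to_node": t}}
            for i, s, f, t in moves
        ]
    }


if __name__ == "__main__":
    layout = {
        "data2_node1": [("log_appcode-2023.18", n) for n in (59, 60, 61)],
        "data2_node10": [],
    }
    print(reroute_body(plan_moves(layout)))
```

The `max_moves` cap keeps each reroute call small so that manual balancing does not itself create a burst of relocating shards.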

Migrated the coordinating and master nodes with careful host‑mapping and discovery configuration.

Results : Migration speed more than doubled, the entire node migration completed a week ahead of schedule with zero incidents, and the cluster remained stable throughout peak holiday traffic.

Strategic Takeaways :

Plan migration per node type with clear risk mitigation.

Automate wherever possible to improve efficiency and repeatability.

Deep understanding of ES internals (shard allocation, delayed allocation, rebalance) is essential for effective tuning.

Technical Highlights :

total_shards_per_node

node_left.delayed_timeout

single‑machine single‑node migration

reroute‑based shard balancing

References :

https://www.elastic.co/cn/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/allocation-total-shards.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/delayed-allocation.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/index-modules-translog.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cluster-reroute.html

https://cloud.tencent.com/developer/article/1334743

https://blog.csdn.net/laoyang360/article/details/108047071

