
Practical Experience of Large‑Scale Elasticsearch Cluster Node Migration at Qunar

This article details the background, challenges, migration plan, automation, and performance‑tuning techniques used to relocate a petabyte‑scale Elasticsearch logging cluster from one data center to another, highlighting practical lessons and measurable improvements in stability and migration speed.

Qunar Tech Salon

In December 2020 the author joined Qunar's data platform team and became responsible for the company's ESAAS cloud service and real‑time log ELK platform, including the design of SLA rules, ES architecture upgrades, and jinkela cluster splitting.

The Qunar real‑time log platform uses an ELK architecture, with the Elasticsearch (ES) cluster and Kibana in data center A and Logstash in data center B. Data center A was at capacity, and splitting the pipeline across two sites generated heavy cross‑site traffic, so the team decided to migrate the entire ES cluster to data center B to gain capacity and improve network reliability.

Migration Architecture : Logs are collected by Filebeat/Fluent‑Bit into Kafka, then processed by Logstash and Flink into per‑appcode indices in ES. Kibana provides access via space→user→role→index‑pattern permissions.

Key Challenges : Keeping the service available, with zero impact on users, while the migration was in progress, and moving PB‑level data efficiently.

Initial Manual Migration (Nov) : Data nodes were drained in batches of five by adding their names to the cluster‑level allocation exclusion setting:

    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.exclude._name": "data1_node1,data2_node1,..."
      }
    }

Draining five nodes at once left many shards relocating at the same time, which drove up load during peak write periods and increased the log backlog.
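The batch exclusion described above can be sketched in Python. This is only an illustration: the helper names `chunk` and `build_exclude_body` are hypothetical, and the node names are the placeholders from the request shown in the article. The sketch only builds the request body and does not contact a cluster.

```python
import json


def chunk(nodes, size):
    """Split a list of node names into consecutive batches of at most `size`."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def build_exclude_body(node_names):
    """Build the body for PUT _cluster/settings that excludes the given nodes.

    The setting takes one comma-separated string of node names.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


if __name__ == "__main__":
    # Placeholder node names, for illustration only.
    nodes = ["data1_node1", "data2_node1", "data3_node1"]
    for batch in chunk(nodes, 5):
        print(json.dumps(build_exclude_body(batch), indent=2))
```

Because the setting holds a single value, each update replaces the previous list. Whether nodes from earlier batches stay listed is left to the caller in this sketch.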

Adjustments : Batch size was reduced during peak hours (2 nodes) and increased during off‑peak (5 nodes), stabilising load and backlog.
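A minimal sketch of the time‑dependent batch size follows. The article gives the two batch sizes (2 and 5) but not the peak window, so the `PEAK_HOURS` range below is an assumption chosen purely for illustration, and `batch_size_for` is a hypothetical helper name.

```python
from datetime import datetime

# Assumption: peak write hours run from 09:00 up to (not including) 22:00.
# The article does not state the actual window.
PEAK_HOURS = range(9, 22)


def batch_size_for(hour, peak_hours=PEAK_HOURS, peak_size=2, off_peak_size=5):
    """Return how many nodes to exclude in one batch for the given hour (0-23)."""
    return peak_size if hour in peak_hours else off_peak_size


if __name__ == "__main__":
    now = datetime.now()
    print(f"hour={now.hour} batch_size={batch_size_for(now.hour)}")
```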

Automation (Nov–Jan) : Developed an automated workflow that:
- Checks cluster health (green) and load thresholds.
- Monitors relocating shard count.
- Excludes nodes when safe.
- Re‑enables nodes after shard relocation.
Automation reduced manual effort and kept migration throughput steady.
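The safety gate at the heart of such a workflow might look like the sketch below. It assumes `health` is the parsed JSON from `GET _cluster/health`, which reports `status` and `relocating_shards`. The `load` metric, the thresholds, and the function names are hypothetical, since the article does not specify them.

```python
def safe_to_exclude(health, load, max_relocating=0, max_load=0.7):
    """Decide whether another batch of nodes may be excluded now.

    health: dict parsed from GET _cluster/health.
    load:   hypothetical normalised cluster load in the range 0..1.
    """
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) <= max_relocating
        and load <= max_load
    )


def next_action(health, load, pending_batches):
    """Return ("done" | "exclude" | "wait", batch) for one pass of the loop."""
    if not pending_batches:
        return ("done", [])
    if safe_to_exclude(health, load):
        return ("exclude", pending_batches[0])
    return ("wait", [])


if __name__ == "__main__":
    sample = {"status": "green", "relocating_shards": 0}
    print(next_action(sample, 0.4, [["data1_node1", "data2_node1"]]))
```

A scheduler would call `next_action` periodically and apply the exclusion setting only when the result is `"exclude"`. Waiting for the relocating shard count to fall before the next batch keeps the number of concurrent shard moves bounded.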

Iterative Optimisations (Jan–Feb) :
- Adjusted index.routing.allocation.total_shards_per_node using the formula total_shards_per_node = shard_num/(nodes_count * 0.95 * 0.5) to avoid shard skew.
- Set index.unassigned.node_left.delayed_timeout to 120 minutes, or to a random value between 100 and 300 minutes, to spread recovery load.
- Implemented single‑machine single‑node migration, moving one data node at a time, which increased batch throughput by 50‑80%.
- Reduced cluster.routing.allocation.cluster_concurrent_rebalance to 0 and balanced shards manually with the reroute API:

    POST _cluster/reroute
    {
      "commands": [
        {
          "move": {
            "index": "log_appcode-2023.18",
            "shard": 59,
            "from_node": "data2_node1",
            "to_node": "data2_node10"
          }
        }
      ]
    }

- Migrated coordinating and master nodes with careful host‑mapping and discovery configuration.
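The tuning values above can be worked through in a short Python sketch. The formula and the 100–300 minute range come from the article. Rounding the result up to an integer is an assumption (the setting takes a whole number), and the function names and the sample figures of 60 shards and 100 nodes are hypothetical.

```python
import math
import random


def total_shards_per_node(shard_num, nodes_count):
    """Apply shard_num / (nodes_count * 0.95 * 0.5) from the article.

    Rounding up with ceil is an assumption, not stated in the article.
    """
    return math.ceil(shard_num / (nodes_count * 0.95 * 0.5))


def delayed_timeout_body(rng=None):
    """Index settings body with a random delayed_timeout of 100-300 minutes."""
    minutes = (rng or random).randint(100, 300)
    return {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": f"{minutes}m"
        }
    }


def move_command(index, shard, from_node, to_node):
    """Body for POST _cluster/reroute that moves one shard between nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


if __name__ == "__main__":
    # 60 / (100 * 0.95 * 0.5) is about 1.26, which rounds up to 2.
    print(total_shards_per_node(60, 100))
    print(delayed_timeout_body())
    print(move_command("log_appcode-2023.18", 59, "data2_node1", "data2_node10"))
```

Drawing a different timeout per index means that shards left unassigned by a departing node do not all begin recovery at the same moment.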

Results : Migration speed more than doubled, the entire node migration completed a week ahead of schedule with zero incidents, and the cluster remained stable throughout peak holiday traffic.

Strategic Takeaways : Plan migration per node type with clear risk mitigation. Automate wherever possible to improve efficiency and repeatability. Deep understanding of ES internals (shard allocation, delayed allocation, rebalance) is essential for effective tuning.

Technical Highlights : total_shards_per_node, node_left.delayed_timeout, single‑machine single‑node migration, reroute‑based shard balancing.

References :
https://www.elastic.co/cn/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
https://www.elastic.co/guide/en/elasticsearch/reference/7.7/allocation-total-shards.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.7/delayed-allocation.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.7/index-modules-translog.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cluster-reroute.html
https://cloud.tencent.com/developer/article/1334743
https://blog.csdn.net/laoyang360/article/details/108047071

Tags : automation, operations, Elasticsearch, Performance Tuning, Cluster Migration, log platform
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
