Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade
During a rolling upgrade of an Elasticsearch cluster, stopping nodes, especially the master, can block write requests, cause client connection failures, trigger master re-election, and leave duplicate documents behind when clients retry. Understanding the shutdown sequence and its effect on read and write operations is therefore essential.
When performing a rolling upgrade of an Elasticsearch cluster, nodes are stopped one by one, which can affect ongoing read/write operations and the master node election process.
Conclusion: If the master node is stopped, the cluster elects a new master and then goes through gateway recovery and index recovery; until primary shards are assigned again, write requests are blocked.
When a data node is stopped, the TCP connections for read/write requests are closed, causing client failures. Writes that have already reached the Engine stage complete, but the client cannot see the result; retries may cause duplicate data when auto‑generated IDs are used.
Overall, a rolling upgrade interrupts in-flight writes and, when the master itself is restarted, forces a new election, so writes fail or stall temporarily. Clients that retry will not lose data, though duplicates may appear. After the new master is elected, shard allocation can take a long time, which lengthens the period during which clients must keep retrying.
While some primary shards are unassigned, a write with an auto-generated ID may still succeed on retry, because the retried request gets a new ID that can route to a shard whose primary is available. Documents then accumulate on the available shards, which can skew the data distribution.
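The duplicate-on-retry behaviour described above can be illustrated with a toy in-memory model. This is only a sketch, not Elasticsearch client code: RetrySketch, index, and indexWithRetry are hypothetical names, and the "lost response" is simulated by throwing after the document has been stored. It assumes the client retries the identical request once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy in-memory stand-in for an index (hypothetical, not an Elasticsearch API). */
class RetrySketch {
    private final Map<String, String> docs = new HashMap<>();
    private boolean dropNextResponse;

    RetrySketch(boolean dropFirstResponse) {
        this.dropNextResponse = dropFirstResponse;
    }

    /** Stores the document, then optionally "loses" the response, as when the connection closes. */
    String index(String id, String source) {
        // A null id models an auto-generated ID: every request gets a fresh one.
        String effectiveId = (id != null) ? id : UUID.randomUUID().toString();
        docs.put(effectiveId, source); // the write itself succeeds
        if (dropNextResponse) {
            dropNextResponse = false;
            throw new IllegalStateException("connection closed before response");
        }
        return effectiveId;
    }

    /** Retries once on failure, the way a simple client might. */
    void indexWithRetry(String id, String source) {
        try {
            index(id, source);
        } catch (IllegalStateException e) {
            index(id, source); // same explicit id overwrites; null id creates a second doc
        }
    }

    int docCount() {
        return docs.size();
    }
}
```

With an auto-generated ID the retry leaves two documents; with a client-supplied ID the retry overwrites the first copy and one document remains.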
Node shutdown basic process
The entry point is o.e.b.Bootstrap#setup, which registers a shutdown hook that runs when the process receives SIGTERM or SIGINT.
The shutdown hook calls Node#close, which invokes doStop on each service in a fixed order (TribeService, HttpServerTransport, SnapshotsService, IndicesClusterStateService, Discovery, RoutingService, ClusterService, GatewayService, SearchService, TransportService, plugins, IndicesService) and then calls doClose on each service.
In outline, the sequence does the following:
Close snapshots and the HTTP server
Stop cluster topology management (the node stops responding to pings)
Close the network module to take the node offline
Run each plugin's shutdown process
Stop IndicesService last, which closes the indices themselves
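The ordered stop phase listed above can be sketched as a loop over a list of services. This is a simplified illustration, not the Elasticsearch source: the Service interface here is hypothetical and much smaller than the real lifecycle component, and only a few of the service names from the article are used.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedShutdownSketch {
    /** Hypothetical minimal lifecycle; the real Elasticsearch interface is richer. */
    interface Service {
        void doStop();
    }

    /** Creates a service that records its stop call in the shared log. */
    static Service service(String name, List<String> log) {
        return () -> log.add("stop " + name);
    }

    /** Stops every service strictly in list order. */
    static void stopAll(List<Service> services) {
        for (Service s : services) {
            s.doStop();
        }
    }

    /** Builds a small service list in the article's order and returns the resulting log. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        List<Service> services = new ArrayList<>();
        String[] names = {"HttpServerTransport", "Discovery", "TransportService", "IndicesService"};
        for (String n : names) {
            services.add(service(n, log));
        }
        stopAll(services);
        return log;
    }
}
```

Keeping the order in one list makes the dependency explicit: client-facing and cluster-facing services go first, and the service that holds the data goes last.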
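If a write is in progress, IndicesService#doStop leads to Engine.flushAndClose, which must acquire the engine's write lock. In-flight write operations hold the corresponding read lock, so a write that has already entered the Engine finishes before the engine closes. By that point the network layer has already been shut down and the client's connection is gone, so the client should treat the request as failed.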
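The locking behaviour described above can be sketched with a ReentrantReadWriteLock. This is an illustrative model only, assuming writes take the read lock and closing takes the write lock; EngineLockSketch and its methods are hypothetical names, not the actual Engine implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class EngineLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> docs = new ArrayList<>();
    private boolean closed; // guarded by lock

    /** A write holds the read lock, so several writes may run concurrently. */
    void index(String doc, CountDownLatch entered) throws InterruptedException {
        lock.readLock().lock();
        try {
            if (closed) throw new IllegalStateException("engine is closed");
            entered.countDown();  // signal that the write is now in flight
            Thread.sleep(100);    // simulate indexing work
            synchronized (docs) { docs.add(doc); }
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Closing needs the write lock, so it waits for in-flight writes to finish. */
    void flushAndClose() {
        lock.writeLock().lock();
        try {
            closed = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    int docCount() {
        synchronized (docs) { return docs.size(); }
    }

    /** Returns the number of docs visible immediately after flushAndClose returns. */
    static int demo() {
        EngineLockSketch engine = new EngineLockSketch();
        CountDownLatch entered = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            try { engine.index("doc-1", entered); } catch (InterruptedException ignored) { }
        });
        writer.start();
        try {
            entered.await();        // make sure the write started first
            engine.flushAndClose(); // blocks until the write releases the read lock
            int count = engine.docCount();
            writer.join();
            return count;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

In this model the in-flight write is always stored before the close completes, which matches the article's point that the data is safe even though the client never receives a response.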
Read operations similarly fail because the connection is closed.
IndicesService#doStop also bounds how long it waits for the engines to close: it waits on a latch, and if flushAndClose has not finished within the timeout (1 day by default), the latch wait expires and shutdown proceeds anyway.
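The bounded wait described above follows a common pattern: run the close tasks on a pool, count them down on a latch, and give up waiting after a timeout. The sketch below shows that pattern in isolation; BoundedCloseSketch and closeAll are hypothetical names, and the pool size of 5 is an arbitrary choice for the example.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BoundedCloseSketch {
    /** Runs each closer on a small pool; returns false if the timeout expires first. */
    static boolean closeAll(List<Runnable> closers, long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newFixedThreadPool(5); // pool size is an assumption
        CountDownLatch latch = new CountDownLatch(closers.size());
        for (Runnable closer : closers) {
            pool.execute(() -> {
                try {
                    closer.run();
                } finally {
                    latch.countDown(); // count down even if the closer throws
                }
            });
        }
        try {
            // true if every closer finished in time, false if the wait timed out
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt anything still running so the caller can move on
        }
    }
}
```

A caller would typically log a warning when closeAll returns false and then continue with the rest of the shutdown rather than block indefinitely.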
Master node shutdown
The master node follows the normal shutdown flow; after the TransportService stops, the cluster elects a new master, resulting in a brief period without a master during the rolling restart.
For reference, the shutdown hook registered in Bootstrap#setup:

    if (addShutdownHook) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    // Close the node and the spawner, then shut down logging
                    IOUtils.close(node, spawner);
                    LoggerContext context = (LoggerContext) LogManager.getContext(false);
                    Configurator.shutdown(context);
                } catch (IOException ex) {
                    throw new ElasticsearchException("failed to stop node", ex);
                }
            }
        });
    }
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies