
Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade

During a rolling upgrade of an Elasticsearch cluster, stopping a node (especially the master) can block write requests, break client connections, trigger master re-election, and lead to duplicate data when clients retry. Understanding the shutdown sequence and its effect on read and write operations is therefore essential.

Big Data Technology Architecture

Jul 9, 2019

When performing a rolling upgrade of an Elasticsearch cluster, nodes are stopped one by one, which can affect ongoing read/write operations and the master node election process.

Conclusion: If the master node is stopped, the cluster elects a new master and then goes through gateway recovery and index recovery. During this period, write requests are blocked because the primary shards have not yet been assigned.

When a data node is stopped, the TCP connections for read/write requests are closed, causing client failures. Writes that have already reached the Engine stage complete, but the client cannot see the result; retries may cause duplicate data when auto‑generated IDs are used.
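The duplicate-on-retry effect described above can be illustrated with a small, self-contained sketch. It does not use the Elasticsearch client: `RetryDuplicateSketch` is a hypothetical in-memory stand-in for an index, and the "lost acknowledgement" is simulated with a flag. The assumption is simply that a write is applied but the client never sees the response, then retries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory stand-in for an index; NOT the Elasticsearch client API.
class RetryDuplicateSketch {
    private final Map<String, String> docs = new HashMap<>();
    // Simulates the node closing the connection exactly once.
    private boolean dropNextAck = true;

    // Applies the write, then may "lose" the acknowledgement, as when the write
    // reached the engine but the TCP connection was already closed.
    boolean index(String id, String body) {
        // A null id means "auto-generate", so every attempt gets a new ID.
        String effectiveId = (id != null) ? id : UUID.randomUUID().toString();
        docs.put(effectiveId, body);
        if (dropNextAck) {
            dropNextAck = false;
            return false; // the client sees a failure although the write was applied
        }
        return true;
    }

    // Client-side loop: retry until acknowledged, then report the document count.
    static int writeWithRetry(String id, String body) {
        RetryDuplicateSketch index = new RetryDuplicateSketch();
        while (!index.index(id, body)) {
            // retry the same request
        }
        return index.docs.size();
    }

    public static void main(String[] args) {
        System.out.println("auto-generated ID:  " + writeWithRetry(null, "{\"event\":\"login\"}") + " docs");
        System.out.println("client-supplied ID: " + writeWithRetry("event-42", "{\"event\":\"login\"}") + " docs");
    }
}
```

With an auto-generated ID the retry stores a second document; with a client-supplied ID the retry overwrites the same document, which is why deterministic IDs are a common way to make retries idempotent.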

Overall, a rolling upgrade interrupts in-flight writes and, when the master node is restarted, forces a master re-election, leading to temporary write failures or delays. Clients that retry will not lose data, though duplicates may appear. After the new master is elected, shard allocation can take a long time, which extends the window during which clients must keep retrying.

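While some primary shards are still unassigned, writes with auto-generated IDs may still succeed after retries, because each retry gets a new ID that can route to a shard whose primary is already assigned. Documents then accumulate on the available shards, potentially causing data skew.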
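The skew can be made concrete with a rough simulation. This is only a sketch under stated assumptions: `SkewSketch` is hypothetical, `String.hashCode()` modulo the shard count stands in for the real routing hash, and a write to the unassigned shard is assumed to fail and be retried with a fresh auto-generated ID.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical simulation; routing here is NOT Elasticsearch's real hash function.
class SkewSketch {
    // Returns per-shard document counts after `docs` successful writes while
    // shard `unassigned` has no primary and clients retry with a fresh ID.
    static int[] simulate(int shards, int unassigned, int docs, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[shards];
        for (int i = 0; i < docs; i++) {
            int shard;
            do {
                // Each attempt gets a new auto-generated ID ...
                String id = Long.toHexString(random.nextLong());
                // ... and is therefore routed independently of the previous attempt.
                shard = Math.floorMod(id.hashCode(), shards);
            } while (shard == unassigned); // assumed: the write fails and is retried
            counts[shard]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four shards, shard 0 unassigned: all documents end up on shards 1 to 3.
        System.out.println(Arrays.toString(simulate(4, 0, 1000, 42L)));
    }
}
```

Every document that would have gone to the unassigned shard ends up on one of the others, so the assigned shards carry more than their share until allocation completes.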

Node shutdown basic process

The entry point is o.e.b.Bootstrap#setup, which adds a shutdown hook that runs on SIGTERM or SIGINT.

The shutdown sequence calls Node#close, invoking doStop on each service in a specific order (TribeService, HttpServerTransport, SnapshotsService, IndicesClusterStateService, Discovery, RoutingService, ClusterService, GatewayService, SearchService, TransportService, plugins, IndicesService) and then doClose in reverse.

Close snapshots and HTTP server

Stop cluster topology management (stop responding to ping)

Close network module to take the node offline

Execute plugin shutdown processes

Close IndicesService

Finally close Indices
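The ordered stop described above can be sketched as follows. The `Service` class is a hypothetical, heavily simplified lifecycle component, and the order is copied from the list given earlier in this article rather than from a specific Elasticsearch version; the real Node#close does considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of stopping services in a fixed order.
class OrderedShutdownSketch {
    // Stop order as listed in the article (plugins omitted for brevity).
    static final List<String> STOP_ORDER = Arrays.asList(
        "TribeService", "HttpServerTransport", "SnapshotsService",
        "IndicesClusterStateService", "Discovery", "RoutingService",
        "ClusterService", "GatewayService", "SearchService",
        "TransportService", "IndicesService");

    // Minimal lifecycle component that only records that it was stopped.
    static final class Service {
        final String name;
        final List<String> log;

        Service(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void stop() {
            log.add("stop " + name);
        }
    }

    // Stops every service in the given order and returns the resulting log.
    static List<String> shutdown(List<String> orderedNames) {
        List<String> log = new ArrayList<>();
        for (String name : orderedNames) {
            Service service = new Service(name, log);
            try {
                service.stop();
            } catch (RuntimeException e) {
                // One failing service should not prevent the rest from stopping.
                log.add("failed " + name);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        shutdown(STOP_ORDER).forEach(System.out::println);
    }
}
```

The point of the ordering is that client-facing and cluster-facing services go first, while IndicesService is stopped last because it has to wait for shard resources to be released.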

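If a write is in progress when the node stops, IndicesService#doStop triggers Engine.flushAndClose, which must acquire the engine's write lock and therefore waits for the in-flight write to finish. The write itself completes safely, but the network layer has already been shut down, so the client connection is dropped and the client should treat the request as failed.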
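The locking behaviour can be sketched with a plain `ReentrantReadWriteLock`. This is an illustration under assumptions, not Elasticsearch's Engine code: `EngineLockSketch` and its methods are hypothetical, and the model assumes that indexing holds the read lock while closing needs the write lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model: writes share the read lock, close takes the write lock.
class EngineLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> events = Collections.synchronizedList(new ArrayList<>());
    private volatile boolean closed;

    // Indexes a document while holding the read lock; the latches let the
    // demo keep the write "in flight" until the close has been requested.
    boolean index(String doc, CountDownLatch started, CountDownLatch mayFinish)
            throws InterruptedException {
        lock.readLock().lock();
        try {
            if (closed) {
                return false;
            }
            started.countDown();
            mayFinish.await();
            events.add("indexed " + doc);
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Cannot proceed until every in-flight write has released the read lock.
    void flushAndClose() {
        lock.writeLock().lock();
        try {
            events.add("flushed and closed");
            closed = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Runs one write concurrently with a close and returns the event order.
    static List<String> run() {
        EngineLockSketch engine = new EngineLockSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch mayFinish = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            try {
                engine.index("doc-1", started, mayFinish);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread closer = new Thread(engine::flushAndClose);
        try {
            writer.start();
            started.await();       // the write now holds the read lock
            closer.start();        // blocks until the read lock is released
            mayFinish.countDown(); // let the write complete
            writer.join();
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(engine.events);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the close can only take the write lock after the writer releases the read lock, the write is always recorded before the close, which mirrors why an in-flight write finishes safely even though its response never reaches the client.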

Read operations similarly fail because the connection is closed.

During node shutdown, IndicesService#doStop waits on a latch for the Engine to close, with a timeout. If flushAndClose takes longer than that timeout (1 day by default), the latch wait expires and the shutdown proceeds anyway.
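The wait-with-timeout pattern can be sketched with a `CountDownLatch`. The class and method names below are hypothetical, the shard close is simulated with a sleep, and the timeouts are milliseconds rather than a day; only the latch-with-timeout idea is taken from the text.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of waiting for shard closes with an upper time bound.
class ShardCloseTimeoutSketch {
    // Starts one simulated close task per shard and waits up to timeoutMillis.
    // Returns true if every shard closed in time, false if the wait expired.
    static boolean awaitShardsClosed(int shards, long closeMillis, long timeoutMillis) {
        CountDownLatch latch = new CountDownLatch(shards);
        for (int i = 0; i < shards; i++) {
            Thread closer = new Thread(() -> {
                try {
                    Thread.sleep(closeMillis); // stands in for flushAndClose
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                latch.countDown();
            });
            closer.setDaemon(true); // do not keep the JVM alive after a timeout
            closer.start();
        }
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("fast close finished in time: " + awaitShardsClosed(3, 10, 2000));
        System.out.println("slow close finished in time: " + awaitShardsClosed(3, 2000, 50));
    }
}
```

With a timeout as long as one day, the wait is effectively unbounded in practice, so an operator is more likely to see a shutdown that hangs on a slow flush than one that gives up early.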

Master node shutdown

The master node follows the normal shutdown flow; after the TransportService stops, the cluster elects a new master, resulting in a brief period without a master during the rolling restart.

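The shutdown hook is registered in Bootstrap#setup:

if (addShutdownHook) {
  // Runs when the JVM receives SIGTERM or SIGINT.
  Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
      try {
        // Closing the node runs the stop/close sequence described above.
        IOUtils.close(node, spawner);
        // Shut down the logging context after the node has been closed.
        LoggerContext context = (LoggerContext) LogManager.getContext(false);
        Configurator.shutdown(context);
      } catch (IOException ex) {
        throw new ElasticsearchException("failed to stop node", ex);
      }
    }
  });
}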
