Databases 11 min read

Deep Dive into Elasticsearch RestClient Sniffer and Node Discovery Mechanism

The article explains how our driver‑passenger matching service migrated from load‑balanced Elasticsearch access to a direct RestClient, then automated node discovery using the built‑in Sniffer and SniffOnFailureListener, detailing its scheduling, request logic, and how this eliminates manual IP management while keeping the client in sync with cluster topology.

HelloTech

Our team is responsible for driver‑passenger matching in the four‑wheel scenario and uses the open‑source distributed search engine Elasticsearch (ES) to recall orders. We consume Kafka messages with Flink and write the order data into the corresponding ES indices.

To use Elasticsearch, we first need a client that can connect to the ES cluster. Elasticsearch provides several official Java clients: the Transport client, the Low Level REST client, the High Level REST client, and the Java API client.

Initially we sent requests to an SLB (load balancer) domain, which routed them to the ES node IPs. As traffic grew, the SLB's bandwidth limit caused instability during holidays and stress tests.

We switched to direct IP connections using the RestClient, which has its own load‑balancing strategy across node IPs (implemented with Collections.rotate()). This solved the SLB bandwidth problem but introduced manual IP list management, which is error‑prone and requires an update on every scaling operation.
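The rotation idea can be sketched in a few lines. This is a minimal illustration, not the RestClient's actual implementation: the class name RoundRobinNodes, the nextOrder() method, and the sample IPs are made up for this example. It only shows how Collections.rotate() combined with an incrementing counter puts a different node first on each call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative round-robin selector; not the RestClient's actual class. */
class RoundRobinNodes {
    private final List<String> nodes;
    private final AtomicInteger lastIndex = new AtomicInteger(0);

    RoundRobinNodes(List<String> nodes) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.nodes = new ArrayList<>(nodes);
    }

    /** Returns the node list rotated one step further than on the previous call. */
    List<String> nextOrder() {
        List<String> rotated = new ArrayList<>(nodes);
        // A negative distance shifts elements toward the front, so a different
        // node ends up at index 0 on each call.
        Collections.rotate(rotated, -lastIndex.getAndIncrement());
        return rotated;
    }
}
```

A caller would try the nodes in the returned order, so consecutive requests start at different nodes and the load spreads evenly across the list.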

To automate node discovery we investigated the sniffing mechanism described in the official documentation. The basic usage is:

RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
Sniffer sniffer = Sniffer.builder(restClient).setSniffIntervalMillis(60000).build();

This initializes a sniffer that refreshes the node list every 60 seconds.

Failure sniffing can also be enabled so that the node list is refreshed immediately after a request failure. This requires creating a SniffOnFailureListener and attaching it to the RestClient:

SniffOnFailureListener sniffOnFailureListener = new SniffOnFailureListener();
RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200))
    .setFailureListener(sniffOnFailureListener)
    .build();
Sniffer sniffer = Sniffer.builder(restClient).setSniffIntervalMillis(60000).build();
sniffOnFailureListener.setSniffer(sniffer);

The listener follows the observer pattern: when a request fails, the listener triggers an immediate sniff.

Examining the Sniffer component reveals its core logic. The build() method creates a default ElasticsearchNodesSniffer if no custom NodesSniffer was supplied, then returns a new Sniffer instance:

public Sniffer build() {
    if (nodesSniffer == null) {
        this.nodesSniffer = new ElasticsearchNodesSniffer(restClient);
    }
    return new Sniffer(restClient, nodesSniffer, sniffIntervalMillis, sniffAfterFailureDelayMillis);
}

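The four‑argument constructor called by build() delegates to the constructor below, supplying a default scheduler. That constructor stores references to the nodes sniffer, the RestClient, the two delays, and the scheduler, then schedules the first Task with zero delay so that an initial sniff runs right away: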

Sniffer(RestClient restClient, NodesSniffer nodesSniffer, Scheduler scheduler, long sniffInterval, long sniffAfterFailureDelay) {
    this.nodesSniffer = nodesSniffer;
    this.restClient = restClient;
    this.sniffIntervalMillis = sniffInterval;
    this.sniffAfterFailureDelayMillis = sniffAfterFailureDelay;
    this.scheduler = scheduler;
    Task task = new Task(sniffIntervalMillis) {
        @Override
        public void run() {
            super.run();
            initialized.compareAndSet(false, true);
        }
    };
    scheduler.schedule(task, 0L);
}

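The Task class implements Runnable. Its run() method first moves the task from WAITING to STARTED with a compare‑and‑set, so a task that is no longer in the WAITING state does nothing. After sniff() executes, the finally block schedules the next task after nextTaskDelay, which is the sniff interval for regular rounds.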

class Task implements Runnable {
    final long nextTaskDelay;
    final AtomicReference<TaskState> taskState = new AtomicReference<>(TaskState.WAITING);
    Task(long nextTaskDelay) { this.nextTaskDelay = nextTaskDelay; }
    @Override
    public void run() {
        if (taskState.compareAndSet(TaskState.WAITING, TaskState.STARTED) == false) return;
        try { sniff(); }
        catch (Exception e) { logger.error("error while sniffing nodes", e); }
        finally {
            Task task = new Task(sniffIntervalMillis);
            Future<?> future = scheduler.schedule(task, nextTaskDelay);
            ScheduledTask previousTask = nextScheduledTask;
            nextScheduledTask = new ScheduledTask(task, future);
        }
    }
}
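To see the self‑rescheduling behaviour in isolation, here is a simplified model that runs without an Elasticsearch cluster. Everything in it is an assumption made for illustration: RecordingScheduler replaces the real timer‑based scheduler and only records what was scheduled, sniffCount stands in for the real sniff() call, and start() mimics what the Sniffer constructor does. It is a sketch of the pattern, not the library's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative model of a self-rescheduling task; all names here are hypothetical. */
class SelfReschedulingDemo {
    enum TaskState { WAITING, STARTED }

    /** Stand-in scheduler that records requests instead of running them on a timer. */
    static class RecordingScheduler {
        final List<Runnable> tasks = new ArrayList<>();
        final List<Long> delays = new ArrayList<>();

        void schedule(Runnable task, long delayMillis) {
            tasks.add(task);
            delays.add(delayMillis);
        }
    }

    final RecordingScheduler scheduler = new RecordingScheduler();
    final long intervalMillis;
    int sniffCount = 0;

    SelfReschedulingDemo(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Mimics the constructor's behaviour: queue the first round with no delay. */
    void start() {
        scheduler.schedule(new Task(intervalMillis), 0L);
    }

    class Task implements Runnable {
        final long nextTaskDelay;
        final AtomicReference<TaskState> state = new AtomicReference<>(TaskState.WAITING);

        Task(long nextTaskDelay) {
            this.nextTaskDelay = nextTaskDelay;
        }

        @Override
        public void run() {
            // Only the first run() call on a given task does any work.
            if (!state.compareAndSet(TaskState.WAITING, TaskState.STARTED)) {
                return;
            }
            try {
                sniffCount++; // placeholder for the real sniff() call
            } finally {
                // Always queue a follow-up round, even if the work above failed.
                scheduler.schedule(new Task(intervalMillis), nextTaskDelay);
            }
        }
    }
}
```

Each round creates a fresh Task object for the next round instead of reusing itself, which is why the state check is enough to make a second run() on the same task a no‑op.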

The actual sniffing logic contacts the cluster via a GET request to /_nodes/http:

@Override
public List<Node> sniff() throws IOException {
    Response response = restClient.performRequest(request);
    return readHosts(response.getEntity(), scheme, jsonFactory);
}

After receiving the node list, restClient.setNodes(sniffedNodes) updates the client’s internal node pool.
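Turning the response into hosts mostly comes down to reading each node's HTTP publish address. The sketch below assumes that address arrives either as ip:port or as hostname/ip:port; the class PublishAddress and its parse() method are hypothetical helpers written for this example and are not part of the client library, whose readHosts() also handles the surrounding JSON.

```java
/** Illustrative parser for a publish address string; the class name is hypothetical. */
class PublishAddress {
    final String host;
    final int port;

    PublishAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Parses "ip:port" or "hostname/ip:port" (assumed formats) into host and port. */
    static PublishAddress parse(String publishAddress) {
        String address = publishAddress;
        String hostname = null;
        int slash = address.indexOf('/');
        if (slash >= 0) {
            // "hostname/ip:port": prefer the hostname, take the port from the right-hand part.
            hostname = address.substring(0, slash);
            address = address.substring(slash + 1);
        }
        // lastIndexOf keeps bracketed IPv6 literals such as "[::1]:9200" intact.
        int colon = address.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing port in: " + publishAddress);
        }
        String host = hostname != null ? hostname : address.substring(0, colon);
        int port = Integer.parseInt(address.substring(colon + 1));
        return new PublishAddress(host, port);
    }
}
```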

Failure sniffing is implemented via the SniffOnFailureListener which, upon a node failure, calls sniffer.sniffOnFailure() to trigger an immediate sniff:

public class SniffOnFailureListener extends RestClient.FailureListener {
    private volatile Sniffer sniffer;
    @Override
    public void onFailure(Node node) {
        if (sniffer == null) {
            throw new IllegalStateException("sniffer was not set, unable to sniff on failure");
        }
        sniffer.sniffOnFailure();
    }
}
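The wiring order used earlier (create the listener, build the client with it, build the sniffer, then call setSniffer) follows from a circular dependency: the sniffer needs the client, and the client needs the listener. A stripped‑down model makes this visible. All names below (FailureObserverDemo, Client, SniffingListener, setSniffAction) are invented for the illustration, and a plain Runnable stands in for the sniffer.

```java
/** Illustrative observer wiring; every name in this class is hypothetical. */
class FailureObserverDemo {
    /** Observer role, comparable to RestClient.FailureListener. */
    interface FailureListener {
        void onFailure(String node);
    }

    /** Subject role: reports failed nodes to the listener it was built with. */
    static class Client {
        private final FailureListener listener;

        Client(FailureListener listener) {
            this.listener = listener;
        }

        void reportFailure(String node) {
            listener.onFailure(node);
        }
    }

    /** Listener whose reaction is injected after construction. */
    static class SniffingListener implements FailureListener {
        private volatile Runnable sniffAction;

        void setSniffAction(Runnable sniffAction) {
            this.sniffAction = sniffAction;
        }

        @Override
        public void onFailure(String node) {
            Runnable action = sniffAction;
            if (action == null) {
                // Same guard as in the excerpt above: failing loudly beats silently skipping.
                throw new IllegalStateException("sniff action was not set");
            }
            action.run(); // stands in for sniffer.sniffOnFailure()
        }
    }
}
```

If the setter call is forgotten, the first reported failure throws instead of quietly doing nothing, so the misconfiguration shows up on the first failed request.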

Overall, switching from SLB to a static IP list reduced dependency on the load balancer, but manual IP management introduced errors. Enabling Elasticsearch’s built‑in node sniffing automates node discovery, improves scaling efficiency, and ensures the client stays up‑to‑date with cluster topology.
