Elasticsearch Pagination: From/Size, Deep Paging Issues, and Alternative Methods (Scroll, Search After, PIT)

This article explains how Elasticsearch pagination works with from/size, why deep paging can cause performance problems, and compares alternative techniques such as Scroll, Scroll‑Scan, Sliced Scroll, Search After, and point‑in‑time (PIT) searches for handling large result sets efficiently.

Code Ape Tech Column

Aug 11, 2023

Elasticsearch is a real‑time distributed search and analytics engine. This article introduces pagination in Elasticsearch, focusing on the default from and size parameters and the drawbacks of deep paging.

From/Size Parameters

The default query returns the top 10 hits. To paginate, you specify from (number of hits to skip) and size (maximum number of hits to return). Example request:

POST /my_index/_search
{
  "query": { "match_all": {} },
  "from": 100,
  "size": 10
}

This returns 10 documents starting from the 101st hit.
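As a minimal sketch of the arithmetic, the helper below converts a 1-based page number into the from and size values of a request body. The function name page_request and the idea of a 1-based page are assumptions made for illustration; only from and size come from Elasticsearch itself.

```python
def page_request(page: int, page_size: int = 10) -> dict:
    """Build a from/size search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # maximum number of hits to return
    }


body = page_request(page=11, page_size=10)
print(body["from"], body["size"])  # 100 10
```

Page 11 with 10 hits per page therefore skips 100 hits, which matches the request shown above.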

How the Query Is Executed

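Elasticsearch runs a search in two phases. In the query phase, each shard finds its matching documents and builds a local priority queue of size from + size, returning only document IDs and sort values to the coordinating node. The coordinating node merges these per-shard results into its own priority queue of size from + size and selects the hits for the requested page. In the fetch phase, it retrieves the actual document data for only those hits from the relevant shards.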
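The merge can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Elasticsearch's implementation: the names shard_query_phase and coordinate are hypothetical, and each hit is reduced to a (_score, _id) tuple.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (_score, _id)


def shard_query_phase(hits: List[Hit], from_: int, size: int) -> List[Hit]:
    """Each shard keeps only its own top (from + size) hits by score."""
    return heapq.nlargest(from_ + size, hits)


def coordinate(shards: List[List[Hit]], from_: int, size: int) -> List[str]:
    """Merge per-shard candidates, skip `from_` hits, return the ids to fetch."""
    candidates = [
        hit for shard in shards for hit in shard_query_phase(shard, from_, size)
    ]
    merged = heapq.nlargest(from_ + size, candidates)
    return [doc_id for _score, doc_id in merged[from_:]]


shards = [
    [(0.9, "a"), (0.5, "b"), (0.1, "c")],
    [(0.8, "d"), (0.7, "e"), (0.2, "f")],
]
print(coordinate(shards, from_=2, size=2))  # ['e', 'b']
```

Every shard has to contribute from + size candidates because the coordinating node cannot know in advance which shard holds the hits for the requested page.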

Deep Paging Problems

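When from is large, every shard must build and return its top from + size hits, and the coordinating node must sort number_of_shards × (from + size) entries to keep only size of them. The cost in CPU, memory, I/O, and network therefore grows in proportion to the paging depth and the shard count. To bound this cost, index.max_result_window defaults to 10 000: a request whose from + size exceeds it is rejected unless the setting is raised.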

PUT _settings
{
  "index": { "max_result_window": "10000000" }
}

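Most of that work is wasted: during the query phase each shard ships from + size entries containing only _id and sort values (such as _score) to the coordinating node, which discards all but the final size of them.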
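A quick back-of-the-envelope calculation shows the scale of the problem. The helper name candidates_at_coordinator and the five-shard index are assumptions chosen for illustration.

```python
def candidates_at_coordinator(shards: int, from_: int, size: int) -> int:
    """Entries the coordinating node receives and must sort for one page."""
    return shards * (from_ + size)


# First page of a 5-shard index: 50 entries are merged to return 10 hits.
print(candidates_at_coordinator(shards=5, from_=0, size=10))       # 50

# Page 1001 of the same index: 50050 entries are merged to return 10 hits.
print(candidates_at_coordinator(shards=5, from_=10_000, size=10))  # 50050
```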

Alternative Pagination Methods

Scroll

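Scroll keeps a search context alive so that every batch is read from the same snapshot of the index, taken at the time of the initial request. It suits batch processing of large data sets, not real‑time user queries, because later changes to the index are not visible. The scroll=1m parameter tells Elasticsearch how long to keep the context alive between requests. The initial request returns a _scroll_id, which is used for subsequent fetches.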

POST /twitter/_search?scroll=1m
{
  "size": 100,
  "query": { "match": { "title": "elasticsearch" } }
}

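Each subsequent call sends the most recently returned _scroll_id, together with a new scroll timeout, to the _search/scroll endpoint to retrieve the next batch. Iteration ends when a batch comes back with no hits; clearing the scroll at that point (DELETE /_search/scroll) releases the search context early.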
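The loop around a scroll can be sketched as follows. To keep the example free of any particular client library, the three callables start, next_page, and clear are hypothetical stand-ins: in a real program they would send the initial search with ?scroll=1m, the follow-up request to _search/scroll, and the DELETE that clears the scroll.

```python
from typing import Callable, Dict, Iterator


def scroll_all(
    start: Callable[[], Dict],
    next_page: Callable[[str], Dict],
    clear: Callable[[str], None],
) -> Iterator[Dict]:
    """Yield every hit from a scroll, then release the search context."""
    response = start()
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = next_page(scroll_id)
            # The id may change between batches; always use the latest one.
            scroll_id = response["_scroll_id"]
    finally:
        # Runs on normal completion and on errors, so the context is freed.
        clear(scroll_id)
```

Clearing in a finally block is a design choice of this sketch: an abandoned scroll otherwise keeps its search context open until the timeout expires.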

Scroll‑Scan

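Scroll‑Scan adds search_type=scan to skip scoring and sorting, which improves performance when ordering is not required. With scan, the size parameter applies per shard, so each batch can contain up to size × number_of_primary_shards hits. Note that search_type=scan was deprecated in ES 2.1 and removed in ES 5.0; on later versions, a regular scroll sorted by _doc gives the same benefit.

POST /my_index/_search?search_type=scan&scroll=1m&size=50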
{
  "query": { "match_all": {} }
}

Sliced Scroll

Sliced scroll splits a scroll request into multiple slices that can be consumed independently and in parallel. Each slice is identified by an id (from 0 to max - 1) and by max, the total number of slices.

POST /index/_search?scroll=1m
{
  "query": { "match_all": {} },
  "slice": { "id": 0, "max": 5 }
}
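One request body is needed per slice, typically one per worker. A minimal sketch, assuming a hypothetical helper named sliced_scroll_bodies:

```python
from typing import Dict, List


def sliced_scroll_bodies(query: Dict, max_slices: int) -> List[Dict]:
    """Build one scroll request body per slice, with ids 0 .. max_slices - 1."""
    if max_slices < 2:
        # A single slice is just an ordinary scroll.
        raise ValueError("max_slices must be at least 2")
    return [
        {"query": query, "slice": {"id": slice_id, "max": max_slices}}
        for slice_id in range(max_slices)
    ]


bodies = sliced_scroll_bodies({"match_all": {}}, max_slices=5)
print([b["slice"]["id"] for b in bodies])  # [0, 1, 2, 3, 4]
```

Each body starts its own scroll, so every worker receives and follows its own _scroll_id.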

Search After

Introduced in ES 5, search_after uses the sort values of the last hit from the previous page to fetch the next page, avoiding deep paging.

POST /twitter/_search
{
  "size": 10,
  "query": { "match": { "title": "es" } },
  "sort": [ { "date": "asc" }, { "_id": "desc" } ]
}

After obtaining the last hit’s sort array, the next request includes it:

GET /twitter/_search
{
  "size": 10,
  "query": { "match": { "title": "es" } },
  "search_after": [124648691, "624812"],
  "sort": [ { "date": "asc" }, { "_id": "desc" } ]
}
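The paging loop can be sketched as a generator that copies the sort array of each page's last hit into the next request. The search callable is a hypothetical stand-in for whatever sends the body to the _search endpoint and returns the parsed response; it is not part of any specific client library.

```python
from typing import Callable, Dict, Iterator, List


def search_after_pages(
    search: Callable[[Dict], Dict], body: Dict
) -> Iterator[List[Dict]]:
    """Yield pages of hits, feeding each page's last sort values into the next."""
    request = dict(body)  # shallow copy so the caller's body is not mutated
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return  # an empty page means the result set is exhausted
        yield hits
        # The "sort" array of the last hit becomes the cursor for the next page.
        request["search_after"] = hits[-1]["sort"]
```

Because the cursor is derived from sort values rather than an offset, each request asks every shard for only size hits, however deep the iteration goes. The sort must end in a unique tiebreaker, as in the example above, or hits that share a sort value can be skipped or repeated.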

Point‑in‑Time (PIT) with Search After

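From ES 7.10, a point in time (PIT) can be opened to keep the index state stable across multiple search_after requests, so that pages stay consistent even while the index is being updated. Opening a PIT returns an ID: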

POST /my-index-000001/_pit?keep_alive=1m

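The PIT ID is then supplied in the search request. The request path does not name an index, because the PIT already determines which indices are searched: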

GET /_search
{
  "size": 10000,
  "query": { "match": { "user.id": "elkbee" } },
  "pit": { "id": "<PIT_ID>", "keep_alive": "1m" },
  "sort": [ { "@timestamp": { "order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type": "date_nanos" } } ]
}
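For pages after the first, the same body is sent again with search_after added. A minimal sketch of a body builder follows; the function name pit_search_body and its default values are assumptions for illustration.

```python
from typing import Dict, List, Optional


def pit_search_body(
    pit_id: str,
    query: Dict,
    sort: List[Dict],
    search_after: Optional[List] = None,
    size: int = 1000,
    keep_alive: str = "1m",
) -> Dict:
    """Build a search body that pages through a point in time."""
    body = {
        "size": size,
        "query": query,
        # keep_alive extends the PIT lifetime with every request.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": sort,
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body


first = pit_search_body(
    "<PIT_ID>",
    {"match": {"user.id": "elkbee"}},
    [{"@timestamp": {"order": "asc"}}],
)
print("search_after" in first)  # False
```

Each search response carries a PIT ID as well, and it may differ from the one that was sent, so the next request should use the most recent value. When paging is finished, closing the PIT (DELETE /_pit) frees its resources before keep_alive expires.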

Performance Comparison

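From/size works well for small result windows (from + size up to 10 000). Scroll solves deep paging but holds a search context open on every shard, which is costly under many concurrent users, and its snapshot does not reflect later changes. Search After offers the best performance for large, real‑time pagination, but it requires a sort that ends in a unique tiebreaker field and can only step forward page by page rather than jump to an arbitrary page.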

Conclusion

If the data set is small (within 10 000 hits) and only top‑N results are needed, use from/size.

For massive data sets and batch processing, use scroll, sorted by _doc when ordering is unnecessary (the older scroll‑scan search type was removed in ES 5.0).

For large data sets with real‑time, high‑concurrency queries, prefer search_after (optionally with PIT).

Both Scroll and Search After rely on cursor‑like mechanisms to avoid the cost of deep paging, but they are not a complete cure; deep paging should be avoided whenever possible.


