Elasticsearch Index Design: Scaling to PB/TB Levels and Best Practices

This article provides a comprehensive guide on designing Elasticsearch indices for massive data volumes, covering shard and replica sizing, mapping strategies, rollover templates, curator cleanup, tokenization choices, query type selection, and multi‑table association techniques to achieve efficient, reliable search at PB‑scale.

Big Data Technology Architecture

Introduction

Elasticsearch has become a mainstream solution for both large enterprises and small‑to‑medium businesses, leading to a flood of articles on deployment, frameworks, and performance tuning. However, designing indices for real‑world workloads—especially those handling hundreds of GB of incremental data daily—requires careful planning.

Why Index Design Matters

Good index design influences cluster planning, maintenance cost, and overall system reliability. Ignoring design can cause delayed releases and costly re‑engineering.

1. Designing PB‑Level Indices

For massive data streams, a single index quickly becomes a bottleneck. The basic workflow for any index has three steps:

Step 1: Create the index.

Step 2: Import or write data.

Step 3: Serve query requests.

1.1 Drawbacks of a Large Index

When daily increments reach billions of documents, a single index runs into storage limits, degraded performance, and a higher risk of failure.

1.2 Implementation Using Template + Rollover + Curator

Use index templates to enforce consistent settings and mappings, Rollover to create a new index once the current one reaches a given age, document count, or size, and Curator to clean up old data.

GET _cat/shards
POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000,
    "max_size": "5gb"
  }
}
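The conditions in the request above are alternatives: the index rolls over as soon as any one of them is met. A minimal Python sketch of that decision follows. The `IndexStats` class and `should_roll_over` function are hypothetical names used only for illustration, and the thresholds simply mirror the example request.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    """Hypothetical snapshot of the index behind the write alias."""
    age_days: float  # time since the index was created
    docs: int        # number of documents in the primary shards
    size_gb: float   # total size of the primary shards


def should_roll_over(stats, max_age_days=7, max_docs=1000, max_size_gb=5):
    """Return True when at least one rollover condition is met."""
    return (
        stats.age_days >= max_age_days
        or stats.docs >= max_docs
        or stats.size_gb >= max_size_gb
    )


# A one-day-old, small index still rolls over once it holds 1000 documents.
print(should_roll_over(IndexStats(age_days=1, docs=1000, size_gb=0.1)))
```

In production the thresholds would normally be far larger than `max_docs: 1000`; the small value is convenient only for testing the mechanism.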

Alias management ensures that write operations target the latest index while searches can span all historical indices.

POST /_aliases
{
  "actions": [
    { "remove": { "index": "index_2019-01-01-000001", "alias": "index_latest" } },
    { "add":    { "index": "index_2019-01-02-000002", "alias": "index_latest" } }
  ]
}
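Because both actions travel in one `_aliases` request, the alias is moved atomically and never points at zero or two indices. As a rough sketch, the request body above could be generated like this; `alias_swap_body` is a hypothetical helper, not part of any Elasticsearch client.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` to `new_index`."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_body(
    "index_latest",
    "index_2019-01-01-000001",
    "index_2019-01-02-000002",
)
# Send this JSON as the body of POST /_aliases.
print(json.dumps(body, indent=2))
```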

1.3 Curator for Historical Data Cleanup

Configure Curator with a simple YAML action file to delete indices older than a given period. Curator can also shrink or force‑merge older indices that you still want to keep.

actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 30 days based on index name prefix "logs_".
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs_
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
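To make the two filters concrete, here is an approximate Python re-implementation of the selection logic: keep names with the `logs_` prefix, read the date embedded in the name, and select those older than 30 days. This is only a sketch of the idea, not Curator's actual code; `indices_to_delete` and the sample index names are hypothetical, and Curator's exact cutoff arithmetic may differ at the boundary.

```python
import re
from datetime import date, datetime, timedelta

# Matches a date written with the '%Y.%m.%d' timestring, e.g. 2019.01.31.
DATE_IN_NAME = re.compile(r"\d{4}\.\d{2}\.\d{2}")


def indices_to_delete(names, today, prefix="logs_", days=30):
    """Return the index names whose embedded date is more than `days` old."""
    cutoff = today - timedelta(days=days)
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue  # pattern filter: wrong prefix
        found = DATE_IN_NAME.search(name)
        if found is None:
            continue  # age filter: no date in the name
        index_date = datetime.strptime(found.group(), "%Y.%m.%d").date()
        if index_date < cutoff:
            selected.append(name)
    return selected


names = ["logs_2019.01.01", "logs_2019.02.15", "metrics_2019.01.01", "logs_current"]
print(indices_to_delete(names, today=date(2019, 3, 1)))
```

Running Curator itself with its `--dry-run` flag is a safer way to preview what a real action file would delete.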

2. Shard and Replica Design

A shard is an independent Lucene index; replicas provide high availability and additional search capacity. The recommended shard size is 20‑40 GB, with 30‑50 GB common in practice, and the number of primary shards should roughly match the number of data nodes.

2.1 Determining Shard Count

Estimate total data volume (days × daily growth).

Divide by the target shard size (≈30 GB) to get a baseline shard count.

Adjust to align with node count for balanced distribution.
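The three steps above can be sketched as a small calculation. The function name and the choice to round up to a multiple of the node count are assumptions made for illustration; the article only says the count should align with the number of nodes.

```python
import math


def estimate_primary_shards(daily_gb, retention_days, data_nodes, target_shard_gb=30):
    """Estimate a primary shard count from data volume and node count."""
    # Step 1: total data volume over the retention period.
    total_gb = daily_gb * retention_days
    # Step 2: baseline count from the target shard size (at least one shard).
    baseline = max(1, math.ceil(total_gb / target_shard_gb))
    # Step 3 (assumption): round up to a multiple of the data node count
    # so that shards can be spread evenly.
    return math.ceil(baseline / data_nodes) * data_nodes


# 100 GB/day kept for 30 days on 6 data nodes:
# 3000 GB / 30 GB = 100 shards, rounded up to 102.
print(estimate_primary_shards(daily_gb=100, retention_days=30, data_nodes=6))
```

With rollover in place, the same arithmetic can be applied per rolled index instead of to the whole retention window.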

2.2 Replica Settings

For clusters with ≥2 data nodes, set at least one replica to ensure fault tolerance and improve search throughput.
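A replica is never allocated to the same node as its primary, so replicas on a single-node cluster stay unassigned. The sketch below encodes that rule and builds a body for the index settings API (`PUT /<index>/_settings`). Both helper names are hypothetical.

```python
def replica_settings(data_nodes, replicas=1):
    """Build an index settings body with a replica count the cluster can host."""
    # Each copy of a shard needs its own node, so cap replicas at nodes - 1.
    usable = min(replicas, max(data_nodes - 1, 0))
    return {"index": {"number_of_replicas": usable}}


def total_shard_copies(primaries, replicas):
    """Primaries plus all replica copies, i.e. the shards the cluster must hold."""
    return primaries * (1 + replicas)


print(replica_settings(data_nodes=3))                # one replica
print(replica_settings(data_nodes=1))                # no replica on a single node
print(total_shard_copies(primaries=5, replicas=1))   # 10 shard copies in total
```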

3. Mapping Design

Mapping defines how fields are stored and indexed. Key considerations include the field type and whether each field needs to be searchable, sortable, aggregatable, or stored separately.

3.1 Static vs Dynamic Mapping

Prefer static mapping for production workloads to control field types and reduce storage overhead.

PUT new_index
{
  "mappings": {
    "properties": {
      "status_code": { "type": "keyword" }
    }
  }
}

Note that Elasticsearch does not support direct field deletion or type changes; reindexing is required for major alterations.
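A common way to change a field type is therefore to create a new index with the corrected mapping and copy the data across with the `_reindex` API. The sketch below only builds the two request bodies; the index names `old_index` and `new_index_v2` are hypothetical, and the typeless mapping assumes Elasticsearch 7 or later.

```python
import json


def corrected_mapping(field, field_type):
    """Body for PUT /<new index>: the same field with its new type."""
    return {"mappings": {"properties": {field: {"type": field_type}}}}


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy every document from source to dest."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# 1. Create the new index with status_code mapped as keyword.
print("PUT /new_index_v2", json.dumps(corrected_mapping("status_code", "keyword")))
# 2. Copy the documents over.
print("POST /_reindex", json.dumps(reindex_body("old_index", "new_index_v2")))
```

If clients address the index through an alias, the alias can be switched to the new index once the copy finishes, so no client configuration changes.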

3.2 Template Example

PUT _template/test_template
{
  "index_patterns": ["test_index_*", "test_*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "max_result_window": 100000,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "title": { "type": "keyword" },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "available": { "type": "boolean" },
      "review": {
        "type": "nested",
        "properties": {
          "nickname": { "type": "text" },
          "text": { "type": "text" },
          "stars": { "type": "integer" }
        }
      },
      "publish_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" },
      "expected_attendees": { "type": "integer_range" },
      "ip_addr": { "type": "ip" },
      "suggest": { "type": "completion" }
    }
  }
}

4. Tokenizer Selection

For Chinese text, ik_max_word provides fine‑grained tokenization, while ik_smart offers coarser segmentation. The article recommends ik_max_word combined with match_phrase for most scenarios.
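One way to compare the two analyzers on your own data is the `_analyze` API, which returns the tokens an analyzer produces for a given text. The sketch below builds one request body per analyzer. It assumes the IK analysis plugin is installed on the cluster, and both the sample sentence and the helper name are placeholders.

```python
import json


def analyze_body(analyzer, text):
    """Body for POST /_analyze: tokenize `text` with the named analyzer."""
    return {"analyzer": analyzer, "text": text}


SAMPLE_TEXT = "中华人民共和国国歌"  # placeholder; use text from your own corpus

for name in ("ik_max_word", "ik_smart"):
    # Compare the token lists returned for each analyzer side by side.
    print("POST /_analyze", json.dumps(analyze_body(name, SAMPLE_TEXT), ensure_ascii=False))
```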

5. Query Type Selection

Different query types serve distinct purposes:

term: exact match on keyword fields.

prefix: prefix auto‑completion on keyword fields.

wildcard: pattern matching (use with caution).

match: full‑text search on text fields (broad results).

match_phrase: phrase search ensuring token order.

multi_match: full‑text search across multiple fields.

query_string: supports Boolean operators and complex expressions.

bool: combines must, should, must_not, and filter clauses.

Examples of each query type are provided with corresponding curl / POST bodies.
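For reference, here is a hedged sketch of a minimal request body for each query type, written as Python dictionaries that can be serialized and sent to `POST /<index>/_search`. The field names (`status_code`, `title`, `content`, `available`) are borrowed from the mapping examples earlier in this article, and all search values are made up.

```python
import json


def build_queries(text="index design"):
    """Return one minimal search body per query type."""
    return {
        "term": {"query": {"term": {"status_code": {"value": "200"}}}},
        "prefix": {"query": {"prefix": {"title": {"value": "elastic"}}}},
        "wildcard": {"query": {"wildcard": {"title": {"value": "elastic*design"}}}},
        "match": {"query": {"match": {"content": text}}},
        "match_phrase": {"query": {"match_phrase": {"content": text}}},
        "multi_match": {
            "query": {"multi_match": {"query": text, "fields": ["title", "content"]}}
        },
        "query_string": {
            "query": {
                "query_string": {
                    "query": "(index AND design) OR shard",
                    "default_field": "content",
                }
            }
        },
        "bool": {
            "query": {
                "bool": {
                    "must": [{"match": {"content": text}}],
                    "should": [{"match_phrase": {"content": text}}],
                    "must_not": [{"term": {"status_code": {"value": "500"}}}],
                    "filter": [{"term": {"available": True}}],
                }
            }
        },
    }


for name, body in build_queries().items():
    print(name, json.dumps(body))
```

Clauses under `filter` and `must_not` do not contribute to the relevance score, which makes `filter` the natural place for exact conditions such as status flags.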

6. Multi‑Table Association Strategies

Four main approaches are discussed:

Materialize a wide view in the relational database and sync it to Elasticsearch.

Keep relational data in the source DB and use Elasticsearch for search, joining back to the DB for detailed data.

Use nested objects for one‑to‑few relationships.

Use parent‑child join type for one‑to‑many scenarios (caution: performance impact).
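As an illustration of the nested approach, the sketch below builds a `nested` query against the `review` field from the template example in section 3.2. Both conditions must hold within the same review object, which is the guarantee nested fields add over plain object fields. The helper name and the search values are hypothetical.

```python
import json


def nested_review_query(nickname, min_stars):
    """Find documents having one review by `nickname` with at least `min_stars`."""
    return {
        "query": {
            "nested": {
                "path": "review",
                "query": {
                    "bool": {
                        "must": [{"match": {"review.nickname": nickname}}],
                        "filter": [{"range": {"review.stars": {"gte": min_stars}}}],
                    }
                },
            }
        }
    }


print(json.dumps(nested_review_query("alice", 4), indent=2))
```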

7. Common Pitfalls

Data cleaning should happen before indexing.

Leverage Elasticsearch’s built‑in highlighter instead of custom implementations.

Avoid using Elasticsearch for heavy transactional logic; keep it to search and simple aggregations.

Invest in proper index design early to prevent costly rework.

SSDs improve I/O but do not replace good data modeling.

Because Elasticsearch lacks ACID transactions, consider dual‑write or sync mechanisms with a relational store for strict consistency.

Conclusion

This article consolidates practical experience from projects with close to ten million records, offering actionable guidelines to help engineers design robust, scalable Elasticsearch solutions without repeating common mistakes.

Written by

Big Data Technology Architecture