Elasticsearch Index Design: Scaling to PB/TB Levels and Best Practices
This article provides a comprehensive guide on designing Elasticsearch indices for massive data volumes, covering shard and replica sizing, mapping strategies, rollover templates, curator cleanup, tokenization choices, query type selection, and multi‑table association techniques to achieve efficient, reliable search at PB‑scale.
Introduction
Elasticsearch has become a mainstream solution for both large enterprises and small‑to‑medium businesses, leading to a flood of articles on deployment, frameworks, and performance tuning. However, designing indices for real‑world workloads—especially those handling hundreds of GB of incremental data daily—requires careful planning.
Why Index Design Matters
Good index design influences cluster planning, maintenance cost, and overall system reliability. Ignoring design can cause delayed releases and costly re‑engineering.
1. Designing PB‑Level Indices
For massive data streams, a single index quickly becomes a bottleneck. The recommended workflow is:
Step 1: Create the index.
Step 2: Import or write data.
Step 3: Serve query requests.
1.1 Drawbacks of a Large Index
When daily increments reach billions, a single index suffers from storage limits, performance degradation, and higher failure risk.
1.2 Implementation Using Template + Rollover + Curator
Use index templates to enforce consistent settings, Rollover to create new indices based on age, document count, or size, and Curator to clean up old data.
GET _cat/shards

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000,
    "max_size": "5gb"
  }
}

1.3 Alias Management
Alias management ensures that write operations target the latest index while searches can span all historical indices.
POST /_aliases
{
  "actions": [
    { "remove": { "index": "index_2019-01-01-000001", "alias": "index_latest" } },
    { "add": { "index": "index_2019-01-02-000002", "alias": "index_latest" } }
  ]
}

1.4 Curator for Historical Data Cleanup
Configure Curator with a simple YAML task to delete indices older than a given period, optionally shrinking or force‑merging them.
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 30 days based on index name prefix "logs_".
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs_
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30

2. Shard and Replica Design
A shard is an independent Lucene index; replicas provide high availability and additional parallel search capacity. A commonly recommended shard size is 20–40 GB (with 30–50 GB workable in practice), and the number of shards should roughly match the number of data nodes so that shards distribute evenly.
2.1 Determining Shard Count
Estimate total data volume (days × daily growth).
Divide by the target shard size (≈30 GB) to get a baseline shard count.
Adjust to align with node count for balanced distribution.
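The three steps above can be sketched as a small calculation. This is an illustrative helper, not an official formula; the 30 GB target and the node-alignment rule are the rules of thumb from this section:

```python
import math

def estimate_shard_count(daily_gb, retention_days, target_shard_gb=30, data_nodes=None):
    """Baseline primary-shard count: total volume divided by the target
    shard size, rounded up, then aligned to the data-node count so each
    node carries roughly the same number of shards."""
    total_gb = daily_gb * retention_days
    shards = max(1, math.ceil(total_gb / target_shard_gb))
    if data_nodes:
        # Round up to a multiple of the node count for balanced distribution.
        shards = math.ceil(shards / data_nodes) * data_nodes
    return shards

# e.g. 100 GB/day retained for 30 days on a 10-node cluster
print(estimate_shard_count(100, 30, data_nodes=10))  # -> 100 shards of ~30 GB each
```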
2.2 Replica Settings
For clusters with ≥2 data nodes, set at least one replica to ensure fault tolerance and improve search throughput.
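Unlike the primary shard count, the replica count can be changed dynamically without reindexing. A minimal example (the index name here is illustrative):

```
PUT /logs_write-000001/_settings
{
  "index": { "number_of_replicas": 1 }
}
```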
3. Mapping Design
Mapping defines how fields are stored and indexed. Key considerations include field type selection, whether the field needs to be searchable, sortable, aggregatable, or stored separately.
3.1 Static vs Dynamic Mapping
Prefer static mapping for production workloads to control field types and reduce storage overhead.
PUT new_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": { "type": "keyword" }
      }
    }
  }
}

Note that Elasticsearch does not support direct field deletion or type changes; reindexing is required for major alterations.
3.2 Template Example
PUT _template/test_template
{
  "index_patterns": ["test_index_*", "test_*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "max_result_window": 100000,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "title": { "type": "keyword" },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "available": { "type": "boolean" },
      "review": {
        "type": "nested",
        "properties": {
          "nickname": { "type": "text" },
          "text": { "type": "text" },
          "stars": { "type": "integer" }
        }
      },
      "publish_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" },
      "expected_attendees": { "type": "integer_range" },
      "ip_addr": { "type": "ip" },
      "suggest": { "type": "completion" }
    }
  }
}

4. Tokenizer Selection
For Chinese text, ik_max_word provides fine‑grained tokenization, while ik_smart offers coarser segmentation. For most scenarios, indexing with ik_max_word and querying with match_phrase is the recommended combination.
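Assuming the IK analysis plugin is installed, the two analyzers can be compared directly with the _analyze API; for the same input, ik_max_word emits more, finer‑grained tokens than ik_smart:

```
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
```

Swap "analyzer" to "ik_smart" and rerun to see the coarser segmentation side by side.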
5. Query Type Selection
Different query types serve distinct purposes:
term : exact match on keyword fields.
prefix : prefix auto‑completion on keyword fields.
wildcard : pattern matching (use with caution).
match : full‑text search on text fields (broad results).
match_phrase : phrase search ensuring token order.
multi_match : full‑text search across multiple fields.
query_string : supports Boolean operators and complex expressions.
bool : combines must, should, must_not, and filter clauses.
Examples of each query type are provided with corresponding curl / POST bodies.
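As one combined illustration (the index name is hypothetical; the field names follow the template in section 3.2), a bool query can mix a scoring phrase clause with non-scoring filters:

```
POST /test_index_2019-01/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "content": "搜索引擎" } }
      ],
      "filter": [
        { "term": { "available": true } },
        { "range": { "publish_time": { "gte": "2019-01-01" } } }
      ]
    }
  }
}
```

Putting the term and range clauses under filter rather than must keeps them out of relevance scoring and lets Elasticsearch cache them.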
6. Multi‑Table Association Strategies
Four main approaches are discussed:
Materialize a wide view in the relational database and sync it to Elasticsearch.
Keep relational data in the source DB and use Elasticsearch for search, joining back to the DB for detailed data.
Use nested objects for one‑to‑few relationships.
Use parent‑child join type for one‑to‑many scenarios (caution: performance impact).
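For the nested-object approach, queries against the nested fields must go through a nested query with the field's path. This sketch reuses the review field from the template in section 3.2 (the index name is hypothetical):

```
POST /test_index_2019-01/_search
{
  "query": {
    "nested": {
      "path": "review",
      "query": {
        "bool": {
          "must": [
            { "match": { "review.text": "好评" } }
          ],
          "filter": [
            { "range": { "review.stars": { "gte": 4 } } }
          ]
        }
      }
    }
  }
}
```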
7. Common Pitfalls
Data cleaning should happen before indexing.
Leverage Elasticsearch’s built‑in highlighter instead of custom implementations.
Avoid using Elasticsearch for heavy transactional logic; keep it to search and simple aggregations.
Invest in proper index design early to prevent costly rework.
SSD improves I/O but does not replace good data modeling.
Because Elasticsearch lacks ACID transactions, consider dual‑write or sync mechanisms with a relational store for strict consistency.
Conclusion
This article consolidates practical experience from projects handling nearly ten million records, offering actionable guidelines to help engineers design robust, scalable Elasticsearch solutions without repeating common mistakes.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies