Elasticsearch Cluster Capacity Planning, Index Configuration, and Performance Optimization
This guide outlines practical capacity‑planning, index‑design, and write‑performance tuning for Tencent Cloud Elasticsearch clusters, covering compute and storage sizing, optimal shard counts, rollover strategies, bulk API settings, health monitoring, and common troubleshooting steps to ensure stable, high‑throughput search services.
With the rapid growth of Tencent Cloud Elasticsearch (ES) services, many users encounter stability and availability issues caused by insufficient early‑stage cluster planning. This article shares practical experience from daily ES operations, covering capacity assessment, index configuration principles, write‑performance tuning, and common troubleshooting methods.
1. Cluster Scale Evaluation
The evaluation focuses on three dimensions:
Compute resources: assess CPU and memory per node. Typical configurations such as 2C8G support ~5k docs/s write throughput, while 32C64G can handle ~50k docs/s.
Storage resources: choose an appropriate disk type (SSD for hot nodes, high-performance cloud disks for warm nodes) and capacity. Mounting multiple disks can increase throughput by ~2.8× compared to a single disk.
Node count: prefer fewer high-spec nodes (e.g., 3×32C64G) over many low-spec nodes for better stability and easier scaling.
Two vertical‑scaling methods are commonly used:
Rolling restart: restart nodes one by one, which may affect stability.
Data migration: add high-spec nodes, migrate data, then remove the old nodes (longer and costlier, but safer).
2. Index Configuration Assessment
Plan index rollover (daily for small logs, hourly for large volumes) to control index size and improve search performance.
The default primary shard count is 5 in ES 6.x and earlier (1 from ES 7.0 onward); adjust it based on data volume and query patterns.
Guidelines:
When the total shard count exceeds 100k, index creation can take minutes and cause write drops. Mitigations include pre-creating indices, using fixed mappings, and employing hot-warm-cold tiering with ILM and COS snapshots.
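The rollover and tiering ideas above can be sketched as an ILM policy. This is a minimal illustration, not Tencent Cloud's exact configuration: the policy name, phase timings, and the `temperature` allocation attribute are assumptions — the attribute name depends on how your hot and warm nodes are actually tagged.

```bash
# Illustrative ILM policy: roll over hot indices, relocate to warm nodes, delete after 30 days
curl -X PUT "localhost:9200/_ilm/policy/logs-policy" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "allocate": { "require": { "temperature": "warm" } }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```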
3. Write‑Performance Optimizations
Let ES auto‑generate doc_id to avoid extra existence checks.
Pre‑create indices with fixed mappings to reduce master‑node metadata updates.
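Pre-creating an index with an explicit, fixed mapping avoids per-document mapping updates on the master node. A minimal sketch — the index, shard counts, and field names are illustrative:

```bash
# Pre-create the index with strict dynamic mapping so writes never mutate the mapping
curl -X PUT "localhost:9200/logs-2024.01.01" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 1 },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" },
      "message":   { "type": "text" }
    }
  }
}'
```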
Increase refresh_interval (e.g., to 30 s) for non‑real‑time workloads.
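The refresh interval is a dynamic index setting, so it can be changed on a live index (index name is illustrative):

```bash
# Refresh every 30s instead of the default 1s; trades freshness for write throughput
curl -X PUT "localhost:9200/logs-2024.01.01/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "refresh_interval": "30s" } }'
```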
During bulk migrations, set replica count to 0 and restore after data transfer.
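Replica count is likewise a dynamic setting; a sketch of the disable/restore pattern (index name illustrative):

```bash
# Drop replicas before the bulk load to halve the write amplification...
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 0 } }'

# ...then restore them once the migration finishes
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 1 } }'
```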
Use the Bulk API with batch size ≈ 10 MB (≈ 10 000 docs) to reduce network overhead.
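A minimal Bulk API sketch (index name and documents are illustrative). The body is NDJSON — one action line followed by one document line per operation — and omitting `_id` lets ES auto-generate it, as recommended above:

```bash
# Two indexing operations in one request; note the required trailing newline
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' -d'
{ "index": { "_index": "my-index" } }
{ "message": "first doc" }
{ "index": { "_index": "my-index" } }
{ "message": "second doc" }
'
```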
Apply custom routing to direct bulk requests to fewer shards:
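Routing can be set per action in the bulk metadata so all documents sharing a routing key land on the same shard. A sketch with an illustrative routing value:

```bash
# All docs with routing "user-123" hash to a single shard
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' -d'
{ "index": { "_index": "my-index", "routing": "user-123" } }
{ "message": "routed doc" }
'
```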
Prefer SSD disks and mount multiple disks for higher I/O throughput.
Freeze old indices to release memory; frozen indices remain searchable but slower:
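On ES 6.6 through 7.x this is done with the freeze API (it was removed in 8.0, where frozen searchable snapshots replace it); the index name is illustrative:

```bash
# Frozen indices drop their in-memory structures and load them lazily per search
curl -X POST "localhost:9200/logs-2023.01/_freeze"
```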
Search frozen indices with ignore_throttled=false when needed:
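Because frozen indices are search-throttled, they are skipped by default; passing `ignore_throttled=false` includes them (index name illustrative):

```bash
curl -X GET "localhost:9200/logs-2023.01/_search?ignore_throttled=false" -H 'Content-Type: application/json' -d'
{ "query": { "match_all": {} } }'
```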
4. Routine Operational Checks
Cluster health (Green/Yellow/Red) can be queried via:
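Either the cluster health API or the human-readable cat API works:

```bash
# JSON status, shard counts, and relocation/initialization stats
curl -X GET "localhost:9200/_cluster/health?pretty"

# One-line tabular summary with column headers
curl -X GET "localhost:9200/_cat/health?v"
```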
Pending tasks indicate metadata update bottlenecks. Example:
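Pending cluster-state tasks can be listed via:

```bash
# Tabular view: task priority, source, and time in queue
curl -X GET "localhost:9200/_cat/pending_tasks?v"
```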
Explain unassigned shards:
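With no request body, the allocation explain API reports on the first unassigned shard it finds:

```bash
# Explains why a shard is unassigned and why each node rejected it
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```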
List red indices:
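The cat indices API can filter by health directly:

```bash
# Only indices whose health is red (i.e., with unassigned primary shards)
curl -X GET "localhost:9200/_cat/indices?v&health=red"
```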
Common unassigned-shard causes and fixes: a node left the cluster (shards recover once it rejoins), disk usage crossed the high watermark (free up space or adjust the watermark settings), allocation retries were exhausted after repeated failures (retry allocation manually), or the replica count exceeds the number of available nodes (lower number_of_replicas).
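One frequent fix: when a shard stays unassigned because its allocation retries were exhausted, allocation can be retried manually:

```bash
# Re-attempts allocation of shards that failed their allowed retries
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
```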
Conclusion
By following the capacity‑assessment guidelines, proper index design, and the performance‑tuning recommendations above, ES clusters can achieve high stability and throughput while reducing operational overhead. The troubleshooting checklist helps quickly locate and resolve common issues, ensuring a reliable search service for Tencent Cloud customers.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.