
How Tencent Scales Elasticsearch for Massive Log, Search, and Time‑Series Workloads

Tencent leverages Elasticsearch at massive scale across log analytics, search services, and time‑series monitoring, addressing challenges of high availability, low cost, and high performance through kernel optimizations, resource‑aware throttling, cold‑data merging, rollup, caching, and open‑source contributions.

Efficient Ops

Feb 10, 2020


Elasticsearch (ES) is an open‑source distributed search and analytics engine that easily meets real‑time log analysis, full‑text search, and structured data analysis needs, greatly reducing the cost of extracting value from data in the big‑data era.

Tencent uses ES at massive scale across many internal scenarios and, together with Elastic, provides an enhanced ES cloud service on Tencent Cloud, continuously optimizing native ES for high availability, high performance, and low cost.

1. ES Use Cases at Tencent

The presentation covers: an overview of Tencent’s diverse ES application scenarios and their typical characteristics; the challenges encountered in large‑scale, high‑pressure, varied usage; Tencent’s kernel optimizations for high availability, low cost, and high performance; and thoughts on future plans and open‑source contributions.

Real‑time log analysis: operational logs (slow logs, error logs), business logs (clicks, visits), and audit logs for security analysis. ES fits this scenario for four reasons:

The Elastic ecosystem provides a complete logging solution that can be deployed from mature components.

Latency from log generation to searchability is typically within 10 seconds, far lower than in traditional big‑data solutions.

Inverted‑index and column‑store structures support flexible analysis.

Interactive analysis responds within seconds, even on trillion‑scale logs.

Search services: product search (e‑commerce), app store search, and site search for forums and documentation. Typical characteristics:

High performance: 100 k+ QPS per service, ~20 ms response time, P95 latency < 100 ms.

Strong relevance: evaluated by precision, recall, and similar metrics.

High availability: four‑nines (99.99 %) availability, with multi‑datacenter deployment so that the service survives the failure of a single datacenter.
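To make the four‑nines target concrete, the arithmetic below converts an availability figure into a downtime budget. It is only a worked calculation; the function name is illustrative and not part of any Tencent or Elastic tooling.

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of permitted downtime over `days` for a given availability target."""
    return (1.0 - availability) * days * 24 * 60


# Four nines (99.99 %) over one year and over a 30-day month.
yearly = downtime_budget_minutes(0.9999)
monthly = downtime_budget_minutes(0.9999, days=30)
print(f"per year:  {yearly:.2f} minutes")
print(f"per month: {monthly:.2f} minutes")
```

At four nines the budget is under an hour per year, which is why surviving a whole‑datacenter failure without manual recovery matters for these services.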

Time‑series data analysis: metrics, APM, and IoT sensor data. Typical characteristics:

High‑concurrency writes: 600+ nodes sustaining 10 M writes/s.

Low query latency: ~10 ms per curve (one time‑series query).

Multidimensional analysis: flexible statistics across dimensions such as region and business module.

2. Challenges

The challenges fall into two categories: search workloads and time‑series workloads.

Search workloads demand four‑nines availability, must tolerate single‑machine or single‑datacenter failures, and require high performance (20 k QPS, ~20 ms latency, P95 < 100 ms). The core challenges are high availability and high performance.

Time‑series workloads emphasize cost and performance. They require massive write throughput (up to 10 M writes/s) and retain 30 days of data, leading to petabyte‑scale storage. However, the actual business value per machine is low, making storage and compute cost a major concern.
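A back‑of‑the‑envelope calculation shows how the write rate and retention period above reach petabyte scale. The write rate and retention come from the text; the average document size and replica count are assumptions chosen only for illustration.

```python
WRITES_PER_SECOND = 10_000_000  # from the text
RETENTION_DAYS = 30             # from the text
BYTES_PER_DOC = 100             # assumption: average on-disk size per document
REPLICAS = 1                    # assumption: one replica copy per primary

SECONDS_PER_DAY = 86_400
PETABYTE = 10**15               # decimal petabyte


def retained_docs(writes_per_second: int, retention_days: int) -> int:
    """Number of documents held at steady state."""
    return writes_per_second * SECONDS_PER_DAY * retention_days


docs = retained_docs(WRITES_PER_SECOND, RETENTION_DAYS)
primary_bytes = docs * BYTES_PER_DOC
total_bytes = primary_bytes * (1 + REPLICAS)

print(f"documents retained: {docs:,}")
print(f"primary storage:    {primary_bytes / PETABYTE:.3f} PB")
print(f"with replicas:      {total_bytes / PETABYTE:.3f} PB")
```

Even with a small assumed document size, 30 days of retention lands in the petabyte range, so any saving per byte stored is multiplied across the whole cluster.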

3. ES Optimization Practices

High‑availability improvements are divided into three dimensions:

System robustness: fault tolerance under abnormal queries or overload, scalability, data balancing across nodes and disks during expansion.

Disaster‑recovery: rapid recovery from datacenter network failures, natural disasters, and accidental deletions.

System defects: addressing issues such as master node blockage, distributed deadlocks, and slow rolling restarts.

Solutions include service throttling at four layers (permission, queue, memory, multi‑tenant), extending the ES plugin mechanism to back up and restore data on cheap storage, cross‑AZ disaster recovery, and a “trash‑bin” mechanism that allows quick recovery after deletions caused by billing problems or user error.

Memory‑level throttling uses JVM memory statistics and gradient monitoring to interrupt aggregation requests when memory is insufficient, improving stability under heavy load.
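The idea can be sketched as a small admission check. This is a minimal illustration, not Tencent's implementation: the class name, the thresholds, and the reading of "gradient" as the growth rate of heap usage across recent samples are all assumptions.

```python
from collections import deque


class MemoryThrottle:
    """Hypothetical sketch: decide whether to interrupt an aggregation request."""

    def __init__(self, soft_limit=0.75, hard_limit=0.90, max_growth=0.05, window=5):
        self.soft_limit = soft_limit    # assumed: start watching the trend here
        self.hard_limit = hard_limit    # assumed: always interrupt above this
        self.max_growth = max_growth    # assumed: tolerated growth per sample
        self.samples = deque(maxlen=window)

    def record(self, heap_used_ratio: float) -> None:
        """Store one heap-usage sample in the range 0.0..1.0."""
        self.samples.append(heap_used_ratio)

    def growth_per_sample(self) -> float:
        """Average change in heap usage between consecutive samples in the window."""
        if len(self.samples) < 2:
            return 0.0
        return (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)

    def should_interrupt(self) -> bool:
        if not self.samples:
            return False
        current = self.samples[-1]
        if current >= self.hard_limit:
            return True
        # Between the limits, interrupt only if usage is climbing quickly.
        return current >= self.soft_limit and self.growth_per_sample() > self.max_growth


throttle = MemoryThrottle()
for ratio in (0.50, 0.55, 0.60, 0.70, 0.85):
    throttle.record(ratio)
print(throttle.should_interrupt())
```

Combining an absolute limit with a trend check lets a node that is busy but stable keep serving, while a request that is rapidly inflating memory is stopped before the heap is exhausted.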

Cost‑optimization focuses on the hot‑cold data pattern. Disk cost is reduced by separating hot and cold storage, using hybrid storage, pre‑computing (Rollup) to replace raw data, and lifecycle management.

Rollup, introduced in ES 6.x, pre‑computes statistical summaries (similar to OLAP cubes) to lower storage and improve query performance. Tencent implemented streaming multi‑way merge for Rollup, achieving CPU usage < 10 % of full‑data writes and memory < 10 MB.
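A streaming multi‑way merge can be sketched as follows. This illustrates the general technique rather than Tencent's code: it assumes each input stream is already sorted by (time bucket, dimension), and the row layout and function name are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def rollup(streams):
    """Merge sorted streams of (bucket, dimension, value) rows into summaries.

    Each stream must already be sorted by (bucket, dimension). Only one row per
    stream is buffered at a time, so memory stays small regardless of data size.
    """
    key = itemgetter(0, 1)
    merged = heapq.merge(*streams, key=key)
    for (bucket, dimension), rows in groupby(merged, key=key):
        count, total = 0, 0.0
        lowest, highest = float("inf"), float("-inf")
        for _, _, value in rows:
            count += 1
            total += value
            lowest = min(lowest, value)
            highest = max(highest, value)
        yield bucket, dimension, {"count": count, "sum": total, "min": lowest, "max": highest}


# Two sorted inputs, e.g. two segments covering the same time range.
segment_a = [(0, "host1", 1.0), (0, "host2", 4.0), (60, "host1", 2.0)]
segment_b = [(0, "host1", 3.0), (60, "host1", 6.0), (60, "host2", 5.0)]

summaries = list(rollup([segment_a, segment_b]))
for row in summaries:
    print(row)
```

Because the inputs are consumed in sorted order, each output summary is complete as soon as the key changes, which is what keeps the memory footprint independent of the volume of raw data.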

Memory usage is improved by introducing an LFU cache placed off‑heap, combined with weak references and reduced copy overhead, raising memory utilization by 80 % and cutting GC overhead by 30 % while keeping query performance loss under 2 %.
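The eviction policy itself is simple to illustrate. The sketch below shows only the LFU (least frequently used) rule in plain Python; it does not model off‑heap placement or weak references, and the class is a hypothetical example rather than the cache used in Tencent's kernel.

```python
class LFUCache:
    """Minimal LFU cache: evicts the entry with the fewest accesses."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._values = {}
        self._hits = {}

    def get(self, key, default=None):
        if key not in self._values:
            return default
        self._hits[key] += 1
        return self._values[key]

    def put(self, key, value) -> None:
        if key not in self._values and len(self._values) >= self.capacity:
            # O(n) scan for clarity; ties go to the oldest inserted key.
            victim = min(self._hits, key=self._hits.get)
            del self._values[victim]
            del self._hits[victim]
        self._values[key] = value
        self._hits[key] = self._hits.get(key, 0) + 1


cache = LFUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")
cache.get("a")
cache.put("c", 3)  # "b" has the fewest accesses, so it is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Frequency‑based eviction keeps entries that are read repeatedly even when a burst of one‑off lookups passes through, which a purely recency‑based policy would not.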

Performance enhancements include:

Write path: index‑based deduplication improves write speed by 45 %; optimizing translog refresh boosts performance by 20 %.

Query path: merge‑strategy tuning (time‑aware merge) reduces unnecessary segment scans, yielding up to 2× query speedup for search workloads.

Cold‑data automatic merge consolidates inactive indices to ~5 GB segments, aiding time‑series pruning.
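The link between time‑aware merging and pruning can be sketched as below. This is an illustration under stated assumptions, not the actual merge policy: the Segment type, the greedy grouping of time‑adjacent segments, and the size unit (MB) are hypothetical, and only the ~5 GB target comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    min_ts: int   # earliest timestamp in the segment
    max_ts: int   # latest timestamp in the segment
    size_mb: int


def combine(group):
    """Merge a group of segments into one covering their combined time range."""
    return Segment(
        min(s.min_ts for s in group),
        max(s.max_ts for s in group),
        sum(s.size_mb for s in group),
    )


def plan_merges(segments, target_mb=5 * 1024):
    """Greedily merge time-adjacent segments without exceeding the target size."""
    merged, group, group_size = [], [], 0
    for seg in sorted(segments, key=lambda s: s.min_ts):
        if group and group_size + seg.size_mb > target_mb:
            merged.append(combine(group))
            group, group_size = [], 0
        group.append(seg)
        group_size += seg.size_mb
    if group:
        merged.append(combine(group))
    return merged


def segments_to_scan(segments, start, end):
    """Keep only segments whose time range overlaps the query range."""
    return [s for s in segments if s.max_ts >= start and s.min_ts <= end]


small = [
    Segment(0, 1, 1024), Segment(2, 3, 2048), Segment(4, 5, 1536),
    Segment(6, 7, 3072), Segment(8, 9, 512), Segment(10, 11, 4096),
]
merged = plan_merges(small)
hits = segments_to_scan(merged, start=8, end=9)
print(merged)
print(hits)
```

Because only neighbouring time ranges are merged, each resulting segment covers a narrow, non‑overlapping window, so a time‑bounded query can skip most segments without opening them.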

4. Future Plans and Open‑Source Contributions

In the past six months Tencent submitted over ten pull requests to the Elastic open‑source project, covering write, query, and cluster‑management modules, and established an internal open‑source collaboration team.

Open‑source participation reduces branch‑maintenance cost, accelerates adoption of upstream features, deepens engineers’ understanding of the kernel, and enhances technical influence in the community.

Future work includes expanding the ES control platform for automated cluster management, exploring OLAP‑style analysis on top of ES, and continuing to strengthen product and kernel capabilities.
