Implementing ZSTD Compression in Didi's Elasticsearch for High‑Performance Log Ingestion
By integrating ZSTD compression into its Elasticsearch 7.6 deployment, Didi's team cut peak CPU usage by about 15 %, reduced index storage by roughly 30 %, boosted write throughput by up to 25 %, and retired more than 20 servers, demonstrating a faster, more storage‑efficient approach to petabyte‑scale log ingestion.
The article introduces Didi's effort to improve Elasticsearch (ES) write performance for massive log ingestion (5‑10 PB per day) by adopting the ZSTD compression algorithm.
ES provides data retrieval through indexes, which are made up of shards; each shard contains segment files that store the inverted index and document data. The main segment file types are row‑store files (.fdt/.fdx) holding the original documents, column‑store doc‑values files (.dvd), and inverted‑index files (.tim/.doc).
Because the log clusters are write‑heavy, row‑store files dominate index storage (more than 30 % of index size). Didi's ES 7.6.0 (Lucene 8.4.0) offers two stored‑fields compression modes: BEST_SPEED (LZ4) and BEST_COMPRESSION (DEFLATE, the algorithm behind ZIP). BEST_COMPRESSION reduces storage by 20‑40 % compared with LZ4 but raises CPU usage, which can exceed 30 % of a node's capacity.
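For context, stock Elasticsearch exposes the choice between these two Lucene modes through the `index.codec` index setting, supplied at index creation with `PUT my-logs-index` (the index name here is a placeholder). Didi's work extends this mechanism with a ZSTD option; the article does not name the exact setting value.

```json
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```

Omitting `index.codec` (or setting it to `default`) keeps the LZ4‑backed BEST_SPEED codec.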
ZSTD (Zstandard) combines Finite State Entropy (FSE) coding, SIMD optimizations, and dictionary compression, offering a strong balance of speed and compression ratio. Benchmarks on a 1 GB log file show ZSTD compressing about 4.5× faster and decompressing about 1.5× faster than ZIP, with a comparable compression ratio.
Implementation steps include:
Extending ES settings and engine to support a ZSTD compression format per shard.
Adding ZSTD support to Lucene via the zstd‑jni library and extending CompressionMode with custom ZStandardCompressor and ZStandardDecompressor classes.
Parameter tuning: adjusting Chunk Size (set to 60 KB) and selecting an appropriate ZSTD compression level (level 3 for a good speed‑ratio trade‑off, level 9 for higher storage savings).
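The steps above can be sketched end‑to‑end. The class below is an illustrative stand‑in, not Didi's code: it buffers documents into a 60 KB chunk and compresses the whole chunk in one call, the same shape as Lucene's stored‑fields writer. The stdlib `java.util.zip.Deflater` stands in for ZSTD so the sketch stays self‑contained; in the real integration the compress call would be zstd‑jni's `Zstd.compress(chunk, level)`, with level 3 or 9 as discussed (note that ZSTD levels span 1‑22, while Deflater's span 0‑9).

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch of Lucene-style chunked stored-fields compression.
// CHUNK_SIZE mirrors the 60 KB value from the tuning step above;
// Deflater is a stdlib stand-in for zstd-jni's Zstd.compress(chunk, level).
public class ChunkedCompressor {
    static final int CHUNK_SIZE = 60 * 1024; // 60 KB chunk, per the article's tuning

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int level; // Deflater level here; with ZSTD, 3 (speed) or 9 (ratio)

    public ChunkedCompressor(int level) {
        this.level = level;
    }

    /** Buffer a document; once the chunk is full, compress and return it, else null. */
    public byte[] addDocument(byte[] doc) {
        buffer.write(doc, 0, doc.length);
        return buffer.size() >= CHUNK_SIZE ? flushChunk() : null;
    }

    /** Compress the buffered chunk in a single call, as Lucene does per chunk. */
    public byte[] flushChunk() {
        byte[] chunk = buffer.toByteArray();
        buffer.reset();
        Deflater deflater = new Deflater(level);
        deflater.setInput(chunk);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[8192];
        while (!deflater.finished()) {
            out.write(tmp, 0, deflater.deflate(tmp));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Compressing one larger chunk instead of many small documents is what makes the chunk-size knob matter: a bigger chunk gives the compressor more context and a better ratio, at the cost of more work per flush.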
After three months of testing, the ZSTD‑enabled ES build was rolled out to 16 clusters covering more than 60,000 indexes. Results:
Average CPU usage during peak reduced by ~15 %.
Cluster A: CPU usage down 18 %, write‑reject rate down 50 %.
Large index M: CPU usage down 15 %, write throughput up 25 %.
Overall index storage reduced by ~30 % after switching from LZ4 to ZSTD.
Cluster resource reduction enabled the removal of more than 20 machines.
In summary, ZSTD compression provides higher performance and lower cost for Elasticsearch log services, and future work will include read/write separation and major ES version upgrades.
Didi Tech