
Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.


Background: Zhihu operates a high‑quality online Q&A community backed by a hybrid‑cloud architecture spanning IaaS, PaaS, and SaaS layers, which introduces cost‑management challenges.

FinOps‑Driven Cost Reduction: Since 2022, Zhihu has built a FinOps system to align finance and engineering, emphasizing transparent cost measurement, a centralized FinOps team, and performance‑based incentives. Costs are allocated either by direct billing or by secondary allocation, using both fixed‑price and percentage‑share models so that incentives match cost‑saving actions.
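As an illustration of the percentage‑share model mentioned above, a shared‑cluster bill can be split by each team's resource share. This is a minimal sketch; the helper name, teams, and figures are hypothetical, not Zhihu's actual billing code.

```python
# Illustrative percentage-share cost allocation
# (hypothetical helper, not Zhihu's actual billing code).

def allocate_shared_cost(total_cost, usage_by_team):
    """Split a shared-cluster bill across teams by their usage share."""
    total_usage = sum(usage_by_team.values())
    return {
        team: round(total_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }

# Example: a 10,000-unit monthly bill split by CPU-core-hours consumed.
bill = allocate_shared_cost(10_000, {"search": 600, "ads": 300, "bi": 100})
print(bill)  # {'search': 6000.0, 'ads': 3000.0, 'bi': 1000.0}
```

A fixed‑price model would instead charge each team a pre‑agreed rate per unit consumed, independent of the cluster's total bill.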

Operational System: With the billing framework in place, Zhihu runs cost alerts, anomaly attribution, and regular review meetings to sustain cost‑saving momentum.

Technical Cost‑Saving Measures:

Erasure Coding (EC): Replaces three‑replica storage with RS‑6‑3 erasure coding, cutting raw storage from 3x to 1.5x the logical data size (a 50% reduction) while maintaining reliability; because encoding and reconstruction add CPU overhead, warm/cold data must be selected carefully.
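The storage‑overhead arithmetic behind the 50% figure can be checked directly. A minimal sketch (the function names are ours, for illustration only):

```python
# Storage overhead: 3-replica HDFS stores 3 physical bytes per logical
# byte, while RS(k, m) erasure coding stores (k + m) / k.

def storage_factor_replication(replicas=3):
    return float(replicas)

def storage_factor_ec(data_blocks, parity_blocks):
    return (data_blocks + parity_blocks) / data_blocks

rep = storage_factor_replication(3)   # 3.0x physical per logical byte
ec = storage_factor_ec(6, 3)          # RS-6-3 -> 1.5x
saving = 1 - ec / rep
print(f"{rep}x vs {ec}x -> {saving:.0%} less raw storage")  # 50% less
```

Note that RS‑6‑3 tolerates the loss of any 3 of its 9 blocks, matching the 2‑failure tolerance of 3‑replica storage at half the footprint.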

ZSTD Compression: Migrates Parquet files from Snappy or no compression to ZSTD, reducing storage by roughly 30% (up to 60% for previously uncompressed tables), and uses parquet‑tools for efficient page‑level recompression without schema changes.
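For newly written tables, switching the Parquet codec is a one‑line Spark setting. A hedged config sketch for Spark 3.x (the article's page‑level recompression of existing files is done separately with parquet‑tools):

```properties
# Write new Parquet output with ZSTD instead of Snappy (Spark 3.x).
spark.sql.parquet.compression.codec=zstd
# Optionally also use ZSTD for shuffle/spill traffic.
spark.io.compression.codec=zstd
```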

Spark Auto‑Tuning: Collects job metrics (CPU, memory, GC, shuffle I/O) via jvm‑profile and sparklens, then uses a heuristic service to recommend executor counts and memory settings, achieving roughly 30% resource savings.
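A heuristic of the kind described above might look like the following. This is a deliberately simplified sketch: the thresholds, multipliers, and function name are our assumptions, not the actual tuning service.

```python
# Simplified memory-tuning heuristic (illustrative only: thresholds
# and names are assumptions, not Zhihu's actual service).

def suggest_executor_memory(current_mb, peak_used_mb, gc_time_fraction):
    """Suggest a new executor memory setting from profiled job metrics."""
    if gc_time_fraction > 0.15:
        # Heavy GC: the job is memory-starved, grow the heap.
        return int(current_mb * 1.5)
    if peak_used_mb < current_mb * 0.5:
        # Less than half the heap is ever used: shrink, keep 30% headroom.
        return max(1024, int(peak_used_mb * 1.3))
    return current_mb  # within a healthy band, leave it alone

print(suggest_executor_memory(8192, 3000, 0.02))  # over-provisioned -> 3900
print(suggest_executor_memory(4096, 3900, 0.25))  # GC-bound -> 6144
```

In practice such a service would also cap suggestions at cluster limits and dampen changes across runs to avoid oscillation.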

Remote Shuffle Service (RSS): Replaces the default External Shuffle Service with Apache Celeborn (originally open‑sourced by Alibaba), converting random shuffle I/O into sequential I/O, supporting partition splitting, and cutting shuffle‑read P99 latency by roughly 30%.
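Pointing a Spark job at a Celeborn cluster is mostly a matter of configuration. A hedged sketch (property names follow the Apache Celeborn documentation; the endpoint address is a placeholder):

```properties
# Route shuffle data through a Celeborn cluster instead of the ESS.
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints=celeborn-master:9097
# The built-in External Shuffle Service is no longer needed.
spark.shuffle.service.enabled=false
```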

Summary & Outlook: Combining FinOps governance with the above technical optimizations has delivered sustainable cost reductions. Future plans include migrating Hive workloads to Spark, adopting Gluten + Velox for higher compute efficiency, and exploring hybrid EMR/fixed‑pool resource models.

Tags: Big Data, Cloud Cost Management, Cost Optimization, FinOps, ZSTD, Erasure Coding, Spark
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
