
PrestoDB vs Trino: Testing, Selection, Alluxio Acceleration, and Deployment Practices at Zhihu

This article details Zhihu's evaluation of PrestoDB and Trino, the integration of Alluxio for query acceleration, the architectural choices and deployment modes, extensive TPC‑DS and production performance tests, encountered challenges, and future optimization directions for their OLAP platform.

DataFunTalk

Presto, a distributed OLAP SQL engine originally from Facebook, offers flexible connectors for heterogeneous data sources but suffers from latency due to its lack of a dedicated storage layer, making network and storage instability a concern for large‑scale interactive analytics.

The engine evolved from PrestoDB → PrestoSQL → Trino, prompting Zhihu to reassess its architecture to meet growing analytical demands while keeping hardware costs stable.

Architecture Overview

Presto follows a classic MPP design with a coordinator node handling query parsing, optimization, and scheduling, and multiple worker nodes executing pipelines entirely in memory, reducing I/O overhead.

It is best suited for small‑to‑medium data volumes that require high flexibility and moderate performance, excelling in interactive reporting, ad‑hoc queries, and data exploration, but not in high‑fault‑tolerance workloads such as massive ETL or machine‑learning pipelines.

Target Goals

Achieve a consistent downward trend in query latency without adding new machines.

Improve median (P50) reporting query performance by at least 50% compared with the previous setup.

Cache Mode Selection

Two primary options were considered:

1. Deploy an independent Alluxio cluster to provide a unified acceleration layer for Presto, Hive, and Spark.

2. Enable the built‑in RaptorX cache in PrestoDB, which offers local caching without additional infrastructure.

Given the Kubernetes‑based deployment and the high operational cost of integrating a separate Alluxio cluster, the team initially chose PrestoDB RaptorX for its cost‑effectiveness.
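As an illustration, enabling RaptorX‑style local data caching in a PrestoDB Hive catalog involves properties along these lines. The names follow the public PrestoDB documentation; the paths and sizes below are placeholders, not Zhihu's actual configuration:

```properties
# hive.properties — sketch of RaptorX local data caching (illustrative values)
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/ssd/presto-cache
cache.alluxio.max-cache-size=500GB
# Route splits of the same file to the same worker so its local cache stays warm
hive.node-selection-strategy=SOFT_AFFINITY
```

Soft affinity scheduling is what makes local caching pay off: without it, splits for the same file scatter across workers and each worker caches (and misses) independently.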

OLAP Performance Testing

TPC‑DS Benchmark

Tests compared PrestoDB 0.280 with Trino 416 on a 96‑core, 192 GB cluster using a 500 GB TPC‑DS dataset in Parquet and ORC formats, with Presto running on Java 11 and Trino on Java 17.

Results showed Trino outperforming PrestoDB on several queries due to advanced features such as dynamic filtering and better utilization of Spark‑generated statistics.
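Dynamic filtering, in essence, collects the join keys from the small (build) side at runtime and pushes them down as a filter on the large (probe) side scan, so non‑matching splits are never read. A conceptual Python sketch of the idea (function and field names here are illustrative, not Trino internals):

```python
# Conceptual sketch of dynamic filtering: build-side join keys are
# collected at runtime and used to prune the probe-side scan.
def hash_join_with_dynamic_filter(build_rows, probe_scan):
    build = {r["key"]: r for r in build_rows}
    keys = set(build)                      # the "dynamic filter"
    # The probe scan only yields rows whose key survives the filter,
    # skipping data that cannot possibly match the join.
    for row in probe_scan(keys):
        if row["key"] in build:
            yield {**build[row["key"]], **row}

build = [{"key": 1, "region": "EU"}]

def probe_scan(keys):                      # pretend this prunes at the source
    data = [{"key": 1, "amt": 10}, {"key": 2, "amt": 99}]
    return (r for r in data if r["key"] in keys)

print(list(hash_join_with_dynamic_filter(build, probe_scan)))
# [{'key': 1, 'region': 'EU', 'amt': 10}]
```

In a real engine the pruning happens at the connector level (skipping partitions, files, or row groups), which is why the feature matters most when the probe side is large and selectively joined.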

Production A/B Tests

Real‑world workloads (≈20 representative SQLs) were executed on a 1‑coordinator + 64‑worker setup, revealing a 1.5‑2× performance gain for P50 queries after applying RaptorX optimizations.

Challenges and Issues

Cache Customization

Implemented a path‑prefix based CacheFilter in Alluxio to control cache inclusion, aiming for >60% hit rate while keeping the 43 TB cache space efficiently utilized.
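The idea behind the path‑prefix filter can be sketched as follows. This is a minimal Python illustration only; Alluxio's actual CacheFilter is a Java interface, and the class and method names here are hypothetical:

```python
# Minimal sketch of a path-prefix cache filter (hypothetical names;
# Alluxio's real CacheFilter is a Java interface, not this class).
class PrefixCacheFilter:
    def __init__(self, allowed_prefixes):
        # Only paths under these prefixes are admitted to the cache,
        # keeping limited cache space for hot warehouse layers.
        self.allowed_prefixes = tuple(allowed_prefixes)

    def should_cache(self, path: str) -> bool:
        return path.startswith(self.allowed_prefixes)

f = PrefixCacheFilter(["hdfs://warehouse/dwd/", "hdfs://warehouse/ads/"])
print(f.should_cache("hdfs://warehouse/dwd/orders/part-0.parquet"))  # True
print(f.should_cache("hdfs://warehouse/tmp/scratch.orc"))            # False
```

Admitting only known‑hot prefixes trades some coverage for a much better hit rate per cached byte, which is the point when cache space (43 TB here) is the scarce resource.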

Cache Hit‑Rate Measurement

Discovered that worker‑side cumulative metrics obscured true hit rates; added explicit hit/miss counters to obtain windowed statistics, revealing actual hit rates around 30%.
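Windowed statistics can be recovered from cumulative counters by differencing successive snapshots; the sketch below shows why a lifetime ratio can mask a much lower recent hit rate (counter shapes are hypothetical):

```python
# Sketch: derive a windowed hit rate from cumulative hit/miss counters
# by differencing two snapshots (snapshot format is hypothetical).
def windowed_hit_rate(prev, curr):
    """prev/curr are (hits, misses) cumulative snapshots."""
    d_hits = curr[0] - prev[0]
    d_miss = curr[1] - prev[1]
    total = d_hits + d_miss
    return d_hits / total if total else 0.0

# Lifetime totals suggest a 60% hit rate, but the last window is only 30%.
prev = (600, 400)   # snapshot at t0: 60% cumulative hit rate
curr = (630, 470)   # snapshot at t1: +30 hits, +70 misses since t0
print(windowed_hit_rate(prev, curr))  # 0.3 — the true recent hit rate
```

This is exactly the distortion described above: cumulative metrics are dominated by history, so only per‑window deltas reveal how the cache is behaving now.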

Routing Strategy

The existing random routing in presto‑gateway caused low cache utilization; a hash‑based routing on sorted table names was introduced, boosting shard‑cache hit rates from 5% to 30%.
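The routing change can be sketched as hashing a query's sorted table names to choose a backend cluster, so queries touching the same tables always land where those tables are already cached. The function below is a hypothetical illustration, not presto‑gateway's actual routing API:

```python
import hashlib

# Sketch: deterministic routing on sorted table names (hypothetical;
# not presto-gateway's real API). Queries over the same table set always
# route to the same backend cluster, keeping its cache warm.
def route(tables, num_clusters):
    key = ",".join(sorted(tables))            # order-independent key
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_clusters

# Same tables -> same cluster, regardless of their order in the SQL.
assert route(["dwd.orders", "dim.users"], 4) == route(["dim.users", "dwd.orders"], 4)
```

Sorting before hashing is the key detail: without it, `a JOIN b` and `b JOIN a` would hash differently and scatter cache‑friendly queries across clusters, which is how random routing kept hit rates near 5%.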

Footer Metadata Caching

Enabling the Alluxio local cache inadvertently cached ORC/Parquet footers as well, leading to stale‑metadata read errors on frequently updated tables; disabling footer caching resolved the issue.

Feature Gaps vs. Trino

PrestoDB lacks built‑in internal network authentication and comprehensive audit logging, requiring manual code migration from Trino; it also lags in Hadoop 3/EC support, causing compatibility errors with EC‑encoded files.

Conclusions

Adopting PrestoDB RaptorX delivered ≥50% performance improvement for typical reporting queries without additional hardware, reduced remote storage I/O, and demonstrated that careful resource reuse can yield substantial gains.

Future work includes further cache‑hit optimization, small‑file governance, introducing vectorized execution via Velox, and expanding monitoring granularity.

Tags: caching, OLAP, Big Data, Alluxio, Trino, PrestoDB, Performance Testing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
