
Alluxio Edge: Edge Caching Solution for Trino and PrestoDB

Alluxio Edge is a library that runs inside Trino or PrestoDB workers and uses local SSD or memory to cache data from cloud storage. This restores data locality and cuts storage egress. Reported results include 1.5x to 10x faster end-to-end queries and 10x to 50x faster IO on IO-only queries.

Sohu Tech Products

Dec 13, 2023

This article introduces Alluxio Edge, an edge caching solution for Trino and PrestoDB. It begins with the background: modern data stacks decouple compute from storage, which costs query engines their data locality and drives up cloud storage egress costs.

Alluxio Edge is a library that runs within PrestoDB or Trino processes, utilizing local storage (SSD or memory) for data caching. It addresses three main challenges: IO being the primary performance bottleneck, performance fluctuations from storage systems like HDFS affecting query engine IO, and network resource consumption from distributed computing operations.

The reference architecture shows a one-to-one mapping between Trino workers and Alluxio Edge instances. When Trino accesses data from S3 or other storage systems, Alluxio Edge automatically caches data locally. Testing showed 1.5x to 10x end-to-end query performance improvement, and 10x to 50x IO speed improvement on IO-only queries. Cloud storage API calls were reduced by 50% to 90%.
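The read path described above can be sketched as a read-through page cache. This is a minimal illustration, not Alluxio Edge's actual code: `RemoteStore`, `LocalPageCache`, the 4-byte `PAGE_SIZE`, and the LRU policy choice are all assumptions made for the example. It shows why repeated reads of the same byte range stop generating storage API calls once the pages are held locally.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # bytes per page; tiny on purpose so the example is easy to trace


class RemoteStore:
    """Hypothetical stand-in for S3 or another object store; counts read requests."""

    def __init__(self, objects):
        self.objects = objects
        self.requests = 0

    def read(self, key, offset, length):
        self.requests += 1
        return self.objects[key][offset:offset + length]


class LocalPageCache:
    """Read-through cache keyed by (object key, page index) with LRU eviction."""

    def __init__(self, store, capacity_pages):
        self.store = store
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()

    def read_page(self, key, index):
        page_id = (key, index)
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        # Cache miss: fetch the whole page from remote storage and keep it.
        data = self.store.read(key, index * PAGE_SIZE, PAGE_SIZE)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data

    def read(self, key, offset, length):
        if length <= 0:
            return b""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        data = b"".join(self.read_page(key, i) for i in range(first, last + 1))
        start = offset - first * PAGE_SIZE
        return data[start:start + length]


store = RemoteStore({"table/part-0": b"abcdefghijklmnop"})
cache = LocalPageCache(store, capacity_pages=2)
first_read = cache.read("table/part-0", 2, 5)   # spans pages 0 and 1: two remote reads
second_read = cache.read("table/part-0", 2, 5)  # both pages are cached: no remote reads
```

After both reads, `store.requests` is 2 rather than 4, which is the mechanism behind the reduction in cloud storage API calls. A real deployment would use much larger pages and persist them on SSD.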

Key features include local SSD and memory caching, support for multiple data lake connectors (Iceberg, Hudi, Delta Lake, Hive), flexible cache eviction policies (LRU, FIFO, TTL), and cache quotas. The technical challenges it addresses are data consistency (handled with page versioning), data locality (soft affinity scheduling and consistent hashing), and cache utilization (filtering strategies that decide what is worth caching).
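To make the data locality point concrete, here is a minimal sketch of consistent hashing with soft affinity. The `HashRing` class, its `vnodes` parameter, and the `busy` set are hypothetical names for illustration, not the scheduler code used by Trino, PrestoDB, or Alluxio Edge. The idea is that each file path maps to a preferred worker, so repeated reads land on the worker that already cached it; if that worker is busy, the next worker on the ring is used instead of failing.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preferred_worker(self, path, busy=frozenset()):
        """Return the first non-busy worker clockwise from the path's hash."""
        if not self._ring:
            return None
        start = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        for step in range(len(self._ring)):
            worker = self._ring[(start + step) % len(self._ring)][1]
            if worker not in busy:  # soft affinity: fall back instead of waiting
                return worker
        return None


ring = HashRing(["worker-a", "worker-b", "worker-c"])
path = "s3://bucket/table/part-0.parquet"
owner = ring.preferred_worker(path)
fallback = ring.preferred_worker(path, busy={owner})
```

Calling `preferred_worker` twice with the same path returns the same worker, and removing one worker from the ring only remaps the paths that worker owned, so the other workers' caches stay warm.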

In production, Uber runs Alluxio Edge on 15,000 nodes across three clusters. The reported results are a 50% end-to-end performance improvement, a 10% reduction in HDFS read traffic, 80% of GCS read requests avoided, and a P90 latency drop from 228 seconds to 50 seconds.
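To put the P90 figures on the same footing as the other percentages, the short calculation below converts the two reported latencies into a relative reduction and a speedup factor. Only the 228-second and 50-second values come from the article; the variable names are illustrative.

```python
p90_before_s = 228  # reported P90 latency before caching, in seconds
p90_after_s = 50    # reported P90 latency after caching, in seconds

reduction = (p90_before_s - p90_after_s) / p90_before_s
speedup = p90_before_s / p90_after_s

print(f"P90 latency reduction: {reduction:.1%}")  # prints "P90 latency reduction: 78.1%"
print(f"P90 speedup: {speedup:.2f}x")             # prints "P90 speedup: 4.56x"
```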

