
GooseFS: Accelerating Cloud Storage for Big Data and Data Lake Platforms

GooseFS, Tencent Cloud's Hadoop-compatible storage accelerator, adds a local NVMe SSD cache layer to cloud-native data lakes. In the case study described here, a music-industry customer deployed more than 200 nodes caching nearly 10 million files, improved peak query performance by 46%, and cut backend bandwidth by 200 Gbps without code changes.

Tencent Cloud Developer

This article introduces GooseFS, a storage acceleration tool developed by Tencent Cloud's object storage team for next-generation cloud-native data lake scenarios. GooseFS implements the Hadoop-compatible FileSystem interface and targets two problems that big data and data lake platforms face in the cloud when storage and compute are separated: performance bottlenecks and network bandwidth costs.

The article focuses on how a major music-industry customer used GooseFS to improve the efficiency of its big data platform and significantly reduce costs. The customer's BI data warehouse platform, built on COS/CHDFS, faced rapidly growing data access bandwidth (reaching 700 Gbps) while still needing to raise read bandwidth and lower computing resource costs.

GooseFS was deployed as a local acceleration cache layer on the customer's idle NVMe SSD resources (approximately 500 TB). The solution improved peak query performance by 46% and reduced backend bandwidth by 200 Gbps. The article details GooseFS's core architecture, including multi-level storage media (RAM, SSD, HDD), metadata management using RocksDB, high availability through ZooKeeper and Raft, and Hive table and partition management.
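The idea of multi-level storage media can be illustrated with a toy model. The sketch below is not GooseFS code and makes several assumptions the article does not state: tiers are ordered fastest first, each tier evicts its least recently used entry, evicted entries are demoted to the next tier, and a hit in a slower tier promotes the entry back to the fastest one. The class name `TieredCache` and the `load_from_backend` callback are hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache (illustration only, not the GooseFS implementation)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # Dropped from the slowest tier; only the backend holds it now.
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # Mark as most recently used in this tier.
        if len(store) > cap:
            # Demote the least recently used entry to the next (slower) tier.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def get(self, key, load_from_backend):
        """Return (value, source), where source is a tier name or "backend"."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # Promote to the fastest tier.
                return value, name
        value = load_from_backend(key)  # Cache miss: read from remote storage.
        self._insert(0, key, value)
        return value, "backend"
```

With one slot per tier, for example `TieredCache([("MEM", 1), ("SSD", 1), ("HDD", 1)])`, reading `a`, then `b`, then `a` again serves the third read from the `SSD` tier rather than the backend, which is the effect a cache layer relies on to cut backend bandwidth.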

Key features discussed include transparent acceleration, which lets users access data without code changes; distributed load pre-warming, which prevents bandwidth spikes; and asyncCache optimization for indexed data files. The customer deployed more than 200 GooseFS nodes caching nearly 10 million files, demonstrating significant performance and cost benefits for its big data platform.
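One way to picture distributed pre-warming is as a planning step that spreads files across cache workers and loads them in bounded batches. The sketch below is an assumption for illustration, not the mechanism GooseFS uses: the article does not say how load is distributed. The functions `assign_files` and `batches`, the worker names, and the `cosn://example-bucket/...` paths are all hypothetical.

```python
import hashlib


def assign_files(paths, workers):
    """Map each path to exactly one worker using a stable hash,
    so every file is fetched from backend storage only once."""
    plan = {worker: [] for worker in workers}
    for path in paths:
        digest = hashlib.md5(path.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(workers)
        plan[workers[index]].append(path)
    return plan


def batches(items, size):
    """Yield fixed-size batches so a worker loads at most `size` files at a time,
    which bounds how much backend bandwidth it can request at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical file list and worker names.
    paths = [f"cosn://example-bucket/warehouse/part-{i:04d}" for i in range(10)]
    plan = assign_files(paths, ["worker-1", "worker-2", "worker-3"])
    for worker, files in plan.items():
        for batch in batches(files, 2):
            print(worker, batch)
```

Because the hash is stable, rerunning the plan sends each file to the same worker, and the batch size acts as a simple throttle on concurrent backend reads.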

Tags: performance optimization, Big Data, High Availability, cloud storage, data lake, metadata management, Cost Reduction, GooseFS