Design Choices for Distributed Storage Metadata: Comparing GlusterFS, Hadoop, GridFS, HBase, and FastDFS
The article examines various distributed storage design approaches—decentralized (GlusterFS), centralized (Hadoop), database‑based (GridFS and HBase), and metadata‑bypassing (FastDFS)—detailing their advantages, drawbacks, and practical considerations for cloud storage systems.
Although the storage layer and upload/download mechanisms matter in distributed storage, the choices and trade‑offs in metadata management are even more critical. At QCon Beijing 2015, Qiniu Cloud Storage chief architect Li Daobing shared several storage design methods and analyzed their pros and cons; this article is compiled from that talk. It covers four approaches:
Decentralized storage designs, such as GlusterFS.
Centralized storage designs, such as Hadoop.
Database‑based storage designs, such as GridFS and HBase.
Metadata‑bypassing designs, such as FastDFS.
Decentralized Storage Design: GlusterFS
GlusterFS is rarely used in Internet services but is common in large‑scale computing. Its design features include:
POSIX compatibility: applications can run without modification.
No central node: eliminates single‑point failures and performance bottlenecks.
Unlimited scalability: adding more servers does not affect the architecture.
We focus on the typical challenges of a center‑less design.
1. Bad disk repair. In a center‑less system the location of any key can be computed, but when a disk fails the naive approach of replacing the disk and copying all of its data back can take many hours for a large disk, which is unacceptable in a big cluster. Partitioning each disk into many small zones (e.g., 50 zones of 80 GB on a 4 TB disk) and tracking per‑zone metadata lets the cluster repair zones in parallel on different machines, so repair can start immediately and finish in about 13 minutes instead of 8‑9 hours.
2. Scaling. Expanding a cluster from 100 to 200 nodes requires moving roughly half of the data, which can cause network congestion and complex read/write logic. The simplest mitigation is extensive testing and code quality; another approach is to avoid scaling, which conflicts with typical Internet growth patterns.
3. Lack of heterogeneous storage support. Different customers have vastly different I/O requirements. Since keys are hashed, the system cannot know where small files reside, so improving IOPS requires upgrading all machines or massively expanding the cluster, both of which raise cost and complexity.
4. Data inconsistency. Overwriting a key involves deleting the old file and writing a new one; if only some replicas are updated, readers may see corrupted data. A common solution is to write to a temporary file and rename atomically, but this adds considerable complexity in a distributed environment.
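On a single node, the write‑then‑rename trick in item 4 can be sketched as below; the hard part the talk alludes to is coordinating this across replicas, which no local rename can solve.

```python
import os
import tempfile

def atomic_overwrite(path, data):
    """Write data to a temp file in the same directory, then
    atomically rename it over the destination. Readers see either
    the old content or the new content, never a partial write."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes reach the disk
        os.replace(tmp, path)      # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

`os.replace` guarantees atomicity only within a single filesystem, which is exactly why the same idea becomes complex once the two "files" live on different machines.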
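A quick sanity check of the repair numbers in item 1; the copy rates here are assumptions chosen to match the talk's figures, not measurements.

```python
# Back-of-the-envelope repair times for a failed 4 TB disk,
# assuming ~130 MB/s for a serial whole-disk rebuild and
# ~100 MB/s per zone (both rates are assumed).

TB = 10**12
GB = 10**9
MB = 10**6

def repair_hours(bytes_to_copy, rate_bytes_per_s):
    return bytes_to_copy / rate_bytes_per_s / 3600

# Naive rebuild: one replacement disk receives all 4 TB serially.
naive_h = repair_hours(4 * TB, 130 * MB)           # ~8.5 hours

# Zoned rebuild: 50 zones of 80 GB repaired in parallel on
# different machines; wall-clock time is one zone's copy time.
zoned_min = repair_hours(80 * GB, 100 * MB) * 60   # ~13 minutes

print(f"naive: {naive_h:.1f} h, zoned: {zoned_min:.1f} min")
```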
Centralized Storage Design: Hadoop
Hadoop is designed for storing large files, offline analytics, and high scalability. Its metadata service (the NameNode) follows a master‑slave model and keeps all metadata in memory for fast access; because the target workloads are offline, some availability can be traded away to further improve response performance.
Advantages of Hadoop include:
1. Serves large files well. With a 64 MB block size, 10 PB of data produces about 1.6 × 10⁸ block entries, needing only about 32 GB of NameNode memory; at 1 000 QPS, 64 MB per request works out to roughly 512 Gb/s of throughput.
2. Supports offline workloads, where high availability can be partially sacrificed.
3. Scalable storage nodes. Storage nodes can be added without scaling the metadata nodes.
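The sizing math behind point 1 can be checked quickly; the ~200 bytes of in‑memory metadata per block is an assumption consistent with the figures above, not an exact NameNode cost.

```python
# Rough NameNode sizing math, assuming ~200 bytes of in-memory
# metadata per block entry (the real per-block cost depends on
# replication factor and path lengths).

MB = 10**6
PB = 10**15
BLOCK = 64 * MB
META_PER_BLOCK = 200          # bytes, assumed

blocks = 10 * PB // BLOCK     # block entries for 10 PB of data
meta_gb = blocks * META_PER_BLOCK / 10**9

qps = 1000
throughput_gbit = qps * BLOCK * 8 / 10**9   # whole-block reads

print(f"{blocks:.2e} blocks, ~{meta_gb:.0f} GB metadata, "
      f"{throughput_gbit:.0f} Gb/s at {qps} QPS")
```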
However, Hadoop is not well‑suited for public‑cloud services because:
1. Metadata capacity is too small for small‑file workloads. At roughly 200 bytes per entry, 1.6 × 10⁸ metadata entries consume 32 GB of memory; the 10 × 10⁹ objects a public cloud must hold would need about 2 TB, far beyond what a single NameNode can keep in memory.
2. Metadata node cannot scale. A single NameNode cannot handle the QPS required by large cloud platforms.
3. High availability is imperfect. The NameNode becomes a bottleneck under heavy load.
To overcome these issues, many cloud storage systems replace the single‑center design with quorum schemes such as NWR: data is kept on N replicas, a write succeeds once W replicas acknowledge it, and a read consults R replicas, with W + R > N so that every read quorum overlaps every write quorum.
Key considerations for NWR include choosing appropriate N, W, and R values and handling partial‑write failures, which can otherwise leave replicas inconsistent.
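The quorum idea can be sketched in a few lines; this is a toy single‑process model (in‑memory dicts standing in for replicas), not a real replication protocol.

```python
# Minimal NWR sketch: a write succeeds once W of N replicas ack it;
# a read consults R replicas and returns the highest-versioned value.
# W + R > N guarantees the read quorum overlaps the write quorum.

N, W, R = 3, 2, 2
assert W + R > N, "quorums must overlap"

replicas = [{} for _ in range(N)]   # each: key -> (version, value)

def write(key, value, version, reachable):
    """Write to reachable replicas; succeed iff at least W acked."""
    acks = 0
    for i in reachable:
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= W

def read(key, reachable):
    """Read from up to R reachable replicas, return the newest value."""
    seen = [replicas[i][key] for i in reachable[:R] if key in replicas[i]]
    return max(seen)[1] if seen else None

# Replica 2 misses the write; the overlap rule still lets a later
# read that touches replica 1 see the value.
assert write("k", "v1", 1, reachable=[0, 1])
print(read("k", reachable=[1, 2]))
```

The sketch also shows the failure mode the talk warns about: a write that reaches fewer than W replicas returns failure to the client, yet has already left some replicas updated and others stale.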
Database‑Based Distributed Storage Design: GridFS and HBase
GridFS
GridFS, built on MongoDB, splits each file into 255 KB chunks stored in the fs.chunks collection, with per‑file metadata kept in fs.files.
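The chunking scheme can be sketched as follows; the dicts only mimic the shape of the fs.files and fs.chunks documents (numChunks, for instance, is a convenience field added here, not part of the real schema).

```python
# GridFS-style chunking: file bytes go into fixed-size chunks keyed
# by (files_id, n); a separate metadata document records the length
# and chunk size.

CHUNK_SIZE = 255 * 1024   # GridFS default chunk size

def to_chunks(files_id, data, chunk_size=CHUNK_SIZE):
    chunks = [
        {"files_id": files_id, "n": n, "data": data[i:i + chunk_size]}
        for n, i in enumerate(range(0, len(data), chunk_size))
    ]
    file_doc = {"_id": files_id, "length": len(data),
                "chunkSize": chunk_size, "numChunks": len(chunks)}
    return file_doc, chunks

doc, chunks = to_chunks("f1", b"x" * (600 * 1024))   # a 600 KiB file
print(doc["numChunks"])   # 3 chunks: 255 + 255 + 90 KiB
```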
Advantages:
Provides both database and file persistence in a single system, reducing operational cost.
Inherits MongoDB benefits: online storage, high availability, scalability, and cross‑datacenter backup.
Supports Range GET and space reclamation via MongoDB’s periodic maintenance.
Disadvantages:
Oplog exhaustion: heavy file writes can roll over MongoDB’s capped oplog faster than secondaries can replay it, forcing a full resync.
Memory waste: MongoDB’s MMAPv1 engine maps data files into memory, so files that are read only once evict hotter data from the page cache.
Scalability challenges: sharding requires a custom files_id generation scheme; otherwise writes concentrate on a single shard.
HBase
Given Hadoop’s metadata limitations, HBase is examined as an alternative for small‑file storage.
Advantages:
Scalability and high availability are handled at the lower layer.
Virtually unlimited capacity.
Disadvantages:
Availability issues: NameNode HA problems and Region splits/merges can cause temporary unavailability.
Poor support for large files (>10 MB); a common workaround is to store the bulk data elsewhere and keep only the file name, offset, and size in HBase.
Typical mitigation is to store metadata in HBase while keeping bulk data in Hadoop, though Hadoop’s HA concerns remain.
Bypassing Metadata Design: FastDFS
FastDFS relieves the metadata node by encoding location information into the file ID itself. For example, in the URL group1/M00/00/00/rBAXr1AJGF_3rCZAAAAEc45MdM850_big.txt, the tracker only needs to map the group name to a storage server; the rest of the path carries the metadata needed to locate the file on disk.
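Parsing such a file ID is trivial, which is the point: almost everything a lookup needs is in the name. The field meanings below follow FastDFS's documented layout (group, store path, two‑level directory, encoded filename), but the split is a simplification.

```python
# Illustrative parsing of a FastDFS-style file ID. Only the group
# name requires a tracker lookup; the rest of the path is location
# metadata carried in the key itself.

def parse_file_id(file_id):
    group, store_path, d1, d2, filename = file_id.split("/")
    return {
        "group": group,            # tracker maps this to a storage server
        "store_path": store_path,  # e.g. M00 = store path index 0
        "dirs": (d1, d2),          # two-level hash directories on disk
        "filename": filename,      # encodes server IP, timestamp, size, etc.
    }

info = parse_file_id(
    "group1/M00/00/00/rBAXr1AJGF_3rCZAAAAEc45MdM850_big.txt")
print(info["group"], info["store_path"])
```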
Advantages:
Simple architecture with low metadata node load.
Easy scaling without rebalancing.
Disadvantages:
Key cannot be customized, limiting multi‑tenant flexibility.
Repair speed depends on disk write throughput; a 4 TB disk at 100 MB/s may take over 11 hours.
Large‑file handling is poor; files are not sharded, causing single‑disk bottlenecks.
Summary
Various storage designs each have distinct strengths and weaknesses: decentralized designs like GlusterFS face repair and consistency challenges; centralized designs like Hadoop suffer from metadata scalability and HA limits; database‑based solutions such as GridFS and HBase bring their own operational complexities; and metadata‑bypassing approaches like FastDFS simplify architecture but limit flexibility and large‑file performance.
Disclaimer: The content originates from public internet sources; the author remains neutral and provides it for reference and discussion only. Copyright belongs to the original author or organization; please contact for removal if infringement occurs.