
Design and Architecture of Facebook Haystack Image Storage System

This article works out the scale of Facebook's image-storage problem and explains the Haystack architecture, detailing its three components — Directory, Store, and Cache — how it reduces disk I/O per read, how it manages metadata, and how it handles reads and writes across hundreds of billions of images, while also addressing CDN dependency and fault tolerance.

Art of Distributed System Architecture Design

We first calculate the scale of Facebook's image storage: 260 billion images occupy about 20 PB, averaging roughly 80 KB per image, with weekly additions of 1 billion images (≈60 TB) and a write rate of about 3,500 operations per second, while read traffic peaks at around one million requests per second. Images are write-once and read-many, and deletion does not modify existing files.
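The headline averages follow directly from the totals. A quick back-of-the-envelope check (decimal units, so ~77 KB rather than a round 80 KB) also shows that the weekly upload rate averages well under 3,500 writes per second, suggesting that figure describes peak load:

```python
# Sanity check of the scale numbers quoted above (decimal PB/TB/KB).
TOTAL_BYTES = 20e15      # ~20 PB of stored image data
TOTAL_IMAGES = 260e9     # ~260 billion images
WEEKLY_IMAGES = 1e9      # ~1 billion uploads per week
WEEKLY_BYTES = 60e12     # ~60 TB of uploads per week

avg_stored = TOTAL_BYTES / TOTAL_IMAGES        # ~77 KB per stored image
avg_upload = WEEKLY_BYTES / WEEKLY_IMAGES      # 60 KB per new upload
mean_writes = WEEKLY_IMAGES / (7 * 24 * 3600)  # ~1,650 writes/s on average
```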

The system has three key challenges: (1) storing massive metadata (≈100 bytes per image, ~26 TB in total); (2) reducing disk I/O per read, since a traditional Linux filesystem needs about three disk operations to read one file — one for the directory metadata, one for the inode, and one for the file contents; (3) caching images close to users, typically via a CDN.

The initial architecture used CDN caching in front of NAS appliances accessed over NFS, but it suffered from excessive disk I/O per read, could not keep all inode information cached in memory, and depended on external CDN providers.

Haystack’s new architecture addresses these issues by allowing multiple logical files to share a single physical file. It consists of three parts: Haystack Directory, Haystack Store, and Haystack Cache.

Haystack Store organizes storage into large physical volumes (e.g., 100 GB each). Each volume is a single physical file containing many "Needles" (image files) with embedded metadata. This reduces the number of files and inode entries dramatically, enabling efficient memory usage.

Haystack Directory maps logical volumes to physical volumes and provides image-ID allocation, write load balancing, CDN bypass decisions, and marking full volumes read-only. It is implemented as a replicated database fronted by a memcache layer.
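A hypothetical in-memory sketch of the Directory's two core lookups (the real Directory is a replicated database behind memcache; class and method names here are placeholders):

```python
import random

class Directory:
    """Toy Directory: logical volume -> physical replicas, plus write balancing."""

    def __init__(self):
        self.volumes = {}      # logical volume id -> list of physical machine ids
        self.writable = set()  # logical volumes still accepting writes

    def add_volume(self, lv, machines, writable=True):
        self.volumes[lv] = machines
        if writable:
            self.writable.add(lv)

    def mark_read_only(self, lv):
        # Full volumes stop taking writes but keep serving reads.
        self.writable.discard(lv)

    def pick_write_volume(self):
        # Load balancing: spread new writes across writable logical volumes.
        return random.choice(sorted(self.writable))

    def replicas(self, lv):
        return self.volumes[lv]
```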

Haystack Cache reduces dependence on third‑party CDNs by caching recently added images.

Read requests flow through the Directory to construct a URL that may bypass CDN and go directly to the Cache or Store. Write requests first obtain an image ID and a writable logical volume from the Directory, then write the image to multiple physical volumes for redundancy.
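In the Haystack paper the read URL takes the form `http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>`, and the Directory implements CDN bypass simply by omitting the CDN prefix. A minimal sketch, with placeholder host names:

```python
def build_read_url(machine_id, logical_volume, photo_id, use_cdn=True):
    """Construct a Haystack-style read URL; hosts are illustrative placeholders."""
    path = f"cache.example.com/{machine_id}/{logical_volume},{photo_id}"
    if use_cdn:
        return f"http://cdn.example.com/{path}"
    return f"http://{path}"  # CDN bypass: request goes straight to the Cache
```

Each layer strips its own prefix and forwards the rest, so the Store machine ultimately receives just the logical volume and photo ID it needs to locate the needle.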

Metadata for each image (key, alternate key, offset, size) occupies about 20 bytes. With an 8 TB disk per machine and an average image size of roughly 80 KB, a machine stores on the order of 100 million images, requiring about 2 GB of in-memory metadata — small enough to keep the entire index in RAM.
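The per-machine arithmetic, using the figures above (decimal units):

```python
# In-memory index sizing for one Store machine.
DISK_BYTES = 8e12    # 8 TB of disk per machine
AVG_IMAGE = 80e3     # ~80 KB average image
ENTRY_BYTES = 20     # key + alternate key + offset + size

images_per_machine = DISK_BYTES / AVG_IMAGE     # ~100 million images
index_bytes = images_per_machine * ENTRY_BYTES  # ~2 GB of metadata
```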

To handle node failures, each physical volume has an accompanying index file that stores needle metadata; writes update the volume first and the index asynchronously. Deletions are performed by appending a tombstone needle, with periodic compaction reclaiming space.
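Delete-and-compact can be sketched with needles modeled as `(key, deleted_flag, data)` records in an append-only list — a toy model, not the on-disk format:

```python
def delete(volume, key):
    """Deletion appends a tombstone; earlier copies stay on disk until compaction."""
    volume.append((key, True, b""))

def compact(volume):
    """Rewrite the volume, keeping only the latest live record for each key."""
    latest = {}
    for key, deleted, data in volume:  # later records override earlier ones
        latest[key] = (key, deleted, data)
    return [rec for rec in latest.values() if not rec[1]]
```

The append-only tombstone keeps the write path sequential; reclaiming the space is deferred to the periodic compaction pass, which also drops superseded duplicates.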

Fault handling covers node crashes, which require copying data over from replicas on other machines, and write failures, where a write succeeds only if every replica of the logical volume accepts it.

Compared with systems like Taobao TFS and Google GFS, Haystack shares the blob‑file‑system concept but differs in volume size, RAID usage, filesystem choice, and CDN strategy.

Tags: backend, Facebook, large scale, image storage, blob storage, Haystack
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
