
Design and Architecture of Facebook Haystack Image Storage System

This article works out the scale of Facebook's image-storage problem and explains the Haystack architecture, detailing its three components — Directory, Store, and Cache — how it reduces disk I/O per read, how it manages metadata, and how it handles reads and writes across hundreds of billions of images, while also addressing CDN dependency and fault tolerance.

Art of Distributed System Architecture Design

We first calculate the scale of Facebook's image storage: 260 billion images occupy about 20 PB, averaging roughly 80 KB per image, with weekly additions of 1 billion images (≈60 TB) and a write rate of about 3,500 operations per second, while read traffic peaks at around one million requests per second. Images are write-once and read-many, and deletion does not modify existing files.
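The headline averages follow directly from the totals. A quick back-of-the-envelope check (decimal units, so ~77 KB rather than a round 80 KB) also shows that the weekly upload rate averages well under 3,500 writes per second, suggesting that figure describes peak load:

```python
# Sanity check of the scale numbers quoted above (decimal PB/TB/KB).
TOTAL_BYTES = 20e15      # ~20 PB of stored image data
TOTAL_IMAGES = 260e9     # ~260 billion images
WEEKLY_IMAGES = 1e9      # ~1 billion uploads per week
WEEKLY_BYTES = 60e12     # ~60 TB of uploads per week

avg_stored = TOTAL_BYTES / TOTAL_IMAGES        # ~77 KB per stored image
avg_upload = WEEKLY_BYTES / WEEKLY_IMAGES      # 60 KB per new upload
mean_writes = WEEKLY_IMAGES / (7 * 24 * 3600)  # ~1,650 writes/s on average
```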

The system has three key challenges: (1) storing massive metadata (≈100 bytes per image, ~26 TB in total); (2) reducing disk I/O per read, since a traditional Linux filesystem needs about three disk operations to read one file — one for the directory metadata, one for the inode, and one for the file contents; (3) caching images close to users, typically via a CDN.

The initial architecture used CDN caching in front of NAS appliances accessed over NFS, but it suffered from excessive disk I/O per read, could not keep all inode information cached in memory, and depended on external CDN providers.

Haystack’s new architecture addresses these issues by allowing multiple logical files to share a single physical file. It consists of three parts: Haystack Directory, Haystack Store, and Haystack Cache.

Haystack Store organizes storage into large physical volumes (e.g., 100 GB each). Each volume is a single physical file containing many "Needles" (image files) with embedded metadata. This reduces the number of files and inode entries dramatically, enabling efficient memory usage.

Haystack Directory maps logical volumes to physical volumes and provides image-ID allocation, write load balancing, CDN bypass decisions, and marking full volumes read-only. It is implemented as a replicated database fronted by a memcache layer.
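A hypothetical in-memory sketch of the Directory's two core lookups (the real Directory is a replicated database behind memcache; class and method names here are placeholders):

```python
import random

class Directory:
    """Toy Directory: logical volume -> physical replicas, plus write balancing."""

    def __init__(self):
        self.volumes = {}      # logical volume id -> list of physical machine ids
        self.writable = set()  # logical volumes still accepting writes

    def add_volume(self, lv, machines, writable=True):
        self.volumes[lv] = machines
        if writable:
            self.writable.add(lv)

    def mark_read_only(self, lv):
        # Full volumes stop taking writes but keep serving reads.
        self.writable.discard(lv)

    def pick_write_volume(self):
        # Load balancing: spread new writes across writable logical volumes.
        return random.choice(sorted(self.writable))

    def replicas(self, lv):
        return self.volumes[lv]
```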

Haystack Cache reduces dependence on third‑party CDNs by caching recently added images.

Read requests flow through the Directory to construct a URL that may bypass CDN and go directly to the Cache or Store. Write requests first obtain an image ID and a writable logical volume from the Directory, then write the image to multiple physical volumes for redundancy.
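In the Haystack paper the read URL takes the form `http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>`, and the Directory implements CDN bypass simply by omitting the CDN prefix. A minimal sketch, with placeholder host names:

```python
def build_read_url(machine_id, logical_volume, photo_id, use_cdn=True):
    """Construct a Haystack-style read URL; hosts are illustrative placeholders."""
    path = f"cache.example.com/{machine_id}/{logical_volume},{photo_id}"
    if use_cdn:
        return f"http://cdn.example.com/{path}"
    return f"http://{path}"  # CDN bypass: request goes straight to the Cache
```

Each layer strips its own prefix and forwards the rest, so the Store machine ultimately receives just the logical volume and photo ID it needs to locate the needle.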

Metadata for each image (key, alternate key, offset, size) occupies about 20 bytes. With an 8 TB disk per machine and an average image size of roughly 80 KB, a machine stores on the order of 100 million images, requiring about 2 GB of in-memory metadata — small enough to keep the entire index in RAM.
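The per-machine arithmetic, using the figures above (decimal units):

```python
# In-memory index sizing for one Store machine.
DISK_BYTES = 8e12    # 8 TB of disk per machine
AVG_IMAGE = 80e3     # ~80 KB average image
ENTRY_BYTES = 20     # key + alternate key + offset + size

images_per_machine = DISK_BYTES / AVG_IMAGE     # ~100 million images
index_bytes = images_per_machine * ENTRY_BYTES  # ~2 GB of metadata
```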

To handle node failures, each physical volume has an accompanying index file that stores needle metadata; writes update the volume first and the index asynchronously. Deletions are performed by appending a tombstone needle, with periodic compaction reclaiming space.
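Delete-and-compact can be sketched with needles modeled as `(key, deleted_flag, data)` records in an append-only list — a toy model, not the on-disk format:

```python
def delete(volume, key):
    """Deletion appends a tombstone; earlier copies stay on disk until compaction."""
    volume.append((key, True, b""))

def compact(volume):
    """Rewrite the volume, keeping only the latest live record for each key."""
    latest = {}
    for key, deleted, data in volume:  # later records override earlier ones
        latest[key] = (key, deleted, data)
    return [rec for rec in latest.values() if not rec[1]]
```

The append-only tombstone keeps the write path sequential; reclaiming the space is deferred to the periodic compaction pass, which also drops superseded duplicates.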

Fault handling covers node crashes, which require copying data over from replicas on other machines, and write failures, where a write succeeds only if every replica of the logical volume accepts it.

Compared with systems like Taobao TFS and Google GFS, Haystack shares the blob‑file‑system concept but differs in volume size, RAID usage, filesystem choice, and CDN strategy.

Tags: backend, Facebook, large scale, image storage, blob storage, Haystack
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
