Nebula: A Scalable Versioned Data Storage Platform for Airbnb Search Backends
Nebula is a schema‑less, versioned data storage service built at Airbnb. It unifies real‑time random access with offline batch processing, supporting low‑latency queries, incremental updates, and scalable snapshots on top of DynamoDB, HFileService, Spark pipelines, and Kafka streams.
Introduction
Rapid business growth created new challenges for Airbnb's search services, prompting the design of a generic, scalable storage platform that can serve multiple services with low latency and high throughput.
The platform must retain user behavior history for personalized search, provide real‑time updates, offer data snapshots for analytics, and support periodic aggregation and bulk feature imports.
These requirements led to the creation of Nebula, a storage platform that delivers low‑latency operations, incremental stream updates, efficient bulk data handling, scalability, and low maintenance cost.
What is Nebula?
Nebula is a schema‑less, versioned data storage service offering real‑time random data access and offline batch data management. It combines a dynamic store (DynamoDB) for incremental updates with a static snapshot store (HFileService) for bulk data.
Random Data Access Abstraction
Nebula provides a unified key‑value API that abstracts away the underlying physical stores, allowing applications to read and write without caring about whether data is real‑time or batch.
It uses a versioned columnar model similar to BigTable/HBase, enabling unlimited versions per cell and atomic operations.
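The versioned columnar model can be sketched in a few lines. The following is an illustrative toy, not Nebula's actual API: every cell is addressed by a row key and column, and keeps multiple timestamped versions, in the spirit of the BigTable/HBase data model. All class and method names here are assumptions.

```python
from dataclasses import dataclass, field
import time

@dataclass
class VersionedCell:
    # (timestamp, value) pairs, kept newest-first
    versions: list = field(default_factory=list)

    def put(self, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, max_versions=1):
        return self.versions[:max_versions]

class NebulaClient:
    """Toy client: applications see one key-value API regardless of
    whether data lives in the dynamic store or a static snapshot."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp=None):
        cell = self._rows.setdefault(row_key, {}).setdefault(column, VersionedCell())
        cell.put(value, timestamp)

    def get(self, row_key, column, max_versions=1):
        cell = self._rows.get(row_key, {}).get(column)
        return cell.get(max_versions) if cell else []
```

For example, writing two versions of the same cell and reading with `max_versions=2` returns both, newest first, which is what makes per-cell history and auditing possible.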
Built‑in Batch Data Processing
Offline pipelines generate snapshots from incremental data, merge them, and publish new snapshots without affecting online traffic.
Applications can define custom merge, compression, and scheduling policies for their data pipelines.
Architecture
A read request queries both the dynamic store (for the latest writes) and the static snapshot store; writes go only to the dynamic store, and snapshot updates are applied by swapping in new underlying snapshots. ZooKeeper coordinates storage metadata.
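The read-path merge can be sketched as follows. This is a hedged illustration under assumed store interfaces (plain dict-like objects mapping row keys to `{column: (timestamp, value)}`), not Nebula's internals: the result prefers whichever store holds the newer version of each column.

```python
def merged_read(row_key, dynamic_store, static_store):
    """Combine a row from both stores into {column: (timestamp, value)},
    letting newer dynamic writes shadow the static snapshot."""
    result = dict(static_store.get(row_key, {}))
    for column, (ts, value) in dynamic_store.get(row_key, {}).items():
        current = result.get(column)
        if current is None or ts >= current[0]:
            result[column] = (ts, value)
    return result
```

Because snapshot swaps only change which `static_store` the reader consults, online reads keep working unchanged while new snapshots are published underneath.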
Dynamic Data Store (DynamoDB)
DynamoDB was chosen for its low latency and ease of management; tables are sharded daily to keep size manageable and maintain high QPS.
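Daily sharding can be sketched as a date-based table naming scheme. The convention below is an assumption for illustration, not Nebula's actual naming: each day's incremental writes land in their own table, keeping any single table small, and readers fan out over the recent daily shards until a snapshot folds them into the static store.

```python
from datetime import date

def table_for_day(base_name: str, day: date) -> str:
    """Hypothetical naming scheme: one DynamoDB table per day."""
    return f"{base_name}_{day.strftime('%Y%m%d')}"

def tables_to_query(base_name: str, days: list) -> list:
    """Readers query the daily shards not yet merged into a snapshot."""
    return [table_for_day(base_name, d) for d in days]
```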
Batch Data Store (HFileService)
Static snapshots are stored as HFiles on a cluster that serves them from local disks with low latency and high throughput; swapping in a new snapshot has minimal impact on online read traffic.
Offline Pipelines for Snapshots, Compression, and Custom Logic
Incremental data is exported to S3, then a Spark job merges it with historical data and custom offline data to produce new snapshots, which are stored back in S3 and the historical store.
Snapshots on S3 are also used for downstream analytics.
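The core merge step of the snapshot pipeline can be shown in plain Python rather than Spark, as a minimal sketch: combine the previous snapshot, the newly exported incremental data, and optional custom offline data, keeping the newest version of each (row, column). A real pipeline would run this logic per partition inside a Spark job; the data layout here is assumed.

```python
def merge_snapshot(previous, incremental, offline=None):
    """Each input maps row_key -> {column: (timestamp, value)}.
    Later (newer-timestamped) versions win; ties go to the later source."""
    merged = {}
    for source in (previous, incremental, offline or {}):
        for row_key, columns in source.items():
            row = merged.setdefault(row_key, {})
            for column, (ts, value) in columns.items():
                if column not in row or ts >= row[column][0]:
                    row[column] = (ts, value)
    return merged
```

Applications could plug custom merge or compression policies in at exactly this point, which is why Nebula lets them define those policies per pipeline.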
Streaming Update Output
Nebula streams updates outward by reading DynamoDB Streams through a Kinesis consumer and publishing the changes to Kafka, so that interested services can subscribe.
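The translation step in that path can be sketched as a pure function that turns a DynamoDB Streams change record into a Kafka-style (key, value) message. The input shape follows DynamoDB Streams' `Keys`/`NewImage` attribute-value format; the output schema and the `row_key` attribute name are assumptions, not Nebula's actual wire format.

```python
import json

def stream_record_to_message(record):
    """Map a DynamoDB Streams record to a (key_bytes, value_bytes) pair
    suitable for handing to a Kafka producer."""
    key = record["dynamodb"]["Keys"]["row_key"]["S"]
    new_image = record["dynamodb"].get("NewImage", {})
    payload = {
        "row_key": key,
        "event": record["eventName"],  # INSERT / MODIFY / REMOVE
        "columns": {k: v.get("S") for k, v in new_image.items()},
    }
    return key.encode(), json.dumps(payload, sort_keys=True).encode()
```

Keying messages by row key keeps all updates for one row in a single Kafka partition, so subscribers see them in order.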
Other Scenario: Search Index Infrastructure
Nebula powers Airbnb’s search index rebuild, providing end‑to‑end low‑latency operations, batch feature integration, real‑time features, offline index generation, fast rollbacks, rapid scaling of search instances, and auditability.
The versioned columnar storage enables audit of search documents, while batch tasks allow offline feature generation and direct deployment to the search service.
Outlook
Beyond search, Nebula supports many other services, such as a pricing data warehouse, handling several terabytes of data at an average latency of 10 ms. Future plans include tighter integration with the Hive data warehouse and broader data sharing for analytics.
Acknowledgements
Thanks to the many contributors who helped build Nebula, including Alex Guziel, Jun He, Liyin Tang, Jingwei Lu, and numerous teams across search, application, and data infrastructure.