Nebula: A Scalable Versioned Data Storage Platform for Airbnb Search Backends
Nebula is a schema‑less, versioned data storage service built at Airbnb. It unifies real‑time random access with offline batch processing, supporting low‑latency queries, incremental updates, and scalable snapshots on top of DynamoDB, HFileService, Spark pipelines, and Kafka streams.
Introduction
Rapid business growth created new challenges for Airbnb's search services, prompting the design of a generic, scalable storage platform that can serve multiple services with low latency and high throughput.
The platform must retain user behavior history for personalized search, provide real‑time updates, offer data snapshots for analytics, and support periodic aggregation and bulk feature imports.
These requirements led to the creation of Nebula, a storage platform that delivers low‑latency operations, incremental stream updates, efficient bulk data handling, scalability, and low maintenance cost.
What is Nebula?
Nebula is a schema‑less, versioned data storage service offering real‑time random data access and offline batch data management. It combines a dynamic store (DynamoDB) for incremental updates with a static snapshot store (HFileService) for bulk data.
Random Data Access Abstraction
Nebula provides a unified key‑value API that abstracts away the underlying physical stores, allowing applications to read and write without caring about whether data is real‑time or batch.
It uses a versioned columnar model similar to BigTable/HBase, enabling unlimited versions per cell and atomic operations.
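The versioned columnar model can be sketched in a few lines. The following is an illustrative toy, not Nebula's actual API: every cell is addressed by a row key and column, and keeps multiple timestamped versions, in the spirit of the BigTable/HBase data model. All class and method names here are assumptions.

```python
from dataclasses import dataclass, field
import time

@dataclass
class VersionedCell:
    # (timestamp, value) pairs, kept newest-first
    versions: list = field(default_factory=list)

    def put(self, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, max_versions=1):
        return self.versions[:max_versions]

class NebulaClient:
    """Toy client: applications see one key-value API regardless of
    whether data lives in the dynamic store or a static snapshot."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp=None):
        cell = self._rows.setdefault(row_key, {}).setdefault(column, VersionedCell())
        cell.put(value, timestamp)

    def get(self, row_key, column, max_versions=1):
        cell = self._rows.get(row_key, {}).get(column)
        return cell.get(max_versions) if cell else []
```

For example, writing two versions of the same cell and reading with `max_versions=2` returns both, newest first, which is what makes per-cell history and auditing possible.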
Built‑in Batch Data Processing
Offline pipelines generate snapshots from incremental data, merge them, and publish new snapshots without affecting online traffic.
Applications can define custom merge, compression, and scheduling policies for their data pipelines.
Architecture
A read request queries both the dynamic store (for the latest writes) and the static snapshot store; writes go only to the dynamic store, and snapshot updates are applied by swapping in new underlying snapshots. ZooKeeper coordinates storage metadata.
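The read-path merge can be sketched as follows. This is a hedged illustration under assumed store interfaces (plain dict-like objects mapping row keys to `{column: (timestamp, value)}`), not Nebula's internals: the result prefers whichever store holds the newer version of each column.

```python
def merged_read(row_key, dynamic_store, static_store):
    """Combine a row from both stores into {column: (timestamp, value)},
    letting newer dynamic writes shadow the static snapshot."""
    result = dict(static_store.get(row_key, {}))
    for column, (ts, value) in dynamic_store.get(row_key, {}).items():
        current = result.get(column)
        if current is None or ts >= current[0]:
            result[column] = (ts, value)
    return result
```

Because snapshot swaps only change which `static_store` the reader consults, online reads keep working unchanged while new snapshots are published underneath.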
Dynamic Data Store (DynamoDB)
DynamoDB was chosen for its low latency and ease of management; tables are sharded daily to keep size manageable and maintain high QPS.
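Daily sharding can be sketched as a date-based table naming scheme. The convention below is an assumption for illustration, not Nebula's actual naming: each day's incremental writes land in their own table, keeping any single table small, and readers fan out over the recent daily shards until a snapshot folds them into the static store.

```python
from datetime import date

def table_for_day(base_name: str, day: date) -> str:
    """Hypothetical naming scheme: one DynamoDB table per day."""
    return f"{base_name}_{day.strftime('%Y%m%d')}"

def tables_to_query(base_name: str, days: list) -> list:
    """Readers query the daily shards not yet merged into a snapshot."""
    return [table_for_day(base_name, d) for d in days]
```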
Batch Data Store (HFileService)
Static snapshots are stored as HFiles on a cluster that serves them from local disks with low latency and high throughput; swapping in a new snapshot has minimal impact on online read traffic.
Offline Pipelines for Snapshots, Compression, and Custom Logic
Incremental data is exported to S3, then a Spark job merges it with historical data and custom offline data to produce new snapshots, which are stored back in S3 and the historical store.
Snapshots on S3 are also used for downstream analytics.
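The core merge step of the snapshot pipeline can be shown in plain Python rather than Spark, as a minimal sketch: combine the previous snapshot, the newly exported incremental data, and optional custom offline data, keeping the newest version of each (row, column). A real pipeline would run this logic per partition inside a Spark job; the data layout here is assumed.

```python
def merge_snapshot(previous, incremental, offline=None):
    """Each input maps row_key -> {column: (timestamp, value)}.
    Later (newer-timestamped) versions win; ties go to the later source."""
    merged = {}
    for source in (previous, incremental, offline or {}):
        for row_key, columns in source.items():
            row = merged.setdefault(row_key, {})
            for column, (ts, value) in columns.items():
                if column not in row or ts >= row[column][0]:
                    row[column] = (ts, value)
    return merged
```

Applications could plug custom merge or compression policies in at exactly this point, which is why Nebula lets them define those policies per pipeline.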
Streaming Update Output
Nebula streams updates outward by reading DynamoDB Streams through a Kinesis consumer and publishing the changes to Kafka, so that interested services can subscribe.
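The translation step in that path can be sketched as a pure function that turns a DynamoDB Streams change record into a Kafka-style (key, value) message. The input shape follows DynamoDB Streams' `Keys`/`NewImage` attribute-value format; the output schema and the `row_key` attribute name are assumptions, not Nebula's actual wire format.

```python
import json

def stream_record_to_message(record):
    """Map a DynamoDB Streams record to a (key_bytes, value_bytes) pair
    suitable for handing to a Kafka producer."""
    key = record["dynamodb"]["Keys"]["row_key"]["S"]
    new_image = record["dynamodb"].get("NewImage", {})
    payload = {
        "row_key": key,
        "event": record["eventName"],  # INSERT / MODIFY / REMOVE
        "columns": {k: v.get("S") for k, v in new_image.items()},
    }
    return key.encode(), json.dumps(payload, sort_keys=True).encode()
```

Keying messages by row key keeps all updates for one row in a single Kafka partition, so subscribers see them in order.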
Other Scenario: Search Index Infrastructure
Nebula powers Airbnb’s search index rebuild, providing end‑to‑end low‑latency operations, batch feature integration, real‑time features, offline index generation, fast rollbacks, rapid scaling of search instances, and auditability.
The versioned columnar storage enables audit of search documents, while batch tasks allow offline feature generation and direct deployment to the search service.
Outlook
Beyond search, Nebula supports many other services, such as a pricing data warehouse, handling several terabytes of data at an average latency of 10 ms. Future plans include tighter integration with the Hive data warehouse and broader data sharing for analytics.
Acknowledgements
Thanks to the many contributors who helped build Nebula, including Alex Guziel, Jun He, Liyin Tang, Jingwei Lu, and numerous teams across search, application, and data infrastructure.