
Applying Deep Learning to Airbnb Search: Model Evolution, Feature Engineering, and System Insights

This article reviews the Airbnb search ranking paper, detailing offline and online performance gains, the progression from SimpleNN to LambdaRankNN, GBDT/FM NN, and Deep NN models, failed embedding attempts, extensive feature engineering practices, and the production system architecture that enabled large‑scale deep learning deployment.

DataFunTalk

Results Overview

Both offline and online performance are evaluated with NDCG (Normalized Discounted Cumulative Gain) as the primary metric.
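For reference, NDCG over a ranked list can be computed as below. This is an illustrative sketch with binary booked/not-booked relevance labels and the standard log2 discount; the paper does not spell out its exact gain/discount choices.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the model's ordering divided by the ideal (sorted) DCG."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A booked listing (rel = 1) ranked 3rd among 4 results:
score = ndcg([0, 0, 1, 0])  # -> 0.5
```

Pushing the booked listing toward the top raises the score toward 1.0, which is exactly what the ranking models below are optimized to do.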

Model Evolution

SimpleNN

One hidden layer with 32 ReLU units.

Features identical to those used by GBDT.

Training objective matches GBDT: minimize mean squared error (MSE) with labels booked = 1 and not booked = 0.

Conclusion: SimpleNN yields only modest improvement over GBDT and validates NN feasibility online.
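The setup above amounts to a pointwise regression. A minimal NumPy sketch, assuming random placeholder weights and a stand-in feature count (the real model used the GBDT feature set):

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_nn(x, w1, b1, w2, b2):
    """One hidden layer of 32 ReLU units, single linear output score."""
    h = np.maximum(0.0, x @ w1 + b1)   # ReLU hidden layer
    return h @ w2 + b2                 # one score per example

n_features = 10                        # stand-in for the GBDT feature set
w1 = rng.normal(scale=0.1, size=(n_features, 32)); b1 = np.zeros(32)
w2 = rng.normal(scale=0.1, size=(32, 1));          b2 = np.zeros(1)

x = rng.normal(size=(4, n_features))           # 4 example listings
y = np.array([[1.0], [0.0], [0.0], [1.0]])     # booked = 1, not booked = 0

pred = simple_nn(x, w1, b1, w2, b2)
mse = float(np.mean((pred - y) ** 2))          # the GBDT-matching objective
```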

LambdaRankNN

Switch to a pairwise loss: minimize cross‑entropy over (booked, not‑booked) listing pairs during training.

Weight each pair's loss by its delta‑NDCG — the NDCG change incurred by swapping the pair — bringing listwise information into the pairwise objective.

Conclusion: Small offline NDCG gain, large online improvement.
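The weighted pairwise loss can be sketched as follows; this illustrates the idea (logistic cross-entropy on the score gap, scaled by |delta-NDCG|), not Airbnb's implementation:

```python
import math

def pairwise_lambdarank_loss(score_booked, score_other, delta_ndcg):
    """Cross-entropy that the booked listing outranks the other, weighted
    by the |NDCG change| from swapping the pair."""
    diff = score_booked - score_other
    ce = math.log(1.0 + math.exp(-diff))   # logistic cross-entropy on the gap
    return abs(delta_ndcg) * ce

# Swapping a pair near the top of the list costs more NDCG, so that pair
# contributes a proportionally larger loss:
low = pairwise_lambdarank_loss(2.0, 0.5, delta_ndcg=0.05)
high = pairwise_lambdarank_loss(2.0, 0.5, delta_ndcg=0.50)
```

The weighting focuses gradient updates on mis-ordered pairs that matter most to the final ranking.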

GBDT/FM NN

Use the leaf index of each GBDT tree as a categorical feature for the NN.

Feed FM predicted click‑through probability directly as a feature.

Single hidden layer with ReLU activation.

Conclusion: GBDT, FM, and SimpleNN perform similarly offline, but their ranking results differ; fusing the three models yields higher online revenue.
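The leaf-index trick above can be sketched as a one-hot encoding (in practice an embedding lookup serves the same role); the tree/leaf counts here are made up for illustration:

```python
def leaf_indices_to_onehot(leaf_indices, leaves_per_tree):
    """Each GBDT tree routes an example to one leaf; concatenate a one-hot
    encoding of every tree's leaf index into a categorical NN feature."""
    encoded = []
    for tree, leaf in enumerate(leaf_indices):
        onehot = [0] * leaves_per_tree[tree]
        onehot[leaf] = 1
        encoded.extend(onehot)
    return encoded

# 3 trees with 4 leaves each; an example lands in leaves 2, 0, 3:
features = leaf_indices_to_onehot([2, 0, 3], [4, 4, 4])
# features has length 12 with exactly one 1 per tree
```

Each leaf corresponds to a learned partition of the feature space, so the NN inherits the GBDT's nonlinear splits for free.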

Deep NN

195 input features (categorical features are embedded).

Two hidden layers: first with 127 units, second with 83 units, both using ReLU.

Significant gains appear after scaling the training data tenfold.

Conclusion: Both offline and online see large gains; training data of 1 billion examples eliminates the offline‑online performance gap, highlighting the importance of data volume for deep learning.
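The architecture described above (195 inputs, hidden layers of 127 and 83 ReLU units) is small enough to sketch directly; the weights here are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_nn(x, params):
    """Two ReLU hidden layers (127 then 83 units) over 195 inputs,
    producing one ranking score per listing."""
    h1 = np.maximum(0.0, x @ params["w1"] + params["b1"])
    h2 = np.maximum(0.0, h1 @ params["w2"] + params["b2"])
    return h2 @ params["w3"] + params["b3"]

params = {
    "w1": rng.normal(scale=0.05, size=(195, 127)), "b1": np.zeros(127),
    "w2": rng.normal(scale=0.05, size=(127, 83)),  "b2": np.zeros(83),
    "w3": rng.normal(scale=0.05, size=(83, 1)),    "b3": np.zeros(1),
}

scores = deep_nn(rng.normal(size=(8, 195)), params)  # 8 listings -> 8 scores
```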

Failed Attempts

Listing‑embedding attempts (list2vec) suffered severe over‑fitting: most individual listings are booked too rarely to support an item‑level embedding, and online performance was poor.

A rescue attempt used multi‑task training on booking and view time, sharing hidden layers so that the far more abundant view‑duration signal could regularize the booking task.

Conclusion: The long‑view experiments substantially improved view metrics online, but bookings did not increase; manual analysis showed long views skew toward high‑price, quirky, or long‑description listings rather than bookable ones.

Feature Engineering

Key observations:

GBDT is insensitive to feature scaling, while deep learning is highly sensitive to absolute values.

Large value changes cause large gradient swings; extreme values can permanently deactivate ReLU units.

Mapping features to the [-1, +1] range with zero median improves training stability.

Smoothing inputs (e.g., applying log to distance features) enhances generalization.

Embedding sparse categorical features (e.g., street click sequences) remains effective.

Assessing feature importance via Top‑Bot analysis (comparing head vs. tail of ranked lists) helps identify discriminative features such as price.
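The normalization and smoothing guidelines above can be sketched as follows; the exact centering/scaling scheme here is one simple choice of ours, not the paper's specification:

```python
import math

def normalize(values):
    """Center on the median and scale so all values land in [-1, +1]."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    spread = max(abs(v - median) for v in values) or 1.0
    return [(v - median) / spread for v in values]

def smooth_distance(km):
    """Log smoothing compresses a heavy-tailed distance feature."""
    return math.log(1.0 + km)

raw_prices = [40, 80, 95, 100, 110, 120, 900]   # heavy right tail
scaled = normalize(raw_prices)                   # median -> 0, all in [-1, 1]
```

Keeping inputs centered and bounded avoids the large gradient swings and dead-ReLU failures described above.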

System Engineering

Airbnb’s production stack:

A Java server for query handling.

Spark for log collection.

TensorFlow for model training.

A Java NN scoring library for low‑latency online inference.

Data handling evolved from CSV (GBDT era) to Protobufs (TensorFlow era), achieving a 17× speedup and 90% GPU utilization.

Statistical features are aggregated into a non‑trainable embedding matrix for TensorFlow input.
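One way to read the embedding-matrix trick above: precomputed statistics are stored as rows of a frozen lookup table, so the model indexes them like an embedding whose weights are never updated. The table contents and index scheme below are hypothetical:

```python
import numpy as np

# Hypothetical frozen table: each row holds precomputed statistics for one
# category (e.g. a location grid cell); rows are never touched by gradients.
stats_table = np.array([
    [0.12, 3.4],   # cell 0: booking rate, avg nightly views
    [0.07, 1.1],   # cell 1
    [0.31, 8.9],   # cell 2
])

cell_ids = np.array([2, 0, 2])        # batch of categorical indices
batch_stats = stats_table[cell_ids]   # lookup == embedding with frozen weights
```

Treating the stats as a non-trainable embedding lets the same TensorFlow input pipeline serve both learned and precomputed features.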

Hyper‑parameter findings:

Dropout did not bring gains.

Random uniform initialization in the range [-1, 1] outperformed all‑zero initialization.

A batch size of 200 with the LazyAdam optimizer gave the best performance.

Reference: "Applying Deep Learning To Airbnb Search" (https://arxiv.org/abs/1810.09591v2).

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
