Applying Deep Learning to Airbnb Search: Model Evolution, Feature Engineering, and System Insights
This article reviews the Airbnb search ranking paper, detailing offline and online performance gains, the progression from SimpleNN to LambdaRankNN, GBDT/FM NN, and Deep NN models, failed embedding attempts, extensive feature engineering practices, and the production system architecture that enabled large‑scale deep learning deployment.
Results Overview
Offline and online performance are evaluated with NDCG (Normalized Discounted Cumulative Gain) as the primary metric.
Model Evolution
SimpleNN
One hidden layer with 32 ReLU units.
Features identical to those used by GBDT.
Training objective matches GBDT: minimize MSE where booked = 1, not booked = 0.
Conclusion: SimpleNN yields only modest improvement over GBDT and validates NN feasibility online.
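The SimpleNN setup above (one hidden layer of 32 ReLU units, MSE against booked = 1 / not booked = 0) can be sketched in a few lines. This is an illustrative plain-Python forward pass, not Airbnb's implementation; `init_layer`, `forward`, and `mse` are hypothetical names.

```python
import random

random.seed(0)

HIDDEN = 32  # one hidden layer of 32 ReLU units, as in the paper

def init_layer(n_in, n_out):
    # small random weights plus zero biases
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, w1, b1, w2, b2):
    # hidden layer with ReLU activation
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    # single linear output: the predicted booking score
    return sum(wi * hi for wi, hi in zip(w2[0], h)) + b2[0]

def mse(preds, labels):
    # regression objective: booked = 1, not booked = 0
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
```

The input features would be the same ones fed to the GBDT model; only the scoring function changes.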
LambdaRankNN
Switched to a pairwise loss: cross‑entropy over (booked, not‑booked) listing pairs.
Weighted each pair's loss by the |ΔNDCG| of swapping the two listings, folding the listwise ranking objective into the pairwise loss.
Conclusion: Small offline NDCG gain, large online improvement.
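The pairwise loss weighted by |ΔNDCG| can be sketched as follows. This is a minimal illustration of the standard LambdaRank weighting, with hypothetical function names; the paper does not publish its exact training code.

```python
import math

def dcg_gain(relevance, position):
    # standard DCG term: gain discounted by log2(position + 1), positions 1-based
    return relevance / math.log2(position + 1)

def delta_ndcg(rel_i, rel_j, pos_i, pos_j, ideal_dcg):
    # |change in NDCG| if the items at pos_i and pos_j swapped places
    before = dcg_gain(rel_i, pos_i) + dcg_gain(rel_j, pos_j)
    after = dcg_gain(rel_i, pos_j) + dcg_gain(rel_j, pos_i)
    return abs(after - before) / ideal_dcg

def lambdarank_pair_loss(score_booked, score_other, weight):
    # pairwise cross-entropy on the score difference, scaled by |delta NDCG|,
    # so pairs that would move the booked listing further matter more
    return weight * math.log(1.0 + math.exp(-(score_booked - score_other)))
```

Intuitively, a booked listing sitting at position 2 instead of 1 costs more NDCG than one at 10 instead of 9, so its pairs receive a larger weight.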
GBDT/FM NN
Use the leaf index of each GBDT tree as a categorical feature for the NN.
Feed the FM's predicted booking probability directly into the NN as a feature.
Single hidden layer with ReLU activation.
Conclusion: GBDT, FM, and SimpleNN have similar offline performance, but their ranking results differ; fusion of the three models yields higher online revenue.
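Turning GBDT leaf indices into categorical NN features works as below. The nested-dict tree layout and function names are illustrative placeholders, not Airbnb's actual format.

```python
# Each GBDT tree is represented as nested dicts: internal nodes test one
# feature against a threshold; the id of the leaf a sample lands in becomes
# a categorical feature (later embedded) for the NN.

def leaf_index(tree, x):
    # walk from the root to a leaf, returning the leaf's integer id
    node = tree
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

def gbdt_leaf_features(trees, x):
    # one categorical feature per tree in the ensemble
    return [leaf_index(t, x) for t in trees]
```

The NN thus inherits the GBDT's learned partitioning of the feature space without retraining the trees.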
Deep NN
195 input features (categorical features are embedded).
Two hidden layers: first with 127 units, second with 83 units, both using ReLU.
Significant gains appeared after scaling the training data tenfold.
Conclusion: Both offline and online see large gains; training data of 1 billion examples eliminates the offline‑online performance gap, highlighting the importance of data volume for deep learning.
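The Deep NN architecture reported above (195 inputs, hidden layers of 127 and 83 ReLU units, one score output) can be sketched as a generic multilayer perceptron. Again a toy plain-Python sketch with assumed names; the production model was built in TensorFlow.

```python
import random

random.seed(1)

# Layer widths from the paper: 195 inputs -> 127 ReLU -> 83 ReLU -> 1 score
LAYERS = [195, 127, 83, 1]

def init_mlp(sizes):
    # one (weights, biases) pair per consecutive layer pair
    return [([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def mlp_score(params, x):
    h = x
    for i, (w, b) in enumerate(params):
        z = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]
        # ReLU on hidden layers, linear activation on the final scoring layer
        h = z if i == len(params) - 1 else [max(0.0, v) for v in z]
    return h[0]
```

Categorical inputs would first pass through embedding lookups before joining the 195-dimensional input vector.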
Failed Attempts
Listing‑ID embedding attempts suffered severe over‑fitting: most listings receive too few bookings to learn a reliable per‑listing vector, and a listing's capacity caps how often it can be booked at all, so the embeddings hurt online performance.
A rescue attempt used multi‑task learning: predict booking and long‑view duration with shared hidden layers, hoping the abundant view‑time signal would regularize the sparse booking signal.
Conclusion: Long views improved substantially online, but bookings did not; manual inspection suggested the model was promoting high‑price, unusual, or long‑description listings that users browse at length without booking.
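The multi-task rescue attempt above can be sketched as one shared hidden representation feeding two output heads. The weights and function names here are toy placeholders, not learned values.

```python
# One shared ReLU hidden layer feeds two heads, so gradients from the
# plentiful long-view objective shape the representation used by the
# sparse booking objective.

def shared_hidden(x, w_shared):
    # shared ReLU representation used by both tasks
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w_shared]

def two_head_forward(x, w_shared, w_book, w_view):
    h = shared_hidden(x, w_shared)
    booking_score = sum(wi * hi for wi, hi in zip(w_book, h))
    long_view_score = sum(wi * hi for wi, hi in zip(w_view, h))
    return booking_score, long_view_score
```

In training, the two heads would each get their own loss, summed (possibly with task weights) before backpropagation through the shared layer.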
Feature Engineering
Key observations:
GBDT is insensitive to feature scaling, while deep learning is highly sensitive to absolute values.
Large value changes cause large gradient swings; extreme values can permanently deactivate ReLU units.
Mapping features to the [-1, +1] range with zero median improves training stability.
Smoothing inputs (e.g., applying log to distance features) enhances generalization.
Embedding sparse categorical features (e.g., street click sequences) remains effective.
Assessing feature importance via Top‑Bot analysis (comparing head vs. tail of ranked lists) helps identify discriminative features such as price.
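The normalization observations above can be sketched as two small transforms: a shift-and-scale that centers the median at 0 and keeps values in [-1, 1], and a log transform for power-law features such as distance. These are simplified stand-ins for the paper's transforms; the function names are assumptions.

```python
import math

def normalize(values):
    # shift by the median so the typical value sits at 0, then scale by the
    # largest deviation so everything lands inside [-1, 1]
    s = sorted(values)
    median = s[len(s) // 2]
    scale = max(abs(v - median) for v in values) or 1.0
    return [(v - median) / scale for v in values]

def log_smooth(value):
    # compress heavy-tailed features (e.g., distance) before feeding the NN,
    # so rare extreme values cannot swamp the gradients
    return math.log(1.0 + value)
```

Bounding the inputs this way keeps gradient magnitudes stable and avoids the permanently-dead-ReLU failure mode mentioned above.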
System Engineering
Airbnb’s production stack:
A Java server for query handling.
Spark for log collection.
TensorFlow for model training.
A custom Java neural‑network library for low‑latency online inference.
Data handling evolved from CSV (GBDT era) to Protobufs (TensorFlow era), achieving a 17× speedup and 90% GPU utilization.
Statistical features are aggregated into a non‑trainable embedding matrix for TensorFlow input.
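The idea of packing precomputed statistics into a frozen lookup can be sketched as a plain dictionary acting as a non-trainable embedding matrix. The keys, values, and names below are illustrative, not Airbnb's schema.

```python
# Precomputed statistical aggregates (e.g., per-listing stats) packed into a
# lookup table that behaves like a frozen, non-trainable embedding matrix:
# gradients never update these rows.

STATS_TABLE = {
    101: [0.12, -0.40, 0.88],   # listing id -> precomputed statistics vector
    102: [0.05, 0.31, -0.20],
}
DEFAULT = [0.0, 0.0, 0.0]       # fallback row for ids absent from the table

def stats_embedding(listing_id):
    # pure lookup; in TensorFlow this would be an embedding matrix created
    # with trainable=False
    return STATS_TABLE.get(listing_id, DEFAULT)
```

This keeps periodically refreshed statistics out of the training graph while still letting the model consume them as dense inputs.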
Hyper‑parameter findings:
Dropout did not bring gains.
Random initialization outperformed the initial all‑zeros setup: Xavier initialization for network weights, uniform random values in [-1, 1] for embeddings.
Batch size of 200 with lazy Adam optimizer gave the best performance.
Reference: "Applying Deep Learning To Airbnb Search" (https://arxiv.org/abs/1810.09591v2).
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.