
Airbnb’s Task‑Oriented Dialogue System for Mutual Cancellation: Architecture, Data Collection, Modeling, and Deployment

Airbnb’s ATIS task‑oriented dialogue system for Mutual Cancellation combines hierarchical domain classification, Q&A‑style intent annotation, large‑scale RoBERTa pre‑training with multilingual fine‑tuning, multi‑turn context handling, GPU‑accelerated inference, and contextual‑bandit reinforcement learning to deliver a scalable, efficient customer‑support solution.

Airbnb Technology Team

Customer support (CS) is a critical part of the Airbnb guest experience. To improve CS efficiency, Airbnb invested heavily in natural language processing (NLP), machine learning (ML) and artificial intelligence (AI) to build an automated, task‑oriented dialogue system for the newly launched “Mutual Cancellation” feature.

The article uses the Mutual Cancellation use case to illustrate the end‑to‑end AI pipeline: converting a business problem into an AI problem, collecting and labeling training data, designing models, and deploying them in production. It also discusses technical challenges and innovative solutions at each step.

System Architecture: The platform, called ATIS (Automatic Travel Information System), is a task‑oriented dialogue system that first classifies a user’s message into a domain (e.g., re‑booking, cancellation, article recommendation). If the predicted domain is Mutual Cancellation, a second‑layer stack is invoked: an intent‑understanding model trained on Q&A data and an “expected refund rate” model trained on historical cancellation records.

The multi‑layer design is both scalable (new domains can be added without affecting existing ones) and effective (the top‑level domain classifier uses high‑quality manually labeled data, while domain‑specific models leverage abundant but noisy historical data).
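To make the routing concrete, here is a minimal sketch of the two‑layer design described above. The stub classifiers, label names, and scores are hypothetical stand‑ins; in production each stage would be a trained Transformer model.

```python
# Sketch of two-stage routing: a top-level domain classifier gates entry
# into domain-specific models, so new domains can be added without touching
# existing ones. All labels/scores below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    score: float

def domain_classifier(message: str) -> Prediction:
    # Stub top-level classifier: route cancellation-like messages.
    if "cancel" in message.lower():
        return Prediction("mutual_cancellation", 0.93)
    return Prediction("article_recommendation", 0.55)

def mutual_cancellation_intent(message: str) -> Prediction:
    # Stub second-layer intent model, invoked only for its own domain.
    return Prediction("guest_initiated_cancel", 0.88)

def route(message: str) -> dict:
    domain = domain_classifier(message)
    result = {"domain": domain.label, "domain_score": domain.score}
    if domain.label == "mutual_cancellation":
        intent = mutual_cancellation_intent(message)
        result["intent"] = intent.label
        result["intent_score"] = intent.score
    return result

print(route("I want to cancel my reservation"))
```

Because the second‑layer model is only invoked behind the domain gate, its training data and error modes stay isolated from other domains.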

Training Data Collection & Annotation: Airbnb built a hierarchical intent tree but found it too rigid for the complex, unstructured complaints typical of the sharing economy. Instead, they switched to Q&A‑style annotation, where annotators answer a set of binary or single‑choice questions about each user utterance. This approach simplifies label management and lets each question be modeled as a single‑choice (binary) classification task.

Examples of annotated Q&A pairs are provided (e.g., “Who initiated the cancellation request?” with multiple possible answers). The single‑choice setting allows mixing different versions of questions during training, improving data efficiency.
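One way to see why the single‑choice setting improves data efficiency: each (question, answer‑option) pair can be flattened into an independent binary example, so differently worded question versions can be mixed in one training set. The question text, options, and `[SEP]` formatting below are illustrative, not Airbnb’s actual annotation schema.

```python
# Sketch: expanding one Q&A-style annotation into single-choice binary
# examples, one per (question, answer-option) pair.
def expand_qa(utterance, question, options, chosen):
    # Each option becomes its own binary example; exactly one is positive.
    return [
        {"text": f"{question} [SEP] {opt} [SEP] {utterance}",
         "label": 1 if opt == chosen else 0}
        for opt in options
    ]

examples = expand_qa(
    "The host asked me to cancel because of a plumbing issue.",
    "Who initiated the cancellation request?",
    ["guest", "host", "unclear"],
    chosen="host",
)
for ex in examples:
    print(ex["label"], ex["text"][:60])
```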

Model Design & Pre‑training: An autoencoding Transformer architecture (RoBERTa‑large) was selected after benchmarking several intent‑classification models. Transfer learning is applied in two ways: (1) domain‑specific masked language model (MLM) pre‑training on a 1.08 GB corpus of Airbnb dialogs, help‑center articles and listings (14 languages, 56 % English); (2) cross‑domain task fine‑tuning on public multilingual datasets. Both methods yielded significant performance gains.
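The MLM objective behind that pre‑training can be sketched in a few lines. This follows the standard BERT/RoBERTa masking recipe (15 % of tokens selected; of those, 80 % become `[MASK]`, 10 % a random token, 10 % left unchanged); the whitespace tokenization and tiny vocabulary are simplifications for illustration.

```python
# Sketch of MLM input corruption for domain pre-training. The model is
# trained to reconstruct the tokens at positions where labels is not None.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # loss is computed on this position
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")       # 80%: mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)            # 10%: keep original
        else:
            inputs.append(tok)
            labels.append(None)  # no loss on unmasked positions
    return inputs, labels
```

In practice this corruption runs over billions of subword tokens from the in‑domain corpus before task fine‑tuning.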

Multi‑language support is achieved with XLM‑RoBERTa, trained on English labeled data machine‑translated into the 13 other supported languages (French, Spanish, German, Portuguese, etc.). Experiments show the multilingual model outperforms a monolingual English‑only baseline.
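This "translate‑train" data construction is simple to sketch: each English example is translated into every other supported language while the label is carried over unchanged. The `translate()` stub below is a hypothetical placeholder for a real machine‑translation model or service.

```python
# Sketch of translate-train multilingual data expansion: English labeled
# examples are machine-translated; labels transfer unchanged.
def translate(text, lang):
    # Hypothetical MT stub; a real pipeline would call an MT system here.
    return f"[{lang}] {text}"

def expand_multilingual(examples, langs):
    out = list(examples)  # keep the English originals
    for ex in examples:
        for lang in langs:
            out.append({"text": translate(ex["text"], lang),
                        "label": ex["label"]})
    return out

data = expand_multilingual(
    [{"text": "The guest wants to cancel.", "label": "cancel_request"}],
    ["fr", "es", "de"],
)
print(len(data))  # one English original plus three translations
```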

Multi‑turn Intent Prediction: To capture context, the system concatenates the current user message with the previous N turns and feeds the combined sequence to the Transformer. Two offline strategies were explored: (a) adding the last N messages as extra features; (b) computing multi‑turn intent scores and feeding the highest score downstream. Because self‑attention cost in the Transformer scales as O(n²) with sequence length, historical conversation embeddings are pre‑computed offline and cached for online lookup.
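The caching idea can be sketched as follows: embeddings for past turns are computed once and reused, so at request time only the new message pays the encoder cost. The hash‑based `encode()` below is a toy, deterministic stand‑in for a Transformer encoder, and the mean‑pooling combiner is one of several plausible choices.

```python
# Sketch of per-turn embedding caching for multi-turn context. Past-turn
# vectors are looked up; only the current message is freshly encoded.
import hashlib

CACHE = {}

def encode(message):
    # Toy deterministic "embedding": 4 hash bytes scaled to [0, 1).
    digest = hashlib.sha256(message.encode()).digest()[:4]
    return [b / 255 for b in digest]

def cached_encode(message):
    if message not in CACHE:
        CACHE[message] = encode(message)  # computed once, then reused
    return CACHE[message]

def context_vector(history, current):
    # Mean-pool cached history embeddings with the fresh current embedding.
    vecs = [cached_encode(m) for m in history] + [encode(current)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```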

Online Service & GPU Acceleration: Deploying large models (up to 2.82 B parameters) in production required latency reduction. Airbnb enabled GPU inference and used knowledge distillation (teacher‑student models) to lower compute cost. Benchmarks across instance types (g4dn.xlarge, p3.2xlarge, r5.2xlarge) showed up to a 3× speed‑up on GPUs and 5× with batch processing. After the switch to GPU, 95 % of requests completed in ~60 ms for average 100‑token inputs.
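The distillation objective mentioned above can be sketched with the classic soft‑target loss: the student is trained to match the teacher’s temperature‑softened output distribution, with the loss scaled by T² as in Hinton et al.’s formulation. The logits and temperature below are illustrative numbers, not values from Airbnb’s models.

```python
# Sketch of the teacher-student (knowledge distillation) soft-target loss:
# KL divergence between temperature-softened teacher and student outputs.
import math

def softmax(logits, T=1.0):
    exps = [math.exp(x / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2 so
    # gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss = distill_loss([4.0, 1.0, 0.5], [3.5, 1.2, 0.4])
```

A smaller student trained this way keeps most of the teacher’s accuracy at a fraction of the inference cost, which is what makes the latency targets reachable.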

Contextual Bandit Reinforcement Learning: To continuously improve the model with limited traffic, Airbnb applied a contextual‑bandit framework. Three candidate actions (a0, a1, a2) correspond to different user flows in the Mutual Cancellation process. Rewards are defined by the entry rate and acceptance rate of the flow. An epsilon‑greedy exploration strategy and a self‑normalized inverse propensity scoring (IPS) estimator are used to evaluate and update policies.
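Both ingredients are compact enough to sketch: epsilon‑greedy action selection over the three flows, and a self‑normalized IPS (SNIPS) estimate of a target policy’s reward from logged (action, propensity, reward) triples. The Q‑values, logged propensities, and rewards below are toy numbers for illustration.

```python
# Sketch of the bandit loop: epsilon-greedy exploration plus an offline
# SNIPS estimator for evaluating a candidate policy on logged data.
import random

ACTIONS = ["a0", "a1", "a2"]  # the three Mutual Cancellation flows

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore uniformly
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

def snips(logs, target_probs):
    # logs: list of (action_index, logged_propensity, reward).
    # Importance weights re-weight logged rewards toward the target policy;
    # normalizing by the weight sum reduces variance vs. plain IPS.
    weights = [target_probs[a] / p for a, p, _ in logs]
    num = sum(w * r for w, (_, _, r) in zip(weights, logs))
    return num / sum(weights)

rng = random.Random(0)
q = [0.2, 0.5, 0.3]                      # toy per-action value estimates
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
logs = [(1, 0.8, 1.0), (0, 0.1, 0.0), (2, 0.1, 1.0)]
est = snips(logs, target_probs=[0.2, 0.6, 0.2])
```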

Conclusion : The case study demonstrates how Airbnb combined single‑choice Q&A modeling, large‑scale pre‑training, multilingual training, multi‑turn context tracking, GPU‑accelerated inference, and contextual‑bandit reinforcement learning to build a robust, scalable AI‑driven customer support system that reduces manual effort and improves guest‑host satisfaction.

Machine Learning · AI · Multilingual · Customer Support · GPU Deployment · Task-Oriented Dialogue
Written by

Airbnb Technology Team

Official account of the Airbnb Technology Team, sharing Airbnb's tech innovations and real-world implementations, building a world where home is everywhere through technology.
