Airbnb’s Data Quality Improvement Plan: Organizational, Architectural, and Governance Practices
Airbnb’s 2019 Data Quality Improvement Plan reorganized its data‑engineering workforce, introduced a dedicated data‑engineer role, adopted a decentralized Minerva‑based architecture with Spark pipelines, instituted rigorous testing, governance, and certification processes, and established SLAs and monitoring to ensure timely, trustworthy, well‑documented data across the enterprise.
Airbnb has long emphasized a data‑driven engineering culture. The company built a strong data science and data engineering team, created a leading data infrastructure, and contributed to open‑source projects such as Apache Airflow and Apache Superset. As Airbnb grew from a fast‑moving startup to a mature enterprise with thousands of employees, it faced new challenges in data warehousing.
Background
Business demands for timeliness, quality, cost control, and compliance (privacy, security, GDPR, etc.) increased. To meet these expectations Airbnb focused on three key areas: data ownership, data architecture, and data governance.
Data Ownership
Previously, data ownership was scattered across product teams and maintained by software engineers and data scientists, leading to unclear responsibilities and slower issue resolution.
Data Architecture
Early data pipelines were built during the startup phase without clear quality standards or a unified strategy, resulting in bulky models and high maintenance overhead.
Data Governance
A centralized governance process was needed to enforce the newly defined strategies and standards.
Data Quality Improvement Plan
In early 2019 Airbnb launched a comprehensive Data Quality Improvement Plan to rebuild its data warehouse with new processes and technologies. The plan set five primary objectives:
Ensure clear ownership of all critical data.
Deliver important data within expected timeframes.
Build data pipelines using high standards and best practices.
Guarantee data trustworthiness and regularly validate accuracy.
Maintain complete, searchable documentation for data.
The implementation focused on four pillars: personnel organization, engineering architecture, best practices, and data‑warehouse management processes.
Personnel Organization
The data engineering team was reorganized and a large data‑engineering community was established. A dedicated “data engineer” role was introduced, requiring cross‑domain skills in data modeling, product development, and software engineering.
Teams were distributed across product groups (minimum three engineers per team) to keep engineers close to user needs, while a central data‑engineering team defined standards, built tools, and managed shared datasets.
Community groups were created, including a Data Engineering Leadership Group, a Data Engineering Forum (monthly all‑hands), a Data Architecture Working Group, and a Data Engineering Tools Working Group.
Recruitment processes were refined to expand the data‑engineering workforce and senior leadership was involved in major hiring decisions.
System Architecture and Best Practices
Airbnb established architectural principles and best‑practice guidelines for data modeling, operations, and technical standards.
Data Modeling
The centralized core_data model became costly to maintain as the business grew. Airbnb shifted to a decentralized model, leveraging the Minerva platform for cross‑model aggregation and metric computation.
Two key principles were defined:
Normalize tables and minimize dependencies; let Minerva handle cross‑model aggregation.
Group tables by subject area, and assign clear owners and teams for each domain.
These ideas echo the emerging “Data Mesh” concept.
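To make the first principle concrete, here is a minimal sketch of "normalize tables, then aggregate across them centrally." The table contents and the `nights_by_market` helper are purely illustrative and are not Minerva's actual API; the point is that each normalized table has a single owning team, while metric computation joins across them in one shared layer.

```python
# Two normalized tables, each owned by a different team.
bookings = [
    {"booking_id": 1, "listing_id": 10, "nights": 3},
    {"booking_id": 2, "listing_id": 11, "nights": 2},
]
listings = [
    {"listing_id": 10, "market": "Paris"},
    {"listing_id": 11, "market": "Tokyo"},
]

def nights_by_market(bookings, listings):
    """Join separately owned tables and compute a cross-model metric."""
    market_of = {l["listing_id"]: l["market"] for l in listings}
    totals = {}
    for b in bookings:
        market = market_of[b["listing_id"]]
        totals[market] = totals.get(market, 0) + b["nights"]
    return totals

print(nights_by_market(bookings, listings))  # {'Paris': 3, 'Tokyo': 2}
```

Because the tables stay normalized and loosely coupled, each domain team can evolve its own table without coordinating a monolithic wide model.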
Technical Standards (Data Technology)
Airbnb migrated from HiveQL‑based pipelines to Spark with Scala APIs, wrapping Spark to simplify read/write patterns and enable integration testing.
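The value of wrapping read/write is that pipeline logic becomes a pure function that tests can exercise without a live cluster. The sketch below shows that pattern in plain Python for brevity; `run_pipeline` and the in‑memory `store` are hypothetical stand‑ins, and in Airbnb's production setup the injected functions would wrap Spark's Scala read/write APIs.

```python
from typing import Callable, Iterable

def run_pipeline(read: Callable[[str], Iterable[dict]],
                 write: Callable[[str, list], None],
                 source: str, sink: str) -> None:
    """Transform logic stays pure; all I/O is injected at the edges."""
    rows = read(source)
    cleaned = [r for r in rows if r.get("id") is not None]  # example transform
    write(sink, cleaned)

# In production, read/write would call the Spark wrapper; in tests they
# operate on in-memory data, enabling fast integration testing.
store = {"src": [{"id": 1}, {"id": None}, {"id": 2}]}
run_pipeline(lambda t: store[t],
             lambda t, rows: store.__setitem__(t, rows),
             "src", "out")
print(store["out"])  # [{'id': 1}, {'id': 2}]
```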
Testing
Integration tests are now required for all pipelines and run as part of continuous integration (CI).
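An integration‑style pipeline test in this spirit runs a whole transform on fixture data and asserts on the output. The transform and fixture below are illustrative, not Airbnb's actual test suite:

```python
def dedupe_latest(rows):
    """Keep the most recent row per id (rows assumed sorted by updated_at)."""
    latest = {}
    for row in rows:
        latest[row["id"]] = row
    return list(latest.values())

def test_dedupe_latest():
    # Fixture data stands in for a real upstream table.
    fixture = [
        {"id": 1, "updated_at": "2019-01-01"},
        {"id": 1, "updated_at": "2019-02-01"},
        {"id": 2, "updated_at": "2019-01-15"},
    ]
    out = dedupe_latest(fixture)
    assert len(out) == 2
    assert {r["updated_at"] for r in out} == {"2019-02-01", "2019-01-15"}

test_dedupe_latest()
```

Run in CI, a test like this catches logic regressions before a pipeline ever touches production data.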
Data Quality Checks
New tools were built for data quality validation and anomaly detection, enforcing rules such as unique IDs and logical date constraints.
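The two rule types mentioned, unique keys and logical date constraints, can be sketched as simple row‑level checks. The function names and sample rows here are hypothetical, not the API of Airbnb's internal tooling:

```python
from datetime import date

def check_unique(rows, key):
    """Return key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r[key]
        (dupes if v in seen else seen).add(v)
    return sorted(dupes)

def check_date_order(rows, start="checkin", end="checkout"):
    """Return rows whose end date does not come after the start date."""
    return [r for r in rows if r[end] <= r[start]]

rows = [
    {"id": 1, "checkin": date(2019, 5, 1), "checkout": date(2019, 5, 4)},
    {"id": 2, "checkin": date(2019, 5, 3), "checkout": date(2019, 5, 2)},
    {"id": 2, "checkin": date(2019, 6, 1), "checkout": date(2019, 6, 5)},
]
print(check_unique(rows, "id"))   # [2]
print(check_date_order(rows))     # the row whose checkout precedes checkin
```

Checks like these run after a pipeline lands data, failing loudly so bad data never silently propagates downstream.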
Operations
Critical data services now have explicit Service Level Agreements (SLAs) and on‑call rotation with monitoring/alerting (e.g., PagerDuty) for rapid incident response.
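At its simplest, an SLA check compares each dataset's actual landing time against its promised deadline and flags breaches for alerting. The SLA table, dataset names, and times below are illustrative assumptions; a real system would feed the breach list into a pager such as PagerDuty:

```python
from datetime import datetime

# Hypothetical daily landing deadlines (UTC) for critical tables.
SLAS = {"core.bookings": "06:00", "core.listings": "08:00"}

def sla_breaches(landing_times: dict) -> list:
    """Return datasets that missed their deadline or never landed."""
    breaches = []
    for table, deadline in SLAS.items():
        landed = landing_times.get(table)
        if landed is None or landed.strftime("%H:%M") > deadline:
            breaches.append(table)
    return sorted(breaches)

landed = {
    "core.bookings": datetime(2019, 7, 1, 5, 42),  # on time
    "core.listings": datetime(2019, 7, 1, 9, 15),  # late -> would page on-call
}
print(sla_breaches(landed))  # ['core.listings']
```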
Data Governance
A Midas certification workflow was introduced to ensure that design specifications (metrics, dimensions, table schemas, pipeline diagrams) are approved before code and data are released. Certified assets are highlighted to users and receive priority recommendation.
Accountability
A bug‑reporting and weekly review process was established, and pipeline SLAs are tied to team OKRs.
Conclusion
The data quality improvement initiative has re‑introduced the data‑engineer role, defined hiring standards, built a vibrant engineering community, standardized architecture and tooling, launched the Midas certification process, and strengthened accountability and operational practices. While progress is ongoing, Airbnb continues to accelerate infrastructure development, design next‑generation data‑engineering tools, and plans to transition from daily batch processing to real‑time pipelines.
Airbnb Technology Team
Official account of the Airbnb Technology Team, sharing Airbnb's tech innovations and real-world implementations, building a world where home is everywhere through technology.