Design and Implementation of Meituan's Database Autonomy Service (DAS)
This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.
Introduction
DAS (Database Autonomy Service) is a Meituan‑built platform that provides performance analysis, fault diagnosis, and security management for databases, using big‑data techniques, machine learning, and expert knowledge to reduce manual operations and improve stability.
Current Situation and Problems
The number of database instances has grown faster than the operations team's capacity. Incident volume has risen, and mean time to repair (MTTR) stays long because diagnosis still depends on manual analysis by DBAs.
Key issues include an imbalance between instance scale and operations capability, high stability requirements, and missing critical metrics.
Solution Approach
The team proposes a short‑term and long‑term roadmap that first strengthens basic monitoring (slow queries, active sessions) and then incrementally adds advanced features such as full‑SQL aggregation, root‑cause analysis, and AI‑driven recommendations.
An evaluation system built on controllable input and output metrics is established to track product quality continuously.
Technical Architecture
Top‑Level Design
The architecture follows a four‑step evolution: platformization, self‑service, intelligence (expert rules + AI), and full automation.
Data Collection Layer
A hybrid approach uses pcap‑based packet capture as a transitional solution before kernel‑level agents are deployed, ensuring minimal impact on MySQL instances.
Agent design, impact testing, and performance benchmarks are documented.
Compute & Storage Layer
Design principles include in‑memory computation, raw data reporting, aggressive compression, and controlled memory usage.
Full‑SQL data is aggregated per minute using a composite key (RDS_IP + DBName + SQLTemplateID + Minute) and compressed through multi‑stage techniques.
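The per-minute aggregation described above can be sketched as follows. This is a minimal illustration, not Meituan's implementation: the `template_id` normalization, the `Stats` fields, and the event tuple layout are all assumptions made for the example. Only the composite key (RDS_IP + DBName + SQLTemplateID + Minute) comes from the article.

```python
from collections import defaultdict
from dataclasses import dataclass
import hashlib
import re


@dataclass
class Stats:
    # Hypothetical per-bucket counters; the article does not list the real fields.
    count: int = 0
    total_latency_ms: float = 0.0
    max_latency_ms: float = 0.0


def template_id(sql: str) -> str:
    """Assumed templating: replace literals with '?', normalize case and spaces, hash."""
    normalized = re.sub(r"'[^']*'", "?", sql)                 # string literals
    normalized = re.sub(r"\b\d+(\.\d+)?\b", "?", normalized)  # numeric literals
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.md5(normalized.encode()).hexdigest()[:16]


def aggregate(events):
    """Fold raw events into buckets keyed by (RDS_IP, DBName, SQLTemplateID, Minute)."""
    buckets = defaultdict(Stats)
    for rds_ip, db_name, sql, ts_seconds, latency_ms in events:
        minute = int(ts_seconds) // 60 * 60  # truncate the timestamp to the minute
        key = (rds_ip, db_name, template_id(sql), minute)
        stats = buckets[key]
        stats.count += 1
        stats.total_latency_ms += latency_ms
        stats.max_latency_ms = max(stats.max_latency_ms, latency_ms)
    return dict(buckets)
```

Because statements that differ only in literal values share one template ID, the number of stored rows depends on the number of distinct query shapes per minute rather than on raw query volume, which is what makes the later compression stages effective.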
Analysis & Decision Layer
Root‑cause analysis combines expert‑derived rules (GRAI methodology) with AI models for anomaly detection, feature extraction, and classification.
Four maturity stages guide the transition from rule‑only to AI‑dominant diagnosis.
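To make the rule-plus-model combination concrete, here is a small sketch in which a statistical detector flags anomalous metrics and expert rules map the anomalous set to a diagnosis. Everything in it is an assumption for illustration: the z-score detector stands in for whatever AI models DAS uses, and the metric names and `RULES` entries are hypothetical, not rules from the GRAI methodology.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return current != history[0]
    return abs(current - mean(history)) / sigma > threshold


# Hypothetical expert rules, most specific first: (required anomalous metrics, diagnosis).
RULES = [
    ({"slow_queries", "active_sessions"}, "slow SQL piling up sessions"),
    ({"active_sessions"}, "connection surge from the application"),
]


def diagnose(history_by_metric, current_by_metric):
    """Return (diagnosis or None, set of anomalous metric names)."""
    anomalous = {
        name
        for name, current in current_by_metric.items()
        if is_anomalous(history_by_metric.get(name, []), current)
    }
    for required, diagnosis in RULES:
        if required <= anomalous:  # every metric the rule needs is anomalous
            return diagnosis, anomalous
    return None, anomalous
```

In a rule-only stage the detector could be a fixed threshold; moving toward AI-dominant diagnosis would mean replacing first `is_anomalous` and then the rule table itself with learned models, while the overall detect-then-classify shape stays the same.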
Results
Metrics show improvements in alarm accuracy and recall rates; user cases demonstrate automated alert routing, root‑cause reporting, and slow‑query optimization suggestions.
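As a reference for how such alarm metrics are commonly computed, the sketch below assumes that "accuracy" means precision (the share of fired alarms that matched a real incident) and that recall is the share of real incidents that triggered an alarm. The function name and the use of incident IDs are hypothetical; the article does not give its exact definitions.

```python
def alarm_quality(fired, real_incidents):
    """Return (precision, recall) for a set of fired alarm IDs against real incident IDs."""
    fired, real_incidents = set(fired), set(real_incidents)
    true_positives = len(fired & real_incidents)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / len(real_incidents) if real_incidents else 0.0
    return precision, recall
```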
Future Outlook
Planned work focuses on enhancing compute/storage capacity, advancing database autonomy through SOP automation, and building a flexible incident replay system for continuous model refinement.
Author
Jin Long, Meituan Basic Technology Department – Database Platform R&D Group.