
Design and Implementation of Meituan's Database Autonomy Service (DAS)

This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.

DataFunSummit

Introduction

DAS (Database Autonomy Service) is a Meituan‑built platform that provides performance analysis, fault diagnosis, and security management for databases, using big‑data techniques, machine learning, and expert knowledge to reduce manual operations and improve stability.

Current Situation and Problems

The number of database instances has grown faster than operational capacity. Incident volume has risen, and MTTR (mean time to recovery) remains long because diagnosis still relies on manual analysis by DBAs.

Key issues include an imbalance between instance scale and operations capability, high stability requirements, and missing critical metrics.

Solution Approach

The team proposes a short‑term and long‑term roadmap that first strengthens basic monitoring (slow queries, active sessions) and then incrementally adds advanced features such as full‑SQL aggregation, root‑cause analysis, and AI‑driven recommendations.

An evaluation system built on controllable input metrics and measurable output metrics is established to track product quality continuously.

Technical Architecture

Top‑Level Design

The architecture follows a four‑step evolution: platformization, self‑service, intelligence (expert rules + AI), and full automation.

Data Collection Layer

A hybrid approach uses pcap‑based packet capture as a transitional solution before kernel‑level agents are deployed, ensuring minimal impact on MySQL instances.

Agent design, impact testing, and performance benchmarks are documented.
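
As an illustration of what a packet-capture agent has to do once it holds a raw TCP payload, the sketch below splits the payload into MySQL protocol packets and pulls out the text of COM_QUERY commands. It is not the DAS agent: the function names are hypothetical, and it assumes the classic client protocol with no TLS, no compression, no query attributes, and no TCP reassembly. The only protocol facts relied on are the 4-byte packet header (3-byte little-endian length plus 1-byte sequence id) and the COM_QUERY command byte 0x03.

```python
from typing import Iterator, List, Tuple

COM_QUERY = 0x03  # MySQL protocol command byte for a text query


def iter_mysql_packets(payload: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_id, body) for each complete MySQL packet in a TCP payload."""
    offset = 0
    while offset + 4 <= len(payload):
        # Header: 3-byte little-endian body length, then a 1-byte sequence id.
        length = int.from_bytes(payload[offset:offset + 3], "little")
        seq = payload[offset + 3]
        body = payload[offset + 4:offset + 4 + length]
        if len(body) < length:
            break  # truncated packet; a real agent would buffer and reassemble
        yield seq, body
        offset += 4 + length


def extract_queries(payload: bytes) -> List[str]:
    """Return the SQL text of every COM_QUERY packet found in the payload."""
    queries = []
    for seq, body in iter_mysql_packets(payload):
        # A client command starts a new exchange, so its sequence id is 0.
        if seq == 0 and body and body[0] == COM_QUERY:
            queries.append(body[1:].decode("utf-8", errors="replace"))
    return queries


def build_query_packet(sql: str, seq: int = 0) -> bytes:
    """Helper for demos and tests: encode a SQL string as a COM_QUERY packet."""
    body = bytes([COM_QUERY]) + sql.encode("utf-8")
    return len(body).to_bytes(3, "little") + bytes([seq]) + body
```

In a real deployment the payload bytes would come from a capture library attached to the MySQL port. Parsing outside the database process is what keeps the impact on the instance itself low.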

Compute & Storage Layer

Design principles include in‑memory computation, raw data reporting, aggressive compression, and controlled memory usage.

Full‑SQL data is aggregated per minute using a composite key (RDS_IP + DBName + SQLTemplateID + Minute) and compressed through multi‑stage techniques.
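
The per-minute aggregation can be sketched as follows. The composite key (RDS_IP, DBName, SQLTemplateID, Minute) comes from the description above; everything else is an assumption made for illustration, including the regex-based template normalization, the use of a truncated MD5 digest as the template ID, and the three statistics kept per bucket.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")    # standalone numeric literals


def sql_template(sql: str) -> str:
    """Replace literals with '?' and normalize whitespace and case (assumed scheme)."""
    s = _STRING.sub("?", sql)
    s = _NUMBER.sub("?", s)
    return re.sub(r"\s+", " ", s).strip().lower()


def template_id(sql: str) -> str:
    """Hypothetical SQLTemplateID: a short hash of the normalized template."""
    return hashlib.md5(sql_template(sql).encode("utf-8")).hexdigest()[:16]


@dataclass
class Stats:
    count: int = 0
    total_latency_ms: float = 0.0
    max_latency_ms: float = 0.0


Key = Tuple[str, str, str, int]  # (RDS_IP, DBName, SQLTemplateID, Minute)


class MinuteAggregator:
    """Keeps one in-memory Stats bucket per composite key."""

    def __init__(self) -> None:
        self.buckets: Dict[Key, Stats] = defaultdict(Stats)

    def add(self, rds_ip: str, db_name: str, sql: str,
            ts_seconds: float, latency_ms: float) -> Key:
        minute = int(ts_seconds // 60)  # minutes since the Unix epoch
        key = (rds_ip, db_name, template_id(sql), minute)
        stats = self.buckets[key]
        stats.count += 1
        stats.total_latency_ms += latency_ms
        stats.max_latency_ms = max(stats.max_latency_ms, latency_ms)
        return key
```

Collapsing every statement into its template bucket means storage grows with the number of distinct templates per minute rather than with raw query volume, which is what makes later compression stages effective.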

Analysis & Decision Layer

Root‑cause analysis combines expert‑derived rules, distilled from incident reviews using the GRAI (Goal, Result, Analysis, Insight) methodology, with AI models for anomaly detection, feature extraction, and classification.
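
A minimal sketch of how rules and anomaly detection might be combined: a statistical detector flags which metrics are abnormal, and expert rules map sets of abnormal metrics to candidate root causes. The z-score detector stands in for the AI models, which the source does not specify. The rule table and the metric names other than slow queries and active sessions are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore_anomaly(history: List[float], current: float,
                   threshold: float = 3.0) -> bool:
    """Flag `current` if it lies more than `threshold` std devs from the history mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change counts as anomalous
    return abs(current - mu) / sigma > threshold


# Hypothetical expert rules: a cause is reported when all of its metrics are anomalous.
RULES = [
    ("slow_query_surge", {"slow_queries", "active_sessions"}),
    ("lock_contention", {"lock_waits", "active_sessions"}),
    ("traffic_spike", {"qps"}),
]


def diagnose(history: Dict[str, List[float]],
             current: Dict[str, float]) -> List[str]:
    """Return candidate root causes whose required metrics are all anomalous."""
    anomalous = {
        name for name, value in current.items()
        if zscore_anomaly(history.get(name, []), value)
    }
    return [cause for cause, required in RULES if required <= anomalous]
```

Keeping detection and rules separate lets the detector be replaced by a learned model later without rewriting the expert knowledge encoded in the rule table.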

Four maturity stages guide the transition from rule‑only to AI‑dominant diagnosis.

Results

Metrics show improvements in alarm accuracy and recall rates; user cases demonstrate automated alert routing, root‑cause reporting, and slow‑query optimization suggestions.
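
For reference, alarm quality of this kind is commonly measured as below. The sketch assumes that "accuracy" here means precision (the share of raised alarms that correspond to real incidents) and that recall is the share of real incidents that triggered an alarm. The incident identifiers are hypothetical.

```python
from typing import Set, Tuple


def alarm_precision_recall(alerted: Set[str],
                           true_incidents: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of alarms against labeled incidents."""
    hits = len(alerted & true_incidents)
    precision = hits / len(alerted) if alerted else 0.0
    recall = hits / len(true_incidents) if true_incidents else 0.0
    return precision, recall


# Example: 4 alarms fired, 3 real incidents occurred, 2 of them were caught.
precision, recall = alarm_precision_recall({"a", "b", "c", "d"}, {"b", "c", "e"})
```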

Future Outlook

Planned work focuses on enhancing compute/storage capacity, advancing database autonomy through SOP automation, and building a flexible incident replay system for continuous model refinement.

Author

Jin Long, Meituan Basic Technology Department – Database Platform R&D Group.
