Meituan Database Fault Detection, Diagnosis, and Kernel Observability Practices
The article explains the design of Meituan’s MySQL autonomous platform, detailing its four‑layer architecture, statistical dynamic‑threshold anomaly detection, model selection based on time‑series distribution, kernel‑level root‑cause analysis for replication lag, large‑transaction diagnostics, and crash investigation using core‑dump and signal analysis.
MySQL failures and slow SQL affect database stability and business continuity. Meituan’s autonomous platform iteratively adds end‑to‑end capabilities for anomaly detection, diagnosis, and recovery across large‑scale clusters. The article shares the platform’s evolution, practical lessons, and technical details.
1. Background & Goal
DBAs and developers must quickly discover, analyze, and resolve faults and performance issues. The platform aims to automate these steps, providing observable metrics, precise root‑cause analysis, and automated or self‑service remediation.
2. Platform Evolution Strategy
The platform consists of four layers:
Database Collection Layer : An rds‑agent runs on each instance to collect key metrics and SQL text.
Data Computation & Storage Layer : Metrics flow through Kafka, Flink & Spark for parsing, templating, and aggregation, then are stored in Elasticsearch, Meituan’s Blade, Hive, etc.
Platform Function Layer : Serves both DBA and developer users. Features are built in two parallel tracks—observability, anomaly detection, fault analysis, and handling for DBAs; SQL optimization, risk‑SQL detection, and governance for developers.
Interface & Presentation Layer : Core functions are exposed via a portal and OpenAPI.
3. Anomaly Detection
Early detection prevents small glitches from becoming full‑blown failures. The platform monitors OS and MySQL metrics such as seconds_behind_master, slow_queries, Threads_running, and business‑level indicators like request error count or average SQL latency.
Two detection strategies exist:
Static thresholds —simple but cause many false alarms across diverse workloads.
Dynamic thresholds —model each workload’s historical time‑series using statistical methods.
3.1 Data Distribution & Algorithm Choice
Time‑series can be left‑skewed, right‑skewed, or symmetric. The platform selects algorithms accordingly:
Low‑skew, high‑symmetry → Median Absolute Deviation (MAD)
Moderate skew → Boxplot
High skew → Extreme Value Theory (EVT)
3‑Sigma is avoided because its low tolerance to outliers makes it less robust than MAD for symmetric distributions.
3.2 Model Selection
After analyzing distribution, the system checks for drift, periodicity, and stationarity. If drift is detected, the series is split at the drift point t. Periodic data is bucketed by the period T and modeled per bucket; non‑periodic data is modeled as a whole. Finally, the chosen algorithm (MAD, Boxplot, or EVT) is applied based on calculated skewness.
Trained models are stored offline and loaded at runtime; Flink evaluates incoming metrics against the appropriate model to raise alerts.
4. Anomaly Diagnosis
Once an anomaly is flagged, the platform performs root‑cause analysis using kernel‑level insights rather than solely expert rules. The article details several diagnostic paradigms.
4.1 Replication Lag (Kernel Code‑Path Analysis)
The seconds_behind_master metric is computed as:
seconds_behind_master = (long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_masterKey variables: time(0): current system time (seconds). mi->rli->last_master_timestamp: timestamp of the latest replayed transaction group. mi->clock_diff_with_master: time difference between master and slave, affected by sql_delay and slave_checkpoint_period.
Source code shows how last_master_timestamp originates from rli->gaq->head_queue() → new_ts, which is derived from the transaction’s start time and execution duration ( when.tv_sec + exec_time). The article provides the relevant snippets:
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)
{
do
{
cnt = rli->gaq->move_queue_head(&rli->workers);
}.......................
ts = rli->gaq->empty()
? 0
: reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts; // ts from get_slave_worker
rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);
// community version: rli->reset_notified_checkpoint(cnt, rli->gaq->lwm.ts, need_data_lock);
/* end-of "Coordinator::"commit_positions */
......................
} Slave_worker *Log_event::get_slave_worker(Relay_log_info *rli)
{
if (ends_group() || (!rli->curr_group_seen_begin &&
(get_type_code() == binary_log::QUERY_EVENT || !rli->curr_group_seen_gtid)))
{
..............
ptr_group->checkpoint_seqno = rli->checkpoint_seqno;
ptr_group->ts = common_header->when.tv_sec + (time_t)exec_time; // Seconds_behind_master related
rli->checkpoint_seqno++;
}
}By tracing these paths, the platform enumerates all factors influencing replication lag, such as transaction start time, execution time, sql_delay, and checkpoint periods, solving the “cannot enumerate all root causes” problem.
4.2 Large‑Transaction Diagnosis (Kernel Enhancements)
Large transactions can hold locks for extended periods or modify many rows, stressing resources. Two challenges are identified:
Missing a complete SQL list for a transaction.
Unclear composition of transaction latency (SQL execution vs. external wait).
Solutions:
Assign a unique trx_id at transaction start (instead of only during write) and attach it to every SQL statement, enabling full reconstruction of the transaction’s SQL list.
Instrument the kernel to record timestamps at SQL start and end, separating execution time from Sleep time. The platform also captures network‑level latency via mysql_socket_send_time and vio_socket_io_wait_time.
Resulting UI shows the full SQL list, per‑SQL execution time, and Sleep intervals, helping users pinpoint the exact cause of a large transaction.
4.3 MySQL Crash Analysis (Core Dump & Signal)
Crashes are classified into two categories: self‑induced (MySQL calls abort()) and OS‑induced (signal from the kernel). The article outlines the typical flow:
MySQL detects a critical condition (data corruption, disk full, latch timeout) and triggers ut_a or ut_error, which call ut_dbg_assertion_failed and then abort(). abort() raises SIGABRT twice—first to invoke the custom handler handle_fatal_signal (which logs thread ID, file, line), second to let the OS generate a core dump.
Signal‑type analysis guides root‑cause direction:
Signal 6 : Disk full/read‑only or data corruption; often accompanied by “Either disk is full …” or “Database page corruption …”.
Signal 7 : Hardware memory error (MCE logs).
Signal 9 : Process killed by kill.
Signal 11 : MySQL bug; requires inspecting the SQL that was executing (found in m_query_string) or the core dump.
Examples show how a corrupted record type (value 7 instead of 0‑3) leads to a crash, and how fixing the kernel’s prev_rec check prevents the fault.
5. Author & References
Author: 裕锋, Meituan Base R&D Platform – Database Autonomous Platform team.
References include GitHub repositories for the Percona Server source, crash‑log examples, and several academic papers on self‑driving databases and statistical modeling.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
