
How LeTV Built a Scalable Docker‑Powered RDS Platform

This article examines the limitations of traditional database provisioning, explains why a platform-based RDS solution is needed, and details the evolution of LeTV's cloud-native RDS service, covering Docker containerization, architecture upgrades, elastic scaling, monitoring, and lessons learned. It provides a comprehensive guide for building efficient, automated database platforms.


1. Traditional Database Bottlenecks and Issues

1.1 Traditional database creation steps

Business users and DBAs request a database, providing workload and resource requirements.

DBA selects physical resources and installs the database.

DBA delivers the database and connection info (IP, port, etc.).

Business users initialize the database, import data, and request read/write or read‑only accounts.

Each step requires DBA involvement; when the number of daily requests grows to dozens or hundreds, DBA workload becomes a bottleneck and slows business deployment.

1.2 Why databases need platformization

Most DBA requests are repetitive; a platform can automate >90% of these tasks, allowing users or DBAs to complete them with a few clicks.

2. RDS Development

With the rise of IaaS and PaaS, databases have shifted from traditional services to cloud‑based offerings. RDS provides elastic scaling, stability, ease of use, security, performance monitoring, backup, recovery, and cost savings.

Ease of Use

Web‑based management enables rapid deployment and reduces maintenance overhead.

Flexibility

RDS uses standard IP + port connections and integrates with ECS, SLB, GCE, etc.

Horizontal Scalability

Resources can be adjusted quickly at both node and cluster levels without affecting services.

High Availability

Failover to other nodes or clusters prevents single‑point failures.

Cloud‑Native

Automation, resource pooling, and online centralized management reduce manual effort.

3. LeTV RDS

3.1 Birth

3.1.1 Why LeTV RDS was created

Rapid business growth exposed problems with manual database provisioning, password changes, and performance visibility. LeTV RDS lets users create databases with a single click, granting them management rights while the platform ensures stability.

LeTV RDS is Docker‑based, isolating CPU, memory, and I/O resources to improve stability and utilization.
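To illustrate the kind of per-container isolation involved (a minimal sketch, not LeTV's actual launch command; the image name, device path, and limit values are invented), a database container can be capped on CPU, memory, and block I/O using standard `docker run` flags:

```python
# Sketch: assemble a `docker run` command that caps CPU, memory, and disk I/O
# for a database container. Values and image are illustrative only.
def build_run_command(name, image, cpus, mem_gb, io_mbps, device="/dev/sda"):
    return [
        "docker", "run", "-d",
        "--name", name,
        "--cpus", str(cpus),                            # CPU quota
        "--memory", f"{mem_gb}g",                       # memory hard limit
        "--device-read-bps", f"{device}:{io_mbps}mb",   # block-I/O read cap
        "--device-write-bps", f"{device}:{io_mbps}mb",  # block-I/O write cap
        image,
    ]

cmd = build_run_command("rds-node-1", "percona:5.7", cpus=4, mem_gb=8, io_mbps=100)
```

With limits like these, a noisy neighbor on one node cannot starve the other database containers sharing the same physical host.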

LeTV has run RDS nodes in Docker containers since 2014, making it one of the early adopters in China.

3.1.2 Why Docker was chosen

Open‑source and customizable

Fast deployment

Flexibility

Rich image ecosystem

Resource isolation

Lightweight, sufficient for database workloads

3.1.3 Advantages

Speed

Stability

Control

SSD eliminates I/O bottlenecks

All nodes active; no hot standby needed

Read/write from any node

Horizontal read/write scaling

Dynamic node addition/removal transparent to clients

3.1.4 Challenges

Database version and architecture selection: Adopted Percona XtraDB Cluster (PXC) and named the container cluster Mcluster.

Account permission control: Strictly limited privileges.

Physical resource isolation: Managed CPU, memory, and I/O isolation within Docker.

User onboarding: Educating users to manage databases without DBA assistance.

Backup and recovery: Implemented multi‑copy backups with daily and real‑time options.

Global performance monitoring: Defined custom metrics for cluster health.

Alert dimension tuning: Reduced noisy alerts by refining thresholds.
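In a PXC cluster, node health can be judged from Galera's `wsrep_*` status variables. The variable names below are standard Galera status output; the health rules themselves are an illustrative simplification:

```python
# Sketch: interpret Galera/PXC status from `SHOW GLOBAL STATUS LIKE 'wsrep_%'`.
# The variable names are standard Galera ones; the rules are simplified.
def node_is_healthy(status):
    return (
        status.get("wsrep_cluster_status") == "Primary"          # in the quorum
        and status.get("wsrep_ready") == "ON"                    # accepting queries
        and status.get("wsrep_local_state_comment") == "Synced"  # fully synced
    )

sample = {
    "wsrep_cluster_status": "Primary",
    "wsrep_ready": "ON",
    "wsrep_local_state_comment": "Synced",
    "wsrep_cluster_size": "3",
}
```

A node that reports `Donor/Desynced` instead of `Synced` is serving a state transfer and should be pulled out of rotation until it catches up.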

3.2 Development and Scale

LeTV RDS now serves more than 900 internal business lines, runs on over 3,000 containers, and hosts more than 60% of the group's databases.

3.3 Overall Architecture

The architecture consists of:

Database layer (MySQL, PostgreSQL, etc.)

Matrix layer for creation, management, monitoring, and resource scheduling

Data Analysis layer for log and user behavior analysis

BeeHive layer for component orchestration

3.4 Basic Usage Flow

Business users: Use Matrix to create, scale, configure permissions, view performance, and manage resources.

DBA & Platform admins: Higher‑level control, performance analysis, and log inspection.

Application servers: Access databases via local Gbalancer middleware.

ES cluster: Collects Mcluster logs for analysis.

Physical resource pool: Supplies resources to Mcluster; can be expanded dynamically.

3.5 Elastic Scaling

Based on a "large resource pool" concept, databases can automatically expand or shrink resources during peak and off‑peak periods, and migrate smoothly between physical servers without downtime.
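The peak/off-peak behavior amounts to a threshold rule on recent utilization. A minimal sketch of such a decision (the 80%/30% thresholds are invented for illustration):

```python
# Sketch: decide whether an instance should expand or shrink based on recent
# resource utilization. The 80%/30% thresholds are illustrative only.
def scale_decision(cpu_util, mem_util, high=0.80, low=0.30):
    if max(cpu_util, mem_util) > high:
        return "expand"   # peak traffic: take more from the resource pool
    if max(cpu_util, mem_util) < low:
        return "shrink"   # off-peak: return resources to the pool
    return "hold"
```

In practice the decision would also consider I/O load and a cooldown period to avoid oscillating between expand and shrink.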

3.6 Database Alerting

Two alert levels:

Normal : Non‑critical warnings (e.g., read/write latency) sent via SMS, WeChat, email.

Severe : Critical failures (e.g., node outage) trigger phone calls and immediate response.
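The two-level policy maps naturally onto a small routing table. A minimal sketch using the channels named above:

```python
# Sketch: route an alert to notification channels by severity, following the
# two-level policy described above. Illustrative only.
CHANNELS = {
    "normal": ["sms", "wechat", "email"],  # non-critical: asynchronous channels
    "severe": ["phone"],                   # critical: phone call, immediate response
}

def route_alert(severity):
    # Unknown severities fall back to the non-critical channels.
    return CHANNELS.get(severity, CHANNELS["normal"])
```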

3.7 Monitoring

Platform‑wide monitoring covers container CPU/memory/I/O, database TPS/QPS, InnoDB usage, and cluster synchronization status, with dashboards for each metric.
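TPS and QPS are typically derived from deltas of MySQL's cumulative status counters sampled at a fixed interval. The counter names below are standard `SHOW GLOBAL STATUS` variables; the sampling code is omitted:

```python
# Sketch: derive QPS/TPS from two samples of MySQL's cumulative status
# counters. Standard formulas; how the samples are collected is omitted.
def qps_tps(prev, curr, interval_s):
    qps = (curr["Questions"] - prev["Questions"]) / interval_s
    tps = (
        (curr["Com_commit"] - prev["Com_commit"])
        + (curr["Com_rollback"] - prev["Com_rollback"])
    ) / interval_s
    return qps, tps

prev = {"Questions": 1000, "Com_commit": 100, "Com_rollback": 10}
curr = {"Questions": 7000, "Com_commit": 400, "Com_rollback": 10}
```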

3.8 Gbalancer Middleware

Gbalancer provides high availability, load balancing, and read/write splitting with three modes:

Round‑robin across all nodes

Single‑node with failover

Tunnel mode for persistent connections, reducing short‑connection overhead.

Gbalancer is open‑source (https://github.com/zhgwenming/gbalancer).
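The first two modes can be sketched as selection logic (Gbalancer itself is written in Go; this Python sketch only illustrates the node-selection behavior, not the real implementation):

```python
# Sketch of Gbalancer's first two modes: round-robin across all nodes, and
# single-node with failover. See the gbalancer repo for the real code.
import itertools

def round_robin(nodes):
    """Mode 1: yield nodes in rotation."""
    return itertools.cycle(nodes)

def pick_with_failover(nodes, is_alive):
    """Mode 2: return the first healthy node in preference order."""
    for node in nodes:
        if is_alive(node):
            return node
    raise RuntimeError("no healthy node available")

nodes = ["10.0.0.1:3306", "10.0.0.2:3306", "10.0.0.3:3306"]
rr = round_robin(nodes)
picks = [next(rr) for _ in range(4)]  # wraps around after the last node
```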

3.9 Log Collection and Analysis

All logs (error, slow query, container) are stored in Elasticsearch, indexed, and visualized on the Matrix platform for rapid issue detection and root‑cause analysis.
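For example, a slow-query log entry can be reduced to a structured document before indexing. The regex below matches MySQL's standard `# Query_time: ...` slow-log header line; the surrounding pipeline details are illustrative:

```python
import re

# Sketch: turn one MySQL slow-log metrics line into a document suitable for
# indexing into Elasticsearch. The regex follows MySQL's standard slow-log
# header format; everything around it is illustrative.
PATTERN = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s+Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)

def parse_slow_log_line(line):
    m = PATTERN.search(line)
    if not m:
        return None
    d = m.groupdict()
    return {
        "query_time": float(d["query_time"]),
        "lock_time": float(d["lock_time"]),
        "rows_sent": int(d["rows_sent"]),
        "rows_examined": int(d["rows_examined"]),
    }

line = "# Query_time: 3.500000  Lock_time: 0.000120 Rows_sent: 1  Rows_examined: 100000"
```

Once the numeric fields are extracted, queries can be ranked and aggregated in Elasticsearch instead of being grepped out of raw log files.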

3.10 Pitfalls Encountered

Online modification of large tables impacts the cluster.

Multi‑node writes cause deadlocks.

Massive DML pauses the cluster.

Cluster desynchronization scenarios.

Automatic failover when a replica fails.

Recovery steps for MyISAM tables.

4. Future Outlook

LeTV RDS will continue to enhance functionality and stability, focusing on Virtual SQL Data Layer (VSDL) and a globally distributed database.

Q&A

Q1: Do you use Zabbix for centralized monitoring?

A1: Yes, Zabbix monitors database status via our custom APIs.

Q2: How do you define and implement operation standardization?

A2: We standardize servers, IDC, network, architecture, containers, etc., using platform‑wide policies to keep pace with growth.

Q3: Which platform powers your operation automation?

A3: Our self‑developed Matrix management platform.

Q4: How do you achieve rapid scaling and deployment?

A4: Docker’s inherent capabilities combined with the beehive program and mcluster‑manager enable fast container deployment and cluster expansion.

Q5: How is your release system built?

A5: A custom system using Python and SaltStack for container‑level upgrades.

Q6: Does Filebeat cause duplicate logs or timeouts?

A6: No duplicate data; we use the date plugin to preserve original timestamps.

Q7: Are there time differences in cluster logs?

A7: Using the date plugin aligns log timestamps with ES storage time.

Q8: What backend storage do you use?

A8: Traditional SAS/SSD for RDS; GlusterFS for backup storage.

Q9: How do you parse logs efficiently?

A9: We extract key metrics into ES; Logstash processes logs via a message queue for higher throughput.

Q10: What problem does Gbalancer’s tunnel mode solve?

A10: It reduces resource consumption from many short connections by establishing persistent tunnels.

Q11: How do you mitigate deadlocks from multi‑node writes?

A11: Separate read and write traffic using different Gbalancer ports and modes.
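Concretely, the application points reads and writes at different local Gbalancer listeners, so all writes land on a single node while reads fan out. A minimal sketch of the routing (port numbers are invented for illustration):

```python
# Sketch: route statements to separate Gbalancer listeners so writes hit a
# single node (avoiding multi-node write deadlocks) while reads are balanced.
# Port numbers are illustrative only.
WRITE_PORT = 3307  # listener in single-node-with-failover mode
READ_PORT = 3308   # listener in round-robin mode across all nodes

def port_for(sql):
    verb = sql.lstrip().split(None, 1)[0].upper()
    return READ_PORT if verb == "SELECT" else WRITE_PORT
```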

Q12: How do you orchestrate Docker?

A12: Matrix acts as the orchestrator and beehive as the container manager.

Q13: Is your cluster consistency strong or eventual?

A13: PXC provides strong consistency with multi‑node validation before commit.

Q14: How does the cluster handle abrupt connection loss?

A14: Uncommitted sessions are rolled back automatically.

Q15: Does expansion copy all data?

A15: New nodes restore from shared backup disks; depending on the scenario they perform either a full state snapshot transfer (SST) or an incremental state transfer (IST).

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
