How LeTV Built a Scalable Docker‑Powered RDS Platform
This article examines the limitations of traditional database provisioning, explains why a platform‑based RDS solution is needed, and details LeTV's evolution of its cloud‑native RDS service—including Docker containerization, architecture upgrades, elastic scaling, monitoring, and lessons learned—providing a comprehensive guide for building efficient, automated database platforms.
Traditional DB Bottlenecks and Issues
1.1 Traditional database creation steps
Business users and DBAs request a database, providing workload and resource requirements.
DBA selects physical resources and installs the database.
DBA delivers the database and connection info (IP, port, etc.).
Business users initialize the database, import data, and request read/write or read‑only accounts.
Each step requires DBA involvement; when the number of daily requests grows to dozens or hundreds, DBA workload becomes a bottleneck and slows business deployment.
1.2 Why databases need platformization
Most DBA requests are repetitive; a platform can automate >90% of these tasks, allowing users or DBAs to complete them with a few clicks.
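The four manual steps in 1.1 map directly onto one automated request handler. The sketch below is illustrative only (the `DBRequest` type, `POOL` host map, and account-naming scheme are hypothetical, and container launch plus `CREATE USER` are stubbed out); it shows the shape of such a platform API, not LeTV's implementation.

```python
from dataclasses import dataclass
import secrets

@dataclass
class DBRequest:
    app: str          # business line requesting the database
    cpu_cores: int
    memory_gb: int
    disk_gb: int

# Hypothetical resource pool: host name -> free memory in GB.
POOL = {"host-01": 64, "host-02": 16}

def provision(req: DBRequest) -> dict:
    """Automate the four manual steps: pick a host, launch the
    instance, create accounts, and return connection info."""
    # Step 2: pick the host with the most free memory that fits the request.
    host = max((h for h, free in POOL.items() if free >= req.memory_gb),
               key=POOL.get, default=None)
    if host is None:
        raise RuntimeError("no host with enough free memory")
    POOL[host] -= req.memory_gb
    # Steps 3-4: deliver the endpoint and generated accounts (stubbed here;
    # a real platform would start a container and run CREATE USER).
    return {
        "endpoint": f"{host}:3306",
        "rw_user": f"{req.app}_rw", "rw_pass": secrets.token_hex(8),
        "ro_user": f"{req.app}_ro", "ro_pass": secrets.token_hex(8),
    }
```

With this in place, a request that previously needed four DBA touchpoints becomes a single call returning the endpoint and both account pairs.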
RDS Development
With the rise of IaaS and PaaS, databases have shifted from traditional services to cloud‑based offerings. RDS provides elastic scaling, stability, ease of use, security, performance monitoring, backup, recovery, and cost savings.
Ease of Use
Web‑based management enables rapid deployment and reduces maintenance overhead.
Flexibility
RDS uses standard IP + port connections and integrates with ECS, SLB, GCE, etc.
Horizontal Scalability
Resources can be adjusted quickly at both node and cluster levels without affecting services.
High Availability
Failover to other nodes or clusters prevents single‑point failures.
Cloud‑Native
Automation, resource pooling, and online centralized management reduce manual effort.
LeTV RDS
3.1 Birth
3.1.1 Why LeTV RDS was created
Rapid business growth exposed problems with manual database provisioning, password changes, and performance visibility. LeTV RDS lets users create databases with a single click, granting them management rights while the platform ensures stability.
LeTV RDS is Docker‑based, isolating CPU, memory, and I/O resources to improve stability and utilization.
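The article does not show LeTV's exact isolation settings. As a sketch of how such limits are expressed, modern Docker exposes CPU, memory, and block-I/O caps as `docker run` flags; the helper below assembles such a command (the device path, image name, and numbers are illustrative):

```python
def docker_run_cmd(name: str, image: str, cpus: float, mem_gb: int,
                   read_bps_mb: int, data_dev: str = "/dev/sda") -> list:
    """Build a `docker run` command that caps CPU, memory, and disk
    read bandwidth for one database container."""
    return [
        "docker", "run", "-d", "--name", name,
        "--cpus", str(cpus),                                  # CPU quota
        "--memory", f"{mem_gb}g",                             # memory hard limit
        "--device-read-bps", f"{data_dev}:{read_bps_mb}mb",   # I/O throttle
        image,
    ]
```

Capping all three dimensions per container is what lets many database nodes share one physical host without one noisy tenant starving the rest.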
Since 2014, LeTV has used Docker containers for RDS nodes, an early adopter in China.
3.1.2 Why Docker was chosen
Open‑source and customizable
Fast deployment
Flexibility
Rich image ecosystem
Resource isolation
Lightweight, sufficient for database workloads
3.1.3 Advantages
Speed
Stability
Control
SSD eliminates I/O bottlenecks
All nodes active; no hot standby needed
Read/write from any node
Horizontal read/write scaling
Dynamic node addition/removal transparent to clients
3.1.4 Challenges
Database version and architecture selection: Adopted Percona XtraDB Cluster (PXC) and named the container cluster Mcluster.
Account permission control: Strictly limited privileges.
Physical resource isolation: Managed CPU, memory, and I/O isolation within Docker.
User onboarding: Educating users to manage databases without DBA assistance.
Backup and recovery: Implemented multi-copy backups with daily and real-time options.
Global performance monitoring: Defined custom metrics for cluster health.
Alert dimension tuning: Reduced noisy alerts by refining thresholds.
3.2 Development and Scale
LeTV RDS now serves over 900 internal business lines on 3000+ containers and hosts more than 60% of the group's databases.
3.3 Overall Architecture
The architecture consists of:
Database layer (MySQL, PostgreSQL, etc.)
Matrix layer for creation, management, monitoring, and resource scheduling
Data Analysis layer for log and user behavior analysis
BeeHive layer for component orchestration
3.4 Basic Usage Flow
Business users: Use Matrix to create, scale, configure permissions, view performance, and manage resources.
DBA & platform admins: Higher-level control, performance analysis, and log inspection.
Application servers: Access databases via the local Gbalancer middleware.
ES cluster: Collects Mcluster logs for analysis.
Physical resource pool: Supplies resources to Mcluster; can be expanded dynamically.
3.5 Elastic Scaling
Based on a "large resource pool" concept, databases can automatically expand or shrink resources during peak and off‑peak periods, and migrate smoothly between physical servers without downtime.
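The scaling policy itself is not detailed in the article. A minimal sketch of the decision step, assuming a simple utilization band (the thresholds, bounds, and function name are hypothetical):

```python
def target_memory_gb(current_gb: int, mem_used_pct: float,
                     low: float = 30.0, high: float = 80.0,
                     min_gb: int = 4, max_gb: int = 64) -> int:
    """Decide a container's next memory size from its utilization:
    grow during peaks, shrink back during off-peak periods."""
    if mem_used_pct >= high:
        return min(current_gb * 2, max_gb)   # peak: scale up, capped
    if mem_used_pct <= low:
        return max(current_gb // 2, min_gb)  # off-peak: return memory to the pool
    return current_gb                        # within band: no change
```

The "large resource pool" idea is what makes the shrink branch worthwhile: memory handed back by one container is immediately schedulable for another.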
3.6 Database Alerting
Two alert levels:
Normal: Non-critical warnings (e.g., read/write latency) sent via SMS, WeChat, and email.
Severe: Critical failures (e.g., node outage) trigger phone calls and immediate response.
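The two-level routing above can be sketched in a few lines; the event names and the `SEVERE_EVENTS` set here are illustrative, not LeTV's actual taxonomy:

```python
# Hypothetical set of events that warrant a phone call.
SEVERE_EVENTS = {"node_down", "cluster_desync"}

def route_alert(event: str) -> list:
    """Map an event to notification channels: severe failures add a
    phone call on top of the normal SMS/WeChat/email fan-out."""
    channels = ["sms", "wechat", "email"]
    if event in SEVERE_EVENTS:
        channels.append("phone")
    return channels
```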
3.7 Monitoring
Platform‑wide monitoring covers container CPU/memory/I/O, database TPS/QPS, InnoDB usage, and cluster synchronization status, with dashboards for each metric.
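TPS and QPS are not read directly; they are derived by sampling MySQL's cumulative `SHOW GLOBAL STATUS` counters (`Questions`, `Com_commit`, `Com_rollback`) at an interval and taking deltas. A small sketch of that computation (the sampling itself is assumed to happen elsewhere):

```python
def qps_tps(prev: dict, curr: dict, interval_s: float) -> tuple:
    """Derive QPS and TPS from two samples of MySQL's cumulative
    status counters taken interval_s seconds apart."""
    qps = (curr["Questions"] - prev["Questions"]) / interval_s
    tps = ((curr["Com_commit"] + curr["Com_rollback"])
           - (prev["Com_commit"] + prev["Com_rollback"])) / interval_s
    return qps, tps
```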
3.8 Gbalancer Middleware
Gbalancer provides high availability, load balancing, and read/write splitting with three modes:
Round‑robin across all nodes
Single‑node with failover
Tunnel mode for persistent connections, reducing short‑connection overhead.
Gbalancer is open‑source (https://github.com/zhgwenming/gbalancer).
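To make the first two modes concrete, here is a toy model, not Gbalancer's actual Go implementation: round-robin spreads picks over all live nodes, while failover pins to one node and moves only when it goes down.

```python
import itertools

class Balancer:
    """Toy model of two Gbalancer-style modes (names illustrative)."""
    def __init__(self, nodes, mode="roundrobin"):
        self.live = list(nodes)
        self.mode = mode
        self._rr = itertools.cycle(list(nodes))  # fixed rotation order

    def pick(self):
        if self.mode == "roundrobin":
            # Advance the rotation, skipping nodes marked down.
            for _ in range(len(self.live)):
                node = next(self._rr)
                if node in self.live:
                    return node
            raise RuntimeError("no live nodes")
        return self.live[0]  # failover: stick to the first live node

    def mark_down(self, node):
        self.live.remove(node)
```

The read/write-splitting advice in the Q&A (Q11) falls out of this: point writes at a failover-mode port so they land on one node, and reads at a round-robin port.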
3.9 Log Collection and Analysis
All logs (error, slow query, container) are stored in Elasticsearch, indexed, and visualized on the Matrix platform for rapid issue detection and root‑cause analysis.
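As one example of the pipeline's shape, slow-query log headers can be parsed into Elasticsearch bulk-API actions (newline-delimited JSON). This is a sketch under the assumption of the standard MySQL slow-log format; the index name is hypothetical and the actual LeTV pipeline ran through Logstash and a message queue (see Q9).

```python
import json
import re

def slow_log_to_bulk(lines, index="mysql-slow"):
    """Turn MySQL slow-log 'Query_time' header lines into
    Elasticsearch bulk-API action/source pairs (NDJSON)."""
    pat = re.compile(r"# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)")
    out = []
    for line in lines:
        m = pat.search(line)
        if m:
            out.append(json.dumps({"index": {"_index": index}}))
            out.append(json.dumps({"query_time": float(m.group(1)),
                                   "lock_time": float(m.group(2))}))
    return "\n".join(out) + "\n"
```

Once indexed this way, "which cluster's slow queries spiked at 03:00" becomes a dashboard query rather than a grep across containers.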
3.10 Pitfalls Encountered
Online modification of large tables impacts the cluster.
Multi‑node writes cause deadlocks.
Massive DML pauses the cluster.
Cluster desynchronization scenarios.
Automatic failover when a replica fails.
Recovery steps for MyISAM tables.
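Several of these pitfalls (desynchronization, failed nodes, SST/IST recovery) surface through PXC's `wsrep_*` status variables. A minimal health classifier over those variables, as a sketch (the return labels are illustrative; the variable names and state strings are Galera's):

```python
def node_health(status: dict) -> str:
    """Classify a PXC node from SHOW STATUS LIKE 'wsrep%' output:
    a healthy node is in the Primary component and 'Synced'."""
    if status.get("wsrep_cluster_status") != "Primary":
        return "partitioned"          # lost quorum / split-brain
    state = status.get("wsrep_local_state_comment")
    if state == "Synced":
        return "ok"
    if state in ("Donor/Desynced", "Joined", "Joining"):
        return "catching-up"          # SST or IST in progress
    return "down"
```

Polling this per node is enough to distinguish "wait, it is catching up" from "page someone, the cluster has partitioned".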
Future Outlook
LeTV RDS will continue to enhance functionality and stability, focusing on Virtual SQL Data Layer (VSDL) and a globally distributed database.
Q&A
Q1: Do you use Zabbix for centralized monitoring?
A1: Yes, Zabbix monitors database status via our custom APIs.
Q2: How do you define and implement operation standardization?
A2: We standardize servers, IDC, network, architecture, containers, etc., using platform‑wide policies to keep pace with growth.
Q3: Which platform powers your operation automation?
A3: Our self‑developed Matrix management platform.
Q4: How do you achieve rapid scaling and deployment?
A4: Docker’s inherent capabilities combined with the beehive program and mcluster‑manager enable fast container deployment and cluster expansion.
Q5: How is your release system built?
A5: A custom system using Python and SaltStack for container‑level upgrades.
Q6: Does Filebeat cause duplicate logs or timeouts?
A6: No duplicate data; we use the date plugin to preserve original timestamps.
Q7: Are there time differences in cluster logs?
A7: Using the date plugin aligns log timestamps with ES storage time.
Q8: What backend storage do you use?
A8: Traditional SAS/SSD for RDS; GlusterFS for backup storage.
Q9: How do you parse logs efficiently?
A9: We extract key metrics into ES; Logstash processes logs via a message queue for higher throughput.
Q10: What problem does Gbalancer’s tunnel mode solve?
A10: It reduces resource consumption from many short connections by establishing persistent tunnels.
Q11: How do you mitigate deadlocks from multi‑node writes?
A11: Separate read and write traffic using different Gbalancer ports and modes.
Q12: How do you orchestrate Docker?
A12: Matrix acts as the orchestrator and beehive as the container manager.
Q13: Is your cluster consistency strong or eventual?
A13: PXC provides strong consistency with multi‑node validation before commit.
Q14: How does the cluster handle abrupt connection loss?
A14: Uncommitted sessions are rolled back automatically.
Q15: Does expansion copy all data?
A15: New nodes restore from shared backup disks; they may perform full SST or incremental recovery depending on the scenario.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.