
How HULK Private Cloud Automates MySQL: Architecture, Scaling, and Recovery

This article details the design, automation, high‑availability architecture, backup strategy, and operational workflow of the HULK private‑cloud MySQL service, covering instance provisioning, table management, testing environments, optimization tips, data recovery, and future plans.

360 Zhihui Cloud Developer

Abstract

HULK private‑cloud platform introduction

Current MySQL service scale

Common MySQL service features

Design and implementation ideas of each functional module

Future plans

Q&A

Current Status & Service Scale

HULK is 360's internal private‑cloud platform covering cloud computing, databases, big data, monitoring and more. The MySQL service is a core component of the HULK database suite, with over 9,000 instances, daily traffic exceeding 200 billion requests, and total data volume over 270 TB.

Before automation, request communication, resource management, and service deployment consumed significant staff time, and end-to-end provisioning could take hours. After full automation, provisioning completes within minutes.

Automation enables developers to submit database requests anytime, improving development efficiency dramatically.

Design & Implementation of Functional Modules

The following sections demonstrate the end‑to‑end workflow from request to instance retirement, illustrating the design of each module.

Create New Instance

Instance creation is fully automated. Users select a package, an IDC location, and target machines, then submit a task; depending on the IDC deployment, the task completes in 30 seconds to 1 minute. After submission, users can track the ticket status, and connection details are emailed on completion. The underlying task system, QCMD, is developed in-house.

Automation requires careful resource analysis and control. Instances are classified into packages, each mapping to specific database resources. Server resources are monitored, scored, and used to guide instance placement.

Database Architecture & High Availability

The default architecture uses dual IDC deployment with Atlas as a middle layer for read/write separation. Atlas is open‑source (https://github.com/Qihoo360/Atlas). LVS provides isolation and load balancing. Each IDC runs multiple service nodes for fault tolerance.

When an Atlas node fails, LVS removes it from the pool; when a MySQL replica fails, Atlas takes it out of rotation. For primary failure, the MySQL Failover service detects the outage, selects a new primary, synchronizes data, rebuilds the replication topology, updates the Atlas configuration, and completes the switch in about 15 seconds.
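The primary-failover sequence can be sketched roughly as follows. This is a simplified illustration under assumed data structures (replicas as dicts with an integer applied-GTID counter), not the actual MySQL Failover implementation:

```python
def failover(replicas: list[dict], atlas_conf: dict) -> dict:
    """Promote the most up-to-date replica and repoint the others.

    replicas: dicts with 'host', 'applied_gtid', 'master' keys
    atlas_conf: dict with a 'write_host' key (hypothetical shape).
    """
    # 1. Elect the replica with the most replication progress applied,
    #    to minimize data loss after the switch.
    new_primary = max(replicas, key=lambda r: r["applied_gtid"])
    # 2. Rebuild the topology: repoint the remaining replicas.
    for r in replicas:
        if r is not new_primary:
            r["master"] = new_primary["host"]
    # 3. Update Atlas so writes flow to the new primary.
    atlas_conf["write_host"] = new_primary["host"]
    return new_primary
```

A real implementation would additionally wait for relay logs to drain and verify the new primary is writable before flipping Atlas.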

Create / Alter Table

Table creation and alteration are performed via the HULK platform. Sixteen checklist items ensure DDL compliance, covering legality, naming length, unsigned usage, engine selection, data type constraints, index rules, and reserved word avoidance.

Validate table structure legality

Column and table name length ≤ 16

Require UNSIGNED where appropriate

Use InnoDB engine

INT/BIGINT length ≥ 10

VARCHAR length < 3000

TEXT fields ≤ 3

Primary key must be INT

No duplicate indexes

Maximum 5 indexes (including primary)

Index columns must be NOT NULL with default values (except auto‑increment)

SQL must use indexes

SQL must not reference non‑audited tables

No wildcard * in SQL

Auto‑increment columns must be INT or BIGINT

Avoid MySQL reserved words

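A minimal sketch of how a subset of these checks could be automated against a parsed table definition. The spec format and rule subset are hypothetical; the actual platform presumably validates real DDL text:

```python
def check_table(spec: dict) -> list[str]:
    """Return a list of checklist violations for a parsed table spec."""
    errors = []
    if len(spec["name"]) > 16:
        errors.append("table name longer than 16 characters")
    if spec.get("engine", "").lower() != "innodb":
        errors.append("engine must be InnoDB")
    for col in spec["columns"]:
        if len(col["name"]) > 16:
            errors.append(f"column {col['name']} longer than 16 characters")
        if col["type"].lower() == "varchar" and col.get("length", 0) >= 3000:
            errors.append(f"column {col['name']}: VARCHAR length must be < 3000")
    if sum(1 for c in spec["columns"] if c["type"].lower() == "text") > 3:
        errors.append("more than 3 TEXT columns")
    if len(spec.get("indexes", [])) > 5:
        errors.append("more than 5 indexes (including primary)")
    return errors
```

An empty list means the definition passes this subset of the sixteen rules; anything else is surfaced to the user before the DDL ticket proceeds.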

Schema changes use pt‑osc; the team is also evaluating gh‑ost.

Test Environment

A dedicated test environment is provided and linked with the production environment, allowing seamless schema migration between them.

Optimization Suggestions

The platform collects three types of optimization data: slow‑query logs (queries > 0.5 s, aggregated daily and analyzed with pt‑query‑digest), unused indexes (derived from MySQL 5.6 performance_schema), and CHAR field usage (identifying CHAR columns with actual length far below defined length, recommending conversion to VARCHAR).
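The daily slow-query aggregation can be illustrated with a small digest function, in the spirit of what pt-query-digest produces. The entry format and field names are assumptions for the sketch:

```python
from collections import defaultdict

def digest(entries, threshold: float = 0.5):
    """Aggregate slow-log entries above the threshold by query fingerprint.

    entries: iterable of dicts with 'fingerprint' (normalized query text)
    and 'seconds' (execution time) keys.
    """
    stats = defaultdict(lambda: {"count": 0, "total": 0.0})
    for e in entries:
        if e["seconds"] > threshold:  # only queries slower than 0.5 s
            s = stats[e["fingerprint"]]
            s["count"] += 1
            s["total"] += e["seconds"]
    # Worst offenders first, ranked by total time consumed.
    return sorted(stats.items(), key=lambda kv: kv[1]["total"], reverse=True)
```

The resulting ranking is what gets pushed back to developers as daily optimization suggestions.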

Data Recovery

Automated recovery tasks let users restore data up to 7 days old without DBA intervention. Users select the target database/table and a point‑in‑time, submit the task, and receive a temporary instance for verification before replacing the production instance.
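Point-in-time recovery planning reduces to choosing the latest full backup at or before the target time, plus the binlogs between that backup and the target. A simplified sketch, treating each binlog as identified by its start timestamp (names and shapes are assumptions):

```python
def plan_recovery(full_backups, binlogs, target_ts):
    """Pick the restore base and binlogs for a point-in-time recovery.

    full_backups: list of (timestamp, path) for full backups
    binlogs: list of (start_timestamp, path) for binlog files
    target_ts: desired recovery point
    """
    # Latest full backup taken at or before the target time.
    base = max((b for b in full_backups if b[0] <= target_ts),
               key=lambda b: b[0], default=None)
    if base is None:
        raise ValueError("no full backup before target time")
    # Binlogs starting between the base backup and the target point
    # (simplification: ignores a log that spans the backup moment).
    logs = [path for ts, path in binlogs if base[0] <= ts <= target_ts]
    return base[1], logs
```

The restored data lands in a temporary instance, as described above, so users can verify it before touching production.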

Data Backup

The backup system provides multi‑dimensional backup (full + incremental binlog), a 4‑2‑2‑1 retention policy (4 days of daily full backups, 2 weeks, 2 months, 1 year), automatic strategy updates based on replica status, storage and network conditions, failure detection with alerting, and automatic expiration cleanup.
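One plausible reading of the 4-2-2-1 retention policy (an assumption; the article does not spell out the tier semantics) is: keep every daily full backup for 4 days, one per week for 2 weeks, one per month for 2 months, and one per year for 1 year. Expiration cleanup could then be a pure predicate over backup dates:

```python
import datetime as dt

def keep(backup_date: dt.date, today: dt.date) -> bool:
    """Decide whether a daily full backup is retained.

    Tier boundaries and representative days (Monday, the 1st, Jan 1)
    are illustrative choices, not the platform's documented policy.
    """
    age = (today - backup_date).days
    if age < 4:
        return True                                  # daily tier: 4 days
    if age < 14:
        return backup_date.weekday() == 0            # weekly tier: keep Mondays
    if age < 62:
        return backup_date.day == 1                  # monthly tier: keep the 1st
    if age < 366:
        return backup_date.month == 1 and backup_date.day == 1  # yearly tier
    return False
```

Backups for which the predicate returns False are candidates for the automatic expiration cleanup mentioned above.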

Backup is built on Percona XtraBackup with enhancements: per‑table compression for fast single‑table restore, support for both single‑ and multi‑table recovery, data encryption and encrypted transmission, and multiple restore modes (point‑in‑time, binlog position, SQL).

Future Plans

After automating routine operations, development efficiency improves and DBAs can focus on service quality and new technology research.

Q&A

Q1: How will future database migration be handled? Can developers operate it easily? A1: The goal is a generic migration service where users fill in source and target configurations with minimal impact on the source.
Q2: Details on DTS implementation for large databases? A2: DTS is still under development; more information will be shared later.
Q3: Do you perform SQL audit logging? A3: Full SQL logs are collected via Atlas; see the GitHub link for details.
Q4: How is the test‑to‑production environment connection achieved? Docker? A4: The HULK platform links environments through business‑level permissions without Docker.
Q5: Where are database instances deployed? How to run multiple MySQL instances on one physical machine? A5: Instances run on physical machines; multiple instances are configured with separate config files and data directories.
Q6: How to handle > 1 million daily orders in MySQL? Sharding? A6: Sharding decisions depend on data volume, traffic, concurrency, and hardware; sometimes a single instance suffices.
Q7: Can you introduce other databases in the platform? A7: Redis has been covered elsewhere; MongoDB and others will be shared in future articles.
Q8: How to configure master‑slave and ensure zero‑downtime during schema changes? A8: Refer to the earlier fault‑tolerance section for failover mechanisms.
Q9: Is installing MySQL inside a VM acceptable? A9: Yes, but be aware of resource contention among VMs on the same host.
Q10: How to manage resource allocation among multiple instances on the same host? A10: Use cgroup, xfs_quota, and TC for CPU, disk, and network isolation; monitor resources and apply limits when thresholds are approached.
Q11: Can horizontal partitioning be dynamically scaled? A11: DTS + internal Atlas can achieve this; details will be open‑sourced when mature.
Tags: Operations, High Availability, MySQL, Backup, Private Cloud, Database Automation
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
