
Evolution of Meizu Flyme Operations Architecture and High‑Availability Practices

The article traces the evolution of Meizu's Flyme operations platform, from a single-cabinet setup in 2011 to a multi-IDC infrastructure of more than 6000 servers. It covers the challenges of each stage, the architectural upgrades, monitoring, cost control, automation, and future high-availability directions for large-scale internet services.

High Availability Architecture

This article presents a case study of Meizu's Flyme operations platform evolution, shared by Qian Jun at the "Internet Architecture: From 1 to 100" private salon held on August 20 in Shenzhen.

Background

Meizu entered the internet business early and became a mobile internet company in 2014. By the end of 2015, Flyme had over 30 million registered users, more than 1 million apps, and over 10 billion downloads, and revenue had grown 12×. That growth put increasing pressure on operations.

System Operations Architecture Evolution

1. Ancient Era (2011.1‑2011.12)

Scale: 1 cabinet, 5 servers, 2 services, part‑time dev‑ops.

Problems: data‑center stability, missing monitoring, single‑point architecture.

2. Stone Age (2012.1‑2014.6)

Scale: 1 IDC, 30 cabinets, 800 servers/VMs, >100 services, 12 ops staff.

Problems: vendor lock‑in (IOE), network instability, capacity limits, single‑point services, manual deployment, low monitoring coverage, DB pressure, security gaps.

3. Bronze Age (2014.7‑2015.12)

Scale: multiple IDCs, >150 cabinets, >4000 servers/VMs, >200 services, 35 ops staff.

Problems: low standardization, high maintenance cost, difficult capacity expansion, reliance on IOE, single‑point services, diverse failures, resource inefficiency.

4. Iron Age (2016.1‑present)

Scale: multiple IDCs, >200 cabinets, >6000 servers/VMs, >200 services, 43 ops staff.

Problems addressed: monitoring quantification, diversified machine packages, high operational cost, workflow automation, low resource utilization, incident‑response planning.

Review Summary

Infrastructure planning: IDC migration, multi‑site three‑center design, reserve cabinet capacity, KVM‑based private cloud, Docker containers for micro‑services.

Monitoring & alarm: Zabbix‑based server‑proxy‑client architecture, unified alarm platform with severity‑based routing, alarm convergence reducing SMS volume from 5000+ to ~800 per day.

Cost control: resource‑usage monitoring, capacity management platform, container‑as‑service, multi‑vendor procurement, internal revenue accounting.

Standardization: OS, hardware, software, architecture, component, protocol standards; logging and deployment norms.

Automation: automated provisioning, CI/CD pipeline, gray‑release, self‑service publishing.

Incident preparedness: active‑active disaster recovery, rapid switch‑over, dedicated line‑switch drills.

Overall Operations Architecture

Meizu follows a multi-layered, high-availability model: every business runs at least two instances of each service and each database, and critical workloads are additionally deployed with two-data-center and three-center redundancy.
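The redundancy rule above can be checked mechanically. The following is a minimal sketch, assuming an inventory expressed as (service, data center) pairs; the function name, service names, and data-center labels are hypothetical and do not come from Meizu's tooling.

```python
from collections import defaultdict


def audit_redundancy(instances, critical=()):
    """Flag services that violate the redundancy rule.

    instances: iterable of (service, datacenter) pairs, one per running instance.
    critical:  names of services that must span at least two data centers.
    Returns {service: reason} for every violation found.
    """
    by_service = defaultdict(list)
    for service, datacenter in instances:
        by_service[service].append(datacenter)

    problems = {}
    for service, datacenters in by_service.items():
        if len(datacenters) < 2:
            problems[service] = "single instance"
        elif service in critical and len(set(datacenters)) < 2:
            problems[service] = "critical but confined to one data center"
    return problems
```

A report like this could run against the CMDB on a schedule, so that a single-point service is caught before it causes an outage rather than after.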

Monitoring System

Meizu uses Zabbix as the core monitoring solution in a server-proxy-client architecture: proxies collect data from the agents and relay it to the server, which holds the persistent history. Templates are standardized per team, and hosts are registered automatically from the CMDB so that no machine is left unmonitored.
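CMDB-driven registration boils down to a set difference between what the CMDB knows and what is already monitored. Here is a minimal sketch under that assumption; the template names, team names, and the `CmdbHost` type are hypothetical, and the actual call to the monitoring system is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CmdbHost:
    hostname: str
    team: str  # owning team; decides which template set applies


# Hypothetical per-team template sets (the article does not list real ones).
DEFAULT_TEMPLATES = ["tpl-os-base"]
TEAM_TEMPLATES = {
    "web": ["tpl-os-base", "tpl-nginx"],
    "db": ["tpl-os-base", "tpl-mysql"],
}


def plan_registrations(cmdb_hosts, monitored_hostnames):
    """Return (hostname, templates) for CMDB hosts that are not yet monitored."""
    monitored = set(monitored_hostnames)
    plan = []
    for host in sorted(cmdb_hosts, key=lambda h: h.hostname):
        if host.hostname in monitored:
            continue
        plan.append((host.hostname, TEAM_TEMPLATES.get(host.team, DEFAULT_TEMPLATES)))
    return plan


def coverage(cmdb_hosts, monitored_hostnames):
    """Fraction of CMDB hosts that already have monitoring."""
    names = {h.hostname for h in cmdb_hosts}
    if not names:
        return 1.0
    return len(names & set(monitored_hostnames)) / len(names)
```

Keeping the planning step separate from the step that talks to the monitoring server makes the plan easy to review or dry-run before anything is changed.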

Unified Alarm Platform

Alarm information from Zabbix is fed into a unified platform where severity‑based rules perform alarm convergence and escalation, reducing noise and SMS costs.
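To illustrate how convergence cuts message volume, here is a minimal sketch that collapses repeats of the same alarm inside a time window and routes by severity. The severity names, channels, and 300-second window are assumptions; the article only states that rules are severity-based. Escalation is not modeled.

```python
from dataclasses import dataclass

# Hypothetical severity-to-channel routing table.
ROUTES = {"disaster": "sms", "high": "sms", "warning": "email", "info": "dashboard"}


@dataclass(frozen=True)
class Alarm:
    host: str
    trigger: str
    severity: str
    timestamp: int  # seconds since epoch


def converge(alarms, window=300):
    """Collapse repeats of the same (host, trigger) within `window` seconds.

    The window is measured from the first alarm of each group.
    Returns a list of (channel, first_alarm, repeat_count) notifications.
    """
    notifications = []  # each entry: [channel, first_alarm, count]
    open_slot = {}      # (host, trigger) -> index of its current notification
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.host, alarm.trigger)
        idx = open_slot.get(key)
        if idx is not None and alarm.timestamp - notifications[idx][1].timestamp < window:
            notifications[idx][2] += 1  # repeat: count it, send nothing new
            continue
        open_slot[key] = len(notifications)
        notifications.append([ROUTES.get(alarm.severity, "dashboard"), alarm, 1])
    return [tuple(n) for n in notifications]
```

With this scheme a flapping trigger produces one message per window instead of one per event, which is the kind of effect behind the drop in daily SMS volume.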

OS Standardization Inspection System

Incidents such as netfilter settings that failed to take effect prompted the creation of an OS-level inspection system, which automatically detects hosts that deviate from the standard and triggers remediation.
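An inspection of this kind can be sketched as a comparison between a host's live kernel parameters and a baseline. The baseline values below are hypothetical, since the article does not publish Meizu's standard, and remediation is out of scope for the sketch.

```python
# Hypothetical baseline; real values would come from the OS standard.
BASELINE = {
    "net.ipv4.tcp_syncookies": "1",
    "net.netfilter.nf_conntrack_max": "1048576",
    "vm.swappiness": "10",
}


def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def inspect_host(actual, baseline=BASELINE):
    """Return {key: (expected, actual_or_None)} for every deviation."""
    return {
        key: (expected, actual.get(key))
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }
```

A key that is absent from the live output is reported with `None`, which separates "set to the wrong value" from "not present on the host at all".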

System Security Checks

Regular checks cover root‑privilege accounts, empty passwords, file permissions, firewall rules, TCP syncookies, and other hardening measures.
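Three of the checks listed above are simple enough to sketch directly. The functions below work on the text of passwd-format and shadow-format files and on a numeric file mode, so they can be tried without root access; the function names are illustrative and not taken from Meizu's tooling.

```python
import stat


def extra_root_accounts(passwd_text):
    """Accounts other than `root` whose UID field is 0."""
    found = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == "0" and fields[0] != "root":
            found.append(fields[0])
    return found


def empty_password_accounts(shadow_text):
    """Accounts whose password field (second column) is empty."""
    found = []
    for line in shadow_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 2 and fields[1] == "":
            found.append(fields[0])
    return found


def is_world_writable(mode):
    """True if the permission bits let any user write to the file."""
    return bool(mode & stat.S_IWOTH)
```

On a real host the inputs would come from `/etc/passwd`, `/etc/shadow`, and `os.stat(path).st_mode`, with findings fed into the same alarm pipeline as monitoring events.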

Process Management

Lifecycle management for servers (provisioning, operation, decommissioning) is modeled as atomic processes that are combined into composite workflows spanning multiple departments. Building workflows from reusable modules keeps the development cost of each new process low.
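The atomic-to-composite idea maps naturally onto function composition. Below is a minimal sketch in which each atomic step records its name in a shared context; the step names are hypothetical placeholders for real actions such as OS installation or CMDB registration.

```python
def atomic(name):
    """Create an atomic step that appends its name to the context's history."""
    def step(ctx):
        ctx = dict(ctx)  # copy so callers' dicts are never mutated
        ctx["history"] = ctx.get("history", []) + [name]
        return ctx
    return step


def compose(*steps):
    """Combine steps (atomic or composite) into one workflow run in order."""
    def workflow(ctx):
        for step in steps:
            ctx = step(ctx)
        return ctx
    return workflow


# Hypothetical atomic steps.
install_os = atomic("install_os")
register_cmdb = atomic("register_cmdb")
enable_monitoring = atomic("enable_monitoring")
remove_monitoring = atomic("remove_monitoring")
wipe_disks = atomic("wipe_disks")

# Composite workflows reuse the same atomic steps.
provision = compose(install_os, register_cmdb, enable_monitoring)
decommission = compose(remove_monitoring, wipe_disks)
lifecycle = compose(provision, decommission)
```

Because a composite has the same call shape as an atomic step, composites can themselves be nested, which is what lets a new process be assembled from existing pieces.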

Flyme Operations Cost System

Internal revenue accounting quantifies ROI for each business, guiding budgeting, capacity planning, and resource allocation.
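As a rough illustration of per-business accounting, the sketch below computes ROI as (internal revenue minus resource cost) divided by resource cost. Both the formula and the flat per-server cost are assumptions made for the example; the article does not describe how Meizu actually defines either quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessUsage:
    name: str
    servers: int
    internal_revenue: float  # revenue credited to the business internally


def roi_report(usages, cost_per_server):
    """Return {business: ROI}, or None where the business has no cost.

    Assumed definition: ROI = (internal_revenue - cost) / cost,
    with cost = servers * cost_per_server.
    """
    report = {}
    for usage in usages:
        cost = usage.servers * cost_per_server
        report[usage.name] = (usage.internal_revenue - cost) / cost if cost else None
    return report
```

A negative value marks a business that consumes more resources than it is credited for, which is the kind of signal that can feed budgeting and capacity decisions.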

Future Outlook

Meizu aims to build a refined operations ecosystem that integrates automation, monitoring, process, and security management, while leveraging open platforms and big‑data services to deliver higher‑quality business support.
