How SF Express Reimagined IT Operations: From Silos to DevOps Automation
This article chronicles how SF Express transformed its IT operations department from a fragmented, siloed structure into a streamlined, DevOps-driven organization. It covers the burden of excessive process, the five “sins” of traditional operations, the initiatives the team launched, and the lessons learned on the way to efficient, automated, self-service infrastructure management.
1. Operations Enclosure
1.1 Walls and Locks of the Enclosure
SF Express established its technical operations department in 2007, and by 2016 the team had grown to nearly 200 people. To build specialized technical capability, a three-year effort starting in 2013 stabilized the organization and its functions:
Each professional domain—network, storage, servers, operating system, databases, middleware—is owned by a dedicated line team responsible for planning, design, construction, implementation, and daily operation.
Infrastructure architects coordinate external delivery through a workflow‑driven ticket system.
The operations planning team defines policies, processes, and quality standards to compensate for management weaknesses.
The whole team follows the ITIL framework as a guiding principle.
By 2015, the foundational software stack had moved entirely to open-source software.
Through specialized division, many experts were cultivated and standards for infrastructure, equipment, software, and architecture were established, improving resource efficiency and system stability.
After three years of governance, the structure became stable, but new problems emerged:
Responsibility‑driven KPIs cause teams to shift blame and hinder seamless collaboration, especially in vertically siloed groups.
Strict security segmentation forces lower‑level teams to wait for authorizations, slowing routine work.
Specialized division leads to fragmented skill sets; engineers often need multiple specialists to solve a single issue.
Teams hit a growth ceiling and cannot perceive broader challenges, in two respects:
Vision ceiling: each team receives filtered information, limiting analysis.
Capability fragmentation: no team has full‑stack operations ability or a holistic view.
1.2 Storm Outside the Enclosure
While the operations team works at its own pace inside the enclosure, the business environment outside is changing rapidly:
Business side:
Peak traffic grows year over year, especially during the Double Eleven (November 11) shopping festival.
Service models shift from internal enterprise users to direct consumer (C-end) and business (B-end) customers.
Frequent business adjustments translate into more frequent version releases and changes for operations.
Technical side:
Cloud maturity reduces the need for self-built operations teams, shrinking the job market for operations engineers.
Rapid open‑source evolution makes many traditional commercial technologies obsolete, threatening engineers who do not upskill.
DevOps opens a more efficient path but demands that operations staff acquire development capabilities.
2. Operations Judgment Day
We divided IT operations work into four quadrants (illustrated below). Ideally, resources should focus on the right-hand quadrants that deliver value, yet about 70% of effort is consumed by routine tasks in the left-hand quadrants.
From this analysis we identified the “Five Sins of Operations”:
2.1 Cumbersome Proficiency
After three years of specialization, engineers became extremely fast at routine tasks but lost the habit of independent thinking.
2.2 Dimensionality‑Reduced Efficiency
Simple tasks that could take minutes are delayed for days due to multi‑layer approvals, especially in siloed teams.
2.3 The Black Hole of Introspection
Operations teams sit at the end of the value chain, lacking front‑line insight, which breeds negative self‑perception.
2.4 Self‑Imposed Chains
Over‑execution of KPIs, standards, processes, and budgets initially brings order but later becomes restrictive shackles, stifling creativity and lagging behind change.
2.5 Automation Shortcomings
Automation efforts focused on internal execution rather than user‑centric delivery, resulting in limited impact on overall workflow efficiency.
3. The Dream of Operations
Reflection revealed four aspirations:
Information must flow and be shared, with appropriate security filtering, so both operators and users operate on the same data plane.
Delivery should be end‑to‑end automated and self‑service, with compliance enforced by embedded rule engines.
Key technical capabilities should be offered as services, decoupling inter‑department dependencies.
Routine events and anomalies should become self‑reactive and self‑healing, reducing cost and workload while improving availability.
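The second aspiration, compliance enforced by embedded rule engines rather than by manual approval chains, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Request` fields, the two rules, and the 16-core quota are invented here and do not describe SF Express's actual controls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Request:
    """A hypothetical self-service request for a virtual machine."""
    requester: str
    environment: str  # e.g. "test" or "production"
    cpu_cores: int
    approved_by: Optional[str] = None


@dataclass(frozen=True)
class Rule:
    """One compliance control expressed as a named predicate."""
    name: str
    check: Callable[[Request], bool]


# Illustrative rules only; the thresholds are invented for this sketch.
RULES: List[Rule] = [
    Rule("cpu quota (max 16 cores)", lambda r: r.cpu_cores <= 16),
    Rule(
        "production needs an approver",
        lambda r: r.environment != "production" or r.approved_by is not None,
    ),
]


def evaluate(request: Request, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules the request violates."""
    return [rule.name for rule in rules if not rule.check(request)]


def submit(request: Request) -> str:
    """Provision automatically when compliant, otherwise reject with reasons."""
    violations = evaluate(request)
    if violations:
        return "rejected: " + "; ".join(violations)
    return "provisioned"
```

Under these assumptions, `submit(Request("alice", "test", 4))` is provisioned without waiting on anyone, while an oversized production request with no approver is rejected immediately with both violated rules listed. The point of the design is that the control runs inside the delivery path instead of in a separate approval queue.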
4. Planning
We identified six breakthrough directions: redefining technical skill requirements, providing self-service portals, forming full-stack operations teams, abstracting ITIL controls into rule logic, building user-centric automation, and making storage programmable and x86-based.
Five concrete tasks were launched:
“FengBox”: visual container self‑service platform.
“FengYun”: visual KVM cluster self‑service platform.
“WeiShi”: the brain of SF’s automated operations, delivering information flow and rule application.
“ThinkDB”: high-availability x86-based database pool, reducing SAN dependency.
“OSS”: programmable object storage system, replacing NAS.
The “WeiShi” project planned to deliver self‑service by early April 2017 and enter optimization by July.
5. Hitting Walls
When the newly formed development team (mostly Java engineers without operations experience) started work, it ran into two months of seemingly endless problems:
Overwhelming demand from all specialty groups.
Misaligned understanding between operations and development.
Inappropriate use of product and agile methods.
Role confusion and over‑effort.
Discrepancy between meeting discussions and actual outcomes.
Result: team fatigue, attrition, and stalled progress.
6. Facing the Wall
Key personnel reflected and established five rules:
Standardize demand, prioritize value.
Bring demand owners and developers together.
Focus on essence, avoid trendy concepts.
Align words and actions, ensure consistent communication.
Clarify and respect responsibilities.
7. Breaking the Wall
The “WeiShi” design emerged from this reflection. It builds on the programmable interfaces of KVM, Docker, OSS, and ThinkDB, encapsulates hardware as atomic services, and provides orchestration, task scheduling, authentication, and self-service modules.
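One way to picture atomic services with orchestration and authentication on top is the small Python sketch below. It is not WeiShi's actual code or API: the `Orchestrator` class, the service names `kvm.create_vm` and `oss.create_bucket`, and the stand-in functions are all hypothetical, and a real platform would call the underlying KVM, Docker, OSS, and ThinkDB interfaces instead of returning dictionaries.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# An atomic service takes keyword parameters and returns a result dict.
AtomicService = Callable[..., Dict[str, Any]]
Workflow = List[Tuple[str, Dict[str, Any]]]


class Orchestrator:
    """Hypothetical registry that chains atomic services into workflows."""

    def __init__(self) -> None:
        self._services: Dict[str, AtomicService] = {}
        self._grants: Dict[str, Set[str]] = {}

    def register(self, name: str, service: AtomicService) -> None:
        """Expose one capability under a stable service name."""
        self._services[name] = service

    def grant(self, user: str, name: str) -> None:
        """Allow a user to call a named service."""
        self._grants.setdefault(user, set()).add(name)

    def run(self, user: str, workflow: Workflow) -> List[Dict[str, Any]]:
        """Check every step first, then execute the steps in order."""
        allowed = self._grants.get(user, set())
        for name, _ in workflow:
            if name not in self._services:
                raise KeyError(f"unknown service: {name}")
            if name not in allowed:
                raise PermissionError(f"{user} may not call {name}")
        return [self._services[name](**params) for name, params in workflow]


# Stand-ins for real infrastructure calls (assumptions for this sketch).
def create_vm(cores: int) -> Dict[str, Any]:
    return {"service": "kvm.create_vm", "cores": cores}


def create_bucket(name: str) -> Dict[str, Any]:
    return {"service": "oss.create_bucket", "bucket": name}
```

A self-service portal would then submit a workflow such as `[("kvm.create_vm", {"cores": 4}), ("oss.create_bucket", {"name": "logs"})]` on the user's behalf. Validating all steps before running any of them keeps a request from failing halfway on a permission error.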
8. Present and Future
Current achievements include:
More effective demand discovery and management.
Improved code quality.
Automated testing framework.
Operations‑friendly technology stack.
Streamlined task organization.
We anticipate reaching partial self‑adaptive and self‑healing operations within a year.
9. The Freedom of Operations
Ultimately, operations staff should enjoy freedom—free from constant fear of missed deadlines or system failures—so they can work elegantly and efficiently.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.