How to Seamlessly Take Over New Ops Responsibilities: A Practical Checklist
This article condenses practical experience from taking over new operations work into a step-by-step checklist. It covers setting expectations with development leads, learning the business, asset inventory, basic and business-specific monitoring, standardization and automation, SOPs, failure drills, cost and capacity planning, and cross-team communication.
Set Expectations Up Front
Talk with the development lead early to set expectations. Ops is there to keep the service safe, stable, low-cost, and able to iterate quickly, not to babysit it. Development teams manage their own environments; ops acts as a professional consultant and does not directly control development changes.
Understand the Business Overview
Identify the key contacts (developers, testers, product managers) and record their contact details. Learn what the service does, what problem it solves, whether open-source equivalents exist, its upstream and downstream dependencies, where it is deployed, its technology stack, how reliable its network is, what infrastructure issues it has had, and what its current pain points are.
Business Walkthrough
Ask the developers (or the previous ops engineer) to prepare a slide deck covering deployment topology, overall architecture, data flow, the testing and change process, monitoring methods, machine locations, login methods, module details, OS tuning, third-party software choices, relevant wikis, failure-handling plans, common issues, and problems currently open in production.
Asset Inventory
Catalog domains, virtual IPs, associated services, machines, modules, data‑center locations, bandwidth usage, and shared resources. Gather detailed machine information (configuration, rack position, IPs, management cards). If a CMDB does not exist, build one. Consider backup machines and hardware uniformity.
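If no CMDB exists yet, even a small structured inventory is a useful starting point. The sketch below is one possible shape for a machine record, with two queries the paragraph above implies: which machines break hardware uniformity, and which machines host a given service. The field names and helper functions are illustrative assumptions, not a standard CMDB schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Machine:
    """One inventory record. Field names are illustrative assumptions."""
    hostname: str
    ip: str
    datacenter: str
    rack: str
    mgmt_ip: str  # address of the management card
    cpu_cores: int
    memory_gb: int
    services: List[str] = field(default_factory=list)


def find_nonuniform(machines: List[Machine], reference: Tuple[int, int]) -> List[str]:
    """Return hostnames whose (cpu_cores, memory_gb) differ from the reference."""
    return [m.hostname for m in machines
            if (m.cpu_cores, m.memory_gb) != reference]


def machines_by_service(machines: List[Machine]) -> Dict[str, List[str]]:
    """Invert the inventory: service name -> sorted list of hostnames."""
    index: Dict[str, List[str]] = {}
    for m in machines:
        for svc in m.services:
            index.setdefault(svc, []).append(m.hostname)
    return {svc: sorted(hosts) for svc, hosts in index.items()}
```

In practice these records would live in a database or a dedicated CMDB tool; the point is only that every machine has one authoritative, queryable record.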
Basic Monitoring
Implement monitoring for domain connectivity and latency, virtual IP health, machine uptime, hardware status, critical system processes (sshd, crond), total process count, and system parameter configurations. Refer to the article “What Comprehensive Monitoring Should Cover”.
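Two of the checks above can be sketched in a few lines: TCP connectivity and connect latency for a domain or virtual IP, and the presence of critical system processes. This is a minimal sketch using only the standard library. The function names are made up for illustration, and how the list of running process names is collected (for example from a monitoring agent) is left as an assumption.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return None


def missing_processes(running_names, required=("sshd", "crond")):
    """Return the required process names that are absent from the running set."""
    running = set(running_names)
    return [name for name in required if name not in running]
```

A real monitoring system would run such checks on a schedule from several vantage points and alert on both failures and latency regressions.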
Service Inventory
From architecture and data‑flow diagrams, understand each module’s deployment details: host machines, directories, launch accounts, log locations, programming language, deployment method, resource consumption (CPU, memory, disk, I/O), threshold settings, watchdog requirements, log keyword alerts, and other operational considerations.
Business‑Specific Monitoring
Beyond generic metrics, monitor process/port health, resource utilization, log keyword alerts, log rotation, and service‑specific indicators. Plan for API‑level monitoring to drive business optimization.
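Log keyword alerting can be as simple as counting the lines that contain each keyword and alerting once a count crosses a threshold. The sketch below assumes plain substring matching and a per-scan threshold; the keywords and threshold are placeholders to be chosen per service.

```python
from collections import Counter


def count_keyword_lines(lines, keywords):
    """Count, for each keyword, how many lines contain it (substring match)."""
    counts = Counter()
    for line in lines:
        for kw in keywords:
            if kw in line:
                counts[kw] += 1
    return counts


def keywords_over_threshold(counts, threshold):
    """Return the keywords whose line count reached the threshold, sorted by name."""
    return sorted(kw for kw, n in counts.items() if n >= threshold)
```

Scanning only the lines written since the previous run (for example by remembering the file offset) keeps the check cheap and avoids alerting on the same lines twice.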
Standardization and Automation
Unify machine naming, OS distribution, OS versions, and third‑party software (JDK, Tomcat, Nginx). Implement one‑click deployment for scaling, changes, and decommissioning, allowing developers to trigger releases while controlling permissions. Script repetitive tasks, create self‑healing scripts triggered by alerts, and build foundational infrastructure (service discovery, MQ, logging platforms) if absent.
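A self-healing script triggered by alerts needs a guard so that it does not restart a broken service forever. The sketch below is one hypothetical design: handlers are registered per alert name, and once a handler has run too many times for the same host within a time window, the alert is escalated to a human instead. The class name, return values, and defaults are assumptions made for illustration.

```python
import time


class SelfHealer:
    """Maps alert names to remediation handlers, with a per-host rate limit."""

    def __init__(self, max_runs=3, window_s=3600.0, clock=time.monotonic):
        self.max_runs = max_runs
        self.window_s = window_s
        self.clock = clock
        self.handlers = {}
        self.history = {}  # (alert_name, host) -> timestamps of recent runs

    def register(self, alert_name, handler):
        """Associate a callable taking a hostname with an alert name."""
        self.handlers[alert_name] = handler

    def handle(self, alert_name, host):
        """Return 'unhandled', 'escalate', or 'ran'."""
        handler = self.handlers.get(alert_name)
        if handler is None:
            return "unhandled"
        now = self.clock()
        key = (alert_name, host)
        # Keep only the runs that are still inside the rate-limit window.
        recent = [t for t in self.history.get(key, []) if now - t < self.window_s]
        if len(recent) >= self.max_runs:
            self.history[key] = recent
            return "escalate"  # repeated failures need a person, not another restart
        recent.append(now)
        self.history[key] = recent
        handler(host)
        return "ran"
```

A handler would typically wrap an existing, already tested remediation script, so the automated path and the manual path stay identical.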
SOP Development
Write contingency plans (fault pre-plans) that list the failures likely to occur and the step-by-step remediation for each, so that incidents can be handled calmly and accurately.
Failure Drills
Validate the contingency plans through controlled failure simulations, such as taking a module or a machine offline. Some drills, such as large-scale network failures, may be impractical, but regular drills of what can be simulated improve stability.
Advanced Business Monitoring
Track service‑specific metrics such as MQ message backlog, RPC interface latency and success rates, and S3 bucket bandwidth usage.
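For a metric such as MQ backlog, the absolute value matters less than whether it keeps growing. A minimal sketch of that rule, assuming backlog samples are collected at a fixed interval (the function name and defaults are illustrative):

```python
def backlog_growing(samples, min_backlog, consecutive=3):
    """True if the last `consecutive` samples all exceed min_backlog
    and are strictly increasing, i.e. consumers are falling behind."""
    if len(samples) < consecutive:
        return False
    tail = samples[-consecutive:]
    above = all(v > min_backlog for v in tail)
    rising = all(a < b for a, b in zip(tail, tail[1:]))
    return above and rising
```

Requiring several consecutive rising samples filters out short bursts that consumers drain on their own.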
API Success Rate and Latency Statistics
Collect API success rates and latency at the Nginx entry point, identify top‑N low‑success or high‑latency endpoints, and drive optimization efforts.
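The statistics described above can be computed directly from the Nginx access log. The sketch below assumes a log format equivalent to Nginx's "combined" format with `$request_time` appended as the last field, which is a common customization but not the default. It also treats only 5xx responses as failures; whether 4xx responses should count is a per-service decision.

```python
import re
from collections import defaultdict

# Assumed format: ... "METHOD /path HTTP/1.1" STATUS BYTES "referer" "agent" REQUEST_TIME
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) .* (?P<rt>\d+\.\d+)$'
)


def endpoint_stats(lines):
    """Aggregate request count, success rate, and average latency per path."""
    acc = defaultdict(lambda: [0, 0, 0.0])  # total, failures, summed latency
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed format
        path = m.group("path").split("?", 1)[0]  # drop the query string
        entry = acc[path]
        entry[0] += 1
        if int(m.group("status")) >= 500:
            entry[1] += 1
        entry[2] += float(m.group("rt"))
    return {
        path: {
            "requests": total,
            "success_rate": 1 - failed / total,
            "avg_latency_s": latency / total,
        }
        for path, (total, failed, latency) in acc.items()
    }


def top_n(stats, key, n=5, reverse=False):
    """Return the n paths with the lowest (or, with reverse=True, highest) value of key."""
    return sorted(stats, key=lambda p: stats[p][key], reverse=reverse)[:n]
```

For example, `top_n(stats, "success_rate")` lists the endpoints with the lowest success rate and `top_n(stats, "avg_latency_s", reverse=True)` lists the slowest ones. Averages hide tail latency, so a production version would likely track percentiles as well.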
Online Issue Triage
Catalog all online issues, resolve those within ops scope, hand off others to development with scheduled fixes, and report weekly progress and pending items.
Cost Optimization
Consolidate services onto fewer machines and use a unified resource scheduling platform to raise utilization. Each machine freed this way can save tens of thousands in hardware cost.
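To get a rough idea of how many machines consolidation could free, service resource demands can be packed onto machines with a simple heuristic. The sketch below uses first-fit decreasing bin packing on a single resource (CPU cores here). Real schedulers weigh several resources plus isolation and failure-domain constraints, so treat this as an estimate only; the function name and inputs are illustrative.

```python
def first_fit_decreasing(demands, capacity):
    """Pack services onto machines of equal capacity.

    demands: dict of service name -> required units (e.g. CPU cores).
    Returns a list of machines, each a list of the service names placed on it.
    """
    bins = []  # each entry: [remaining capacity, [service names]]
    # Place the largest services first; each goes on the first machine with room.
    for name, need in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if need > capacity:
            raise ValueError(f"{name} needs {need}, more than one machine provides")
        for b in bins:
            if b[0] >= need:
                b[0] -= need
                b[1].append(name)
                break
        else:
            bins.append([capacity - need, [name]])
    return [names for _, names in bins]
```

Comparing `len(first_fit_decreasing(...))` with the current machine count gives a first estimate of the possible savings.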
Capacity Planning
Plan for bandwidth, dedicated lines, network equipment, and machines based on growth trends and operational needs, reallocating idle resources to other services to maximize utilization.
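Planning from growth trends can start with a straight-line fit over recent usage. The sketch below fits a least-squares line to evenly spaced samples (for example, monthly peak bandwidth) and estimates how many periods remain before the trend reaches capacity. Linear growth is an assumption; seasonal or step-like growth needs a different model.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_full(ys, capacity):
    """Periods after the last sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking, 0.0 when already at capacity.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0.0, x_full - (len(ys) - 1))
```

With monthly peaks of 100, 120, 140, and 160 units against a capacity of 200, the fitted slope is 20 per month and the estimate is two months of headroom, which should be compared with procurement lead time.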
Effective Communication
Maintain detailed meeting minutes, email summaries, and clear ownership of action items. Follow up on checkpoints, document delays, and involve senior leadership when communication breaks down.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.