Tagged articles
3287 articles
Page 30 of 33
Efficient Ops
Apr 18, 2017 · Operations

Boost Mobile Game Performance: Ops, Download & Real‑Time Network Hacks

This article outlines a comprehensive solution for mobile game operations. It covers the value of modern ops; user‑experience metrics across download, login, gameplay, payment, and sentiment; download‑service optimizations such as domain and resource hijack protection and incremental updates; and real‑time battle network enhancements including access‑network, backbone, and QoS techniques.

Download Optimization · Mobile Gaming · Operations
0 likes · 23 min read
Continuous Delivery 2.0
Apr 16, 2017 · Operations

Baidu's Traditional Application Operations and Branch Management Process

The article explains Baidu's traditional project branch management approach, the reasons behind mainline release queues, and summarizes the team's continuous delivery transformation, highlighting clear goals, transparent planning, self‑defined processes, story‑driven development, six‑step CI, and automated testing practices.

Baidu · Branch Management · Continuous Delivery
0 likes · 6 min read
ITPUB
Apr 15, 2017 · Operations

How to Configure Nginx Load Balancing with Multiple Tomcat Instances on Windows

This step‑by‑step guide shows how to prepare two Tomcat servers, create a simple web project, configure Nginx as a reverse‑proxy load balancer with various strategies, start the services on Windows, and verify that requests are distributed across the Tomcat instances.

Backend · Operations · Tomcat
0 likes · 6 min read
21CTO
Apr 13, 2017 · Operations

Mastering Internet Performance Engineering and Capacity Planning

This article presents a comprehensive methodology for internet performance engineering, covering non‑functional quality goals, detailed metrics for application servers, databases, caches and message queues, a practical technical review outline, and a real‑world capacity‑planning case study with both maximal and minimal resource solutions.

Backend Architecture · Non-functional Requirements · Operations
0 likes · 24 min read
Architecture Digest
Apr 13, 2017 · Operations

Methodology for Internet Architecture Technical Review and Capacity/Performance Evaluation

This article presents a comprehensive methodology for reviewing internet‑scale system architectures, focusing on non‑functional quality attributes such as performance, availability, scalability, security, and maintainability, and provides detailed guidelines, metrics tables, and a classic case study for capacity and performance planning.

Backend · Non-functional Requirements · Operations
0 likes · 27 min read
Efficient Ops
Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Operations · alerting · monitoring
0 likes · 21 min read
ITPUB
Apr 4, 2017 · Operations

Real‑World Ops Pitfalls and Proven Ways to Avoid Them

This article compiles practical experiences from system administrators about common operational pitfalls, their root causes, and concrete mitigation steps, ranging from misconfigured HAProxy timeouts and risky rm commands to Ansible async quirks and cron‑job failures.

Ansible · DevOps · Linux
0 likes · 8 min read
Efficient Ops
Mar 30, 2017 · Operations

Why Ops Engineers Are Always the Scapegoat—and How to Turn That Into Value

The article reflects on the challenges faced by operations engineers in small companies, illustrating why they often become scapegoats, and offers practical advice on learning, risk control, communication, and disaster‑recovery drills to increase their value and effectiveness.

Operations · learning · risk management
0 likes · 18 min read
dbaplus Community
Mar 29, 2017 · Operations

Why Does Server IO Spike at 3 AM? Diagnose RAID Battery and Self‑Test Issues

This guide explains why server IO utilization spikes above 60% during early‑morning hours, covering hardware self‑test, RAID battery failures, cache policy misconfigurations, and step‑by‑step commands for MegaRAID and HP servers, plus BIOS adjustments and best‑practice recommendations to prevent performance degradation.

Hardware · MegaCli · MySQL
0 likes · 16 min read
Alibaba Cloud Developer
Mar 29, 2017 · Operations

How Alibaba Built the ‘Nuclear Weapon’ Full‑Link Stress Test for Double 11

This article chronicles Alibaba's evolution of the full‑link pressure testing platform—from its 2013 inception tackling massive Double 11 traffic, through data construction, isolation, traffic generation, and platform upgrades—to a mature, automated, cloud‑native solution that safeguards large‑scale e‑commerce stability.

Alibaba · Operations · capacity planning
0 likes · 13 min read
Efficient Ops
Mar 28, 2017 · Operations

How We Scaled Server Authentication with OpenLDAP: A Real‑World Operations Journey

This article walks through a vehicle‑networking company's four‑stage journey—selection, requirement analysis, implementation, and evolution—to replace fragmented SSH passwords with a centralized OpenLDAP authentication platform, covering cost decisions, deployment steps, security hardening, and management automation.

Authentication · OpenLDAP · Operations
0 likes · 13 min read
Baidu Intelligent Testing
Mar 27, 2017 · Operations

Gray Release (Canary Deployment) Strategies and Practices

The article explains gray release as a smooth, risk‑mitigating deployment method, outlines why it is needed, describes its limitations, and compares four practical gray‑release solutions—including code‑level flags, pre‑release machines, SET isolation, and dynamic routing—before recommending a combined approach.

Deployment Strategy · Gray Release · Operations
0 likes · 11 min read
DevOps
Mar 26, 2017 · Operations

DevOps Survey Findings: Adoption Rates, Benefits, Challenges, and Tool Usage

Based on a survey of 300 IT professionals, this report reveals growing DevOps adoption, key motivations such as quality and cost reduction, major obstacles like resource shortages, measurable benefits including cost savings and faster releases, preferred tools, error‑handling practices, and future investment plans.

Challenges · DevOps · Operations
0 likes · 11 min read
MaGe Linux Operations
Mar 23, 2017 · Operations

Why Operations Engineering Is the Hottest Career Path

The article reflects on eight years of operations experience, highlights the bright industry outlook, and outlines four key career paths—operations development, platform R&D, database engineering, and management—showing why skilled ops engineers are increasingly in demand.

IT jobs · Operations
0 likes · 5 min read
DevOps
Mar 21, 2017 · Operations

DevOps Evolution: Software Engineering Development, Transformation Pitfalls, Core Practices, and Ecosystem

This article traces the evolution of software engineering tools leading to DevOps, highlights common transformation pitfalls, outlines core DevOps practices such as autonomous small teams, traceable toolchains, real‑time metrics, and describes the surrounding ecosystem, offering practical guidance for organizations adopting DevOps.

Continuous Delivery · DevOps · Operations
0 likes · 19 min read
Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Metrics · Operations · alerting
0 likes · 12 min read
DevOps
Mar 20, 2017 · Operations

What DevOps Really Is (and Isn’t): History, Principles, Tools, and Culture

This article explains the origins and background of DevOps, clarifies common misconceptions about its role and title, outlines its cultural principles, surveys the essential toolchain, and discusses how organizations can adopt DevOps practices beyond just development and operations.

Continuous Delivery · Culture · DevOps
0 likes · 13 min read
360 Zhihui Cloud Developer
Mar 20, 2017 · Operations

How 360’s DoctorStarange Boosts Ops with AI‑Driven Prediction, Correlation, and Resource Optimization

This article explains how 360’s DoctorStarange system combines time‑series forecasting, neural‑network predictions, alarm correlation, and a machine‑health scoring model to reduce false alerts, automate remediation, and maximize resource utilization across thousands of production servers.

ARIMA · Operations · Predictive Monitoring
0 likes · 14 min read
High Availability Architecture
Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix
0 likes · 6 min read
Efficient Ops
Mar 12, 2017 · Operations

How Tencent Saved 8 Million QQ Users by Migrating Legacy Services

This article recounts how Tencent's operations team tackled the urgent migration of aging data‑center infrastructure to preserve service for 8 million legacy QQ users, detailing the challenges, strategic choices, IP‑level network relocation, and the DevOps practices that ensured a successful cut‑over.

Legacy Migration · Operations · Tencent
0 likes · 15 min read
ITPUB
Mar 9, 2017 · Operations

How the Four‑Eyes Principle Saves IT Ops from Costly Mistakes

The article shares frontline IT operations experiences, emphasizing careful command execution, mandatory operation logs, two‑person verification, and backup strategies to prevent disastrous errors, illustrated by real incidents like a massive Deutsche Bank loss caused by a simple input mistake.

IT best practices · Incident Prevention · Operations
0 likes · 4 min read
MaGe Linux Operations
Mar 8, 2017 · Operations

Master Linux ‘top’ Command: Real‑Time Process Monitoring Guide

This article explains how to use the Linux top command for real‑time system and process monitoring, covering its interface, statistical and process sections, interactive shortcuts, command‑line options, and internal commands to customize and sort the displayed information.

Operations · process management · system-monitoring
0 likes · 8 min read
DevOps
Mar 5, 2017 · Operations

Controlling Work‑in‑Progress: Delay Start and Focus on Completion

The article explains how to control work‑in‑progress by postponing new starts and concentrating on finishing existing tasks, emphasizing that WIP should be measured in delivered user value rather than task count, and outlines practical control techniques for lean product development.

Kanban · Lean · Operations
0 likes · 7 min read
Architecture Digest
Mar 3, 2017 · Operations

High-Concurrency Architecture: Strategies, Testing, and Practical Solutions

This article outlines the design and implementation of high‑concurrency systems, covering server architecture, load balancing, database clustering, caching strategies, message‑queue based asynchronous processing, static data handling, and operational best practices such as monitoring, redundancy, and automation.

Caching · Message queue · Operations
0 likes · 18 min read
DevOps
Feb 28, 2017 · Operations

Designing a Team Kanban Wall and System: Step-by-Step Guide

This article walks readers through a three-step process for designing a team’s Kanban wall and system, teaching how to analyze value streams, select appropriate visual elements, and create a customized board that supports efficient workflow management.

Kanban · Operations · Process Design
0 likes · 3 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Incident Response · Operations · capacity planning
0 likes · 18 min read
Efficient Ops
Feb 26, 2017 · Operations

How Alibaba Scales Massive Data Platforms: Lessons in Automated Operations

This article explores the challenges of operating Alibaba's large‑scale data platforms, describes the automation platform built to address them, and shares data‑driven, fine‑grained operational practices that enable stable, efficient, and cost‑effective service delivery.

Big Data · Operations · Platform
0 likes · 22 min read
DevOps
Feb 23, 2017 · Operations

Comparing ITIL and DevOps: Principles, Automation, and Integration Models

The article examines the conflict and convergence between ITIL and DevOps in modern operations, outlining DevOps principles, automation in deployment and operations, and three integration models that balance management and execution, while highlighting the distinct values and scenarios for each approach.

Continuous Delivery · DevOps · ITIL
0 likes · 12 min read
Efficient Ops
Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Gray Release · Mobile · Operations
0 likes · 21 min read
Ctrip Technology
Feb 16, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

The article presents a comprehensive, application‑centric approach to automated capacity management that analyzes why server utilization is low, defines safe usage thresholds, describes a load‑balancer‑driven stress‑testing workflow with regression modeling, and explains how this practice improves resource efficiency, cost savings, and developer‑ops collaboration.

DevOps · Operations · automation
0 likes · 14 min read
Efficient Ops
Feb 15, 2017 · Operations

Mastering the One‑Second Rule: Boost Mobile User Experience

This article explains how mobile network characteristics, the one‑second rule, and targeted optimizations in access scheduling, protocols, and business logic can dramatically improve download success, startup speed, and overall user experience for mobile services.

Mobile · Operations · network
0 likes · 24 min read
Qunar Tech Salon
Feb 14, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

This article explains how to automate application‑centric capacity assessment, identify the safe utilization thresholds, use load‑balancer‑driven stress testing and regression modeling to pinpoint resource bottlenecks, and improve server usage while maintaining service reliability through close DevOps collaboration.

DevOps · Operations · automation
0 likes · 15 min read
Zhuanzhuan QA
Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Operations · Redis
0 likes · 4 min read
Efficient Ops
Feb 9, 2017 · Operations

Automating Application‑Based Capacity Management to Boost Resource Utilization

This article explains how to automate capacity management focused on application performance, identifies common causes of low resource utilization, proposes safe utilization thresholds, describes a testing framework that uses load‑balancer weighting and real‑time monitoring to pinpoint bottlenecks, and outlines how ops and developers can collaborate to improve efficiency.

Operations · automation · capacity management
0 likes · 18 min read
Efficient Ops
Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems
0 likes · 25 min read
21CTO
Feb 2, 2017 · Operations

What GitLab’s 300 GB Data Loss Teaches About Backup and Ops Discipline

The GitLab production database was mistakenly deleted during a manual fix, exposing gaps in backup strategies, PostgreSQL configuration, and operational practices, and prompting a detailed post‑mortem that highlights the need for automated recovery, proper tooling, and transparent incident handling.

Data loss · Database Backup · Incident Response
0 likes · 15 min read
Efficient Ops
Jan 24, 2017 · Databases

Essential DBA Holiday Checklist: Keep Your Databases Safe During Chinese New Year

This guide outlines the critical tasks DBA teams should perform before, during, and after the Chinese New Year holiday, including daily security practices, pre‑holiday inspections, on‑call rotations, post‑holiday reviews, and detailed checklist scripts to ensure database reliability and prevent incidents.

DBA · Databases · Holiday
0 likes · 13 min read
Efficient Ops
Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Operations · alerts · incident-management
0 likes · 5 min read

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Grafana · InfluxDB · Operations
0 likes · 12 min read
MaGe Linux Operations
Jan 8, 2017 · Operations

Master Ansible: From Basics to Advanced Modules for Efficient Operations

This guide introduces Ansible for operations, covering its core features, installation, host preparation, key management, essential modules, playbook structure, YAML syntax, handlers, tags, variables, templates, loops, and conditional execution, with practical command examples and visual illustrations.

Ansible · Configuration Management · DevOps
0 likes · 8 min read
Efficient Ops
Jan 8, 2017 · Operations

Why Global Server Load Balancing (GSLB) Is Hard: Technical Challenges and Solutions

This article explains what GSLB (Global Server Load Balancing) is, why achieving high availability, low latency, and accurate traffic distribution is difficult due to DNS limitations, caching, and routing constraints, and explores architectural and network‑level techniques such as feedback loops, anycast, and BGP routing to mitigate these challenges.

Anycast · DNS · GSLB
0 likes · 16 min read
DevOps
Jan 4, 2017 · Operations

The Third Way of DevOps: Continuous Learning and Docker as Lab Equipment

The article explains the Third Way of DevOps—continuous learning through Kaizen and the PDSA cycle—showing how Docker serves as laboratory equipment that enables rapid, reproducible experiments, illustrated with examples from a financial institution and a personal baseball‑statistics project.

DevOps · Docker · Lean
0 likes · 8 min read
21CTO
Jan 4, 2017 · Operations

How to Build Truly High‑Availability Systems: Principles and Practices

This article explains what high availability means for distributed systems, outlines common availability tiers, and describes how redundancy, load balancing, and automatic failover across a typical Internet architecture can achieve reliable, scalable services.

Distributed Systems · Operations · Reliability
0 likes · 6 min read
DevOps
Jan 3, 2017 · Operations

Applying the DevOps “Second Way” with Docker: Accelerating Feedback Loops

This article explains the DevOps “Second Way,” emphasizing faster, bidirectional feedback loops, and shows how Docker’s immutable containers, streamlined packaging, and embedded metadata reduce variation, accelerate defect detection, and shorten lead times in service delivery.

Continuous Delivery · DevOps · Docker
0 likes · 7 min read
Efficient Ops
Dec 29, 2016 · Operations

Meet the Top Operations Experts of 2016: Profiles and Must‑Read Articles

This article introduces the standout operations professionals featured by the Efficient Ops community in 2016, summarizing each expert’s background, key achievements, and a curated list of their most influential technical articles for readers seeking deep insights into modern ops practices.

Cloud Computing · Operations · automation
0 likes · 12 min read
Efficient Ops
Dec 28, 2016 · Operations

Transforming Financial Application Operations: Lessons from a European Rollout

This article shares a detailed case study of how a financial services team restructured European application operations, applied lean retrospectives, built a top‑down monitoring system, and introduced systematic stakeholder collaboration to dramatically improve incident response, system robustness, and user satisfaction.

DevOps · Incident Management · Operations
0 likes · 14 min read
ITFLY8 Architecture Home
Dec 27, 2016 · Operations

How Dangdang Scaled Its E‑Commerce Platform for 10× Traffic Peaks

This article details Dangdang's 15‑year evolution from a monolithic system to a distributed, SOA‑based architecture, outlining the challenges of high‑traffic e‑commerce events and the strategies—system grading, decoupling, asynchronous processing, batching, and rate limiting—used to achieve reliable, scalable operations.

Operations · SOA · e‑commerce
0 likes · 19 min read
Efficient Ops
Dec 26, 2016 · Operations

How Tencent Scaled Social Data Storage While Cutting Costs

Facing massive user growth, Tencent’s social network team redesigned its KV storage architecture—introducing CKV and Grocery, automating capacity planning, data migration, and backup reuse—to dramatically lower costs, improve operational efficiency, and maintain high service quality across millions of devices.

Operations · Scalability · automation
0 likes · 21 min read
Alibaba Cloud Developer
Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Operations · log analysis
0 likes · 18 min read
Efficient Ops
Dec 21, 2016 · Operations

Measure Your Continuous Delivery Maturity with a 47‑Item Checklist

Learn how to assess your Continuous Delivery maturity using a 47‑item checklist, understand its purpose for aligning goals, improving processes, and boosting value delivery, and calculate your score as a percentage to guide technical and organizational improvements.

Operations · maturity checklist · software delivery
0 likes · 2 min read
Efficient Ops
Dec 19, 2016 · Operations

What 16 Major 2016 Outages Teach Us About Disaster Recovery

This article reviews sixteen notable 2016 service outages across finance, cloud, and entertainment, analyzes their causes—ranging from power failures to DDoS attacks—and highlights the critical need for robust disaster‑recovery and information‑security practices.

Incident Management · Information Security · Operations
0 likes · 11 min read
DevOps
Dec 18, 2016 · Operations

Introduction to DevOps and Docker: Concepts, Components, and Implementation

This article explains the principles of DevOps, its technical, process, and organizational considerations, and introduces Docker as a key tool, detailing its architecture, components, native utilities, suitable scenarios, and how it enables continuous integration, delivery, and efficient operations.

DevOps · Docker · Operations
0 likes · 14 min read
DevOps
Dec 13, 2016 · Operations

DevOps Is Not About Automation Tools, But They Are a Prerequisite

DevOps is a methodology that emphasizes collaboration between development and operations to accelerate software delivery, and while tools alone don’t constitute DevOps, automation and container technologies are essential prerequisites that reduce manual hand‑offs, enable self‑service, and improve feedback loops.

Continuous Delivery · DevOps · Operations
0 likes · 7 min read
DevOps
Dec 11, 2016 · Operations

The Evolution of DevOps: From Agile Foundations to CALMS, Containerization, and Enterprise Best Practices

From its origins at the 2008 Agile conference to the modern CALMS framework, this article traces DevOps’s evolution, compares traditional, DevOps 1.0 and 2.0 approaches, and outlines key Chinese practices such as containers, continuous deployment, micro‑services, and enterprise best‑practice recommendations.

CALMS · Continuous Delivery · DevOps
0 likes · 11 min read
ITFLY8 Architecture Home
Dec 8, 2016 · Operations

How CAT Enables Scalable Real‑Time Monitoring for Distributed Systems

This article introduces CAT, an open‑source Java‑based distributed real‑time monitoring platform, detailing its design goals, architecture, message processing pipeline, logging instrumentation, API, real‑time analysis, report modeling, storage challenges, and key takeaways for building highly available, scalable monitoring solutions.

Distributed Monitoring · Operations · System architecture
0 likes · 13 min read
Alibaba Cloud Developer
Dec 7, 2016 · Operations

How Alibaba Automates Its Network for Double 11 Traffic Surges

This article outlines Alibaba researcher Zhang Ming’s presentation on the network automation system that enables Alibaba’s infrastructure to handle the massive traffic and rapid fault recovery required during the Double 11 shopping festival, highlighting the challenges, detection methods, and automated tools used across routers, switches, and L4‑L7 devices.

Alibaba · Operations · fault detection
0 likes · 3 min read
Efficient Ops
Dec 5, 2016 · Operations

From PHP Monolith to Java Microservices: Mogujie's Ops Evolution and Lessons

This article recounts Mogujie's journey from a small PHP‑based LNMP stack to a Java‑driven micro‑service architecture, detailing the operational challenges, standardization efforts, continuous integration pipeline, and full‑link tracing techniques that enabled scalable, reliable e‑commerce services.

Full‑Link Tracing · Java migration · Operations
0 likes · 17 min read
Efficient Ops
Dec 4, 2016 · Operations

How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Dual-Active · Operations · SRE
0 likes · 27 min read
Qunar Tech Salon
Dec 1, 2016 · Backend Development

How to Prevent Service Failures: Suspect Third‑Party, Guard Users, and Perfect Your Own Service

The article shares practical strategies for preventing service failures by doubting third‑party services, protecting against misuse by consumers, and improving one’s own code and architecture, covering fallback plans, timeout settings, retry policies, API design, traffic control, and resource limits.

API-design · Operations · Reliability
0 likes · 16 min read
Efficient Ops
Nov 27, 2016 · Operations

When Ops Heroes Burn Out: Tackling Personal Heroism in Operations

The article explores personal heroism in operations, defining it as reliance on individual effort to keep flawed systems appearing normal, examines its short‑term benefits and long‑term drawbacks for companies, teams, and the heroes themselves, and offers practical strategies to eliminate this risky mindset.

Incident Management · Operations · SLA
0 likes · 10 min read
dbaplus Community
Nov 23, 2016 · Operations

How to Rapidly Deploy DCOS Services with Ansible and Docker

This guide walks through an automated, fast‑track deployment of DCOS components, covering service selection, Docker‑based containers, host initialization, system checks, Ansible provisioning, Consul service discovery, HAProxy load balancing, MySQL HA, and ZooKeeper/Marathon integration, with concrete commands, configuration snippets, and practical tips.

Ansible · Consul · DCOS
0 likes · 12 min read
Efficient Ops
Nov 21, 2016 · Operations

7 Proven Bandwidth Optimization Strategies to Cut Social Platform Costs by 2 Billion

This article shares seven practical bandwidth‑saving techniques from Tencent: disabling auto‑play, intelligent pre‑push, file compression, on‑demand usage, segmented download, technical breakthroughs, and content compliance. Together they dramatically reduce operational costs while maintaining user experience.

Cost Reduction · Operations · bandwidth optimization
0 likes · 9 min read
dbaplus Community
Nov 20, 2016 · Operations

Top Insights from the 2016 Global Agile Operations Summit

The 2016 Global Agile Operations Summit in Shanghai concluded with a series of expert sessions covering agile DevOps trends, cloud‑native automation platforms, database performance tuning, container orchestration, and real‑world case studies from leading companies, followed by the award ceremony honoring ten MVPs who drove innovation across operations and infrastructure.

Cloud Computing · Container · Database
0 likes · 15 min read
Qunar Tech Salon
Nov 18, 2016 · Operations

Design and Implementation of Ctrip's Predictive Outbound Call Platform

This article describes Ctrip's large‑scale predictive outbound call platform, covering its underlying algorithms, SoftPBX integration, system architecture, concurrency enhancements, deployment experience, and measurable improvements in call success rates and agent efficiency.

Operations · call center · outbound algorithm
0 likes · 8 min read
Efficient Ops
Nov 14, 2016 · Operations

What Ancient Medicine Teaches About Modern IT Risk Management

Using the classic tale of Bian Que, this article explains how proactive, mid‑stage, and reactive risk controls in IT operations prevent small issues from becoming catastrophic failures, illustrated with real‑world storage, cloud, and equipment‑selection case studies.

IT infrastructure · Operations · preventive control
0 likes · 7 min read
StarRing Big Data Open Lab
Nov 14, 2016 · Operations

Master Real-Time Hadoop Alerts with Transwarp Manager

Deploying the Transwarp Manager alert system within Hadoop clusters lets operators monitor resource shortages, failures, and health issues in real time. It offers alert browsing, configurable thresholds, and instant email or script notifications, so problems can be identified and resolved before they impact services.

Alert Monitoring · Hadoop · Operations
0 likes · 9 min read
Architecture Digest
Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · Distributed Systems · Operations
0 likes · 9 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Cloud Computing · Operations · SLA
0 likes · 11 min read
Node Underground
Nov 9, 2016 · Operations

4 Common Node.js Ops Issues and How to Fix Them

This article outlines four frequent Node.js operational problems—memory leaks, CPU bottlenecks, back‑pressure, and security risks—and provides practical solutions such as heap‑dump analysis, CPU profiling, APM monitoring, and using private npm registries with tools like Snyk to secure dependencies.

Memory Leak · Node.js · Operations
0 likes · 4 min read
ITPUB
Nov 9, 2016 · Operations

Diagnosing and Resolving High CPU Usage in a Linux Gateway Process

This article walks through a real‑world remote debugging session where a high‑CPU issue in a gateway service was reproduced, analyzed with top, gstack, gcore, strace and gdb, and traced to a buffer overflow causing an infinite loop, then fixed.

CPU · Operations · gdb
0 likes · 7 min read
Efficient Ops
Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

On-Call · Operations · Reverse Engineering
0 likes · 17 min read
ITPUB
Nov 2, 2016 · Operations

Monitor Linux System Resources with Simple Shell Scripts

This guide shows how to write Bash functions that retrieve process IDs, CPU, memory, file‑descriptor usage, port status, system load and disk space on a Linux server, and how to combine them with conditional checks to generate alerts when thresholds are exceeded.

Linux · Operations · script
0 likes · 16 min read

JEN: JD Extended Nginx Platform for Scalable Management and Automation

The article introduces JEN, JD's extended Nginx platform that centralizes configuration, monitoring, traffic splitting, rate limiting and automated operations through a web console and Ansible integration, addressing the complexity, restart requirements, and scaling challenges of large‑scale Nginx deployments.

Configuration Management · Operations · Rate Limiting
0 likes · 14 min read
ITFLY8 Architecture Home
Oct 31, 2016 · Cloud Computing

How Taobao Scaled from LAMP to Cloud: Lessons in Cloud Migration Architecture

This article examines the evolution of Taobao's technical architecture—from a LAMP stack through Oracle‑based mainframes to a cloud‑native platform—highlighting the performance, scalability, and cost challenges of traditional IT and offering best‑practice strategies for migrating enterprise systems to the cloud.

Big Data · Cloud Computing · Databases
0 likes · 15 min read
Efficient Ops
Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Databases · Operations · Troubleshooting
0 likes · 8 min read