Tagged articles
MaGe Linux Operations
Aug 8, 2017 · Operations

Essential Automation Ops Resources: Books, Tools, and News Sources

This guide highlights the urgent need for automation in modern operations and curates essential books, documentation, and information sources covering Puppet, Nagios, Zabbix, Linux scripting, high‑availability servers, and Python‑based automation to help seasoned engineers and newcomers alike.

Books · Monitoring · tools
11 min read
High Availability Architecture
Aug 8, 2017 · Big Data

Practical Big Data Architecture Evolution and Lessons Learned

The article reviews the evolution of big‑data architectures from a simple RDB‑centric pipeline to a SaaS‑based solution, highlighting common bottlenecks such as scaling, integration, cost, and operational complexity, and shares practical experiences and best‑practice recommendations for building efficient, maintainable data platforms.

Big Data · Monitoring · SaaS
12 min read
Architecture Digest
Aug 7, 2017 · Operations

Website Availability and High‑Availability Architecture Overview

This article explains website availability metrics, fault‑weight scoring, layered high‑availability architecture, session management strategies, reusable service design, data redundancy, quality assurance processes, and monitoring practices essential for maintaining reliable large‑scale web systems.

Availability · Monitoring · Operations
9 min read
ITFLY8 Architecture Home
Aug 6, 2017 · Backend Development

How Meizu Scales Real‑Time Push to 600 M Messages/min: Architecture, Pitfalls & Solutions

The article details Meizu's massive real‑time push system handling 25 million online users and 600 million messages per minute, explains its four‑layer architecture, and shares how the team tackled phone power consumption, mobile network instability, massive connections, monitoring, and gray‑release deployment.

Distributed Systems · Gray Release · Mobile Optimization
13 min read
Efficient Ops
Aug 4, 2017 · Operations

How Tencent’s ZhiYun Platform Powered the “Military Photo” Campaign with 4,000 Servers

This article details how Tencent's SNG operations team leveraged the ZhiYun intelligent operations platform—through standardized processes, massive IaaS provisioning, CMDB management, automated workflows, and real‑time capacity monitoring—to support the high‑traffic “Military Photo” H5 campaign, scaling up to 4,000 servers and 24 GB bandwidth.

CMDB · Cloud Computing · IaaS
10 min read
Efficient Ops
Aug 2, 2017 · Operations

Essential Ops Playbook: 6 Key Practices to Prevent Disasters

Drawing from a year‑and‑a‑half of ops experience, this guide outlines six practical categories—online operation standards, data handling, security, daily monitoring, performance tuning, and mindset—to help engineers avoid costly mistakes and maintain stable, secure systems.

Monitoring · Operations · Performance tuning
12 min read
ITPUB
Jul 17, 2017 · Operations

Essential Linux Ops Tools Every Sysadmin Should Master

This guide outlines the core Linux system fundamentals, networking services, scripting languages, text‑processing utilities, database handling, firewall configuration, monitoring solutions, clustering, and backup techniques that form the essential toolkit for aspiring Linux operations engineers.

Linux · Monitoring · Operations
7 min read
MaGe Linux Operations
Jul 15, 2017 · Fundamentals

Master Python File Operations and System Automation with Practical Code Examples

This article presents a comprehensive collection of Python tutorials and scripts covering file I/O modes, directory traversal, log analysis, simple games, command‑line argument handling, process monitoring, port checking, authentication loops, and SNMP‑based CPU and network traffic monitoring, providing a solid foundation for automation and operations tasks.

Monitoring · file-io · sysadmin
15 min read
360 Zhihui Cloud Developer
Jul 13, 2017 · Cloud Computing

Inside 360’s Ultron: How OpenStack Powers a Scalable Private Cloud

This article details the evolution, architecture, deployment, monitoring, and performance optimization of Ultron—360’s internal OpenStack‑based virtualization platform—covering its three development stages, technical stack, automation with Ansible, advanced features like VXLAN and Ceph, and lessons learned from large‑scale operations.

Ceph · DPDK · Monitoring
19 min read
DevOps
Jul 12, 2017 · Cloud Native

Container Monitoring: Challenges, Metrics Collection, and Best Practices

This article examines the unique challenges of monitoring containers, outlines three categories of metrics to collect, compares host‑centric and layered monitoring architectures, provides detailed methods for gathering CPU, memory, I/O and network data via cgroup files and Docker commands, and shares practical insights, tooling recommendations, and a Q&A session for effective container observability.

Docker · Monitoring · Prometheus
18 min read
MaGe Linux Operations
Jul 9, 2017 · Operations

Mastering Game Operations: From Legacy Servers to Modern Cloud Strategies

An in‑depth look at the evolution of game operations—from early PC and web games to today’s mobile and cloud‑based titles—covering architecture, Tcaplus storage, CMDB building, automated deployment, performance monitoring, data warehousing, and the essential skills and challenges faced by game ops engineers.

CMDB · Monitoring · game operations
27 min read
Efficient Ops
Jul 6, 2017 · Operations

36 Ops Strategies: Permissions, Documentation, and Capacity Management

The article shares practical operations lessons—from periodic permission audits and thorough documentation to capacity monitoring, log rotation, and automation—illustrating how systematic practices and tooling can standardize and streamline IT infrastructure management.

Documentation · IT Management · Monitoring
8 min read
21CTO
Jul 6, 2017 · Big Data

How HBase Boosted Tencent Monitoring Platform Performance 3‑5×

Facing the challenge of storing over 120 billion daily monitoring points from hundreds of thousands of servers, Tencent’s monitoring platform migrated from a custom solution and OpenTSDB to a finely tuned HBase architecture, achieving 3‑5× higher throughput, improved reliability, and significant storage savings.

DistributedStorage · HBase · Monitoring
11 min read
Qunar Tech Salon
Jul 4, 2017 · Big Data

Design and Evolution of Airbnb's Log Data Storage and Query Platform

The article describes how Airbnb's data infrastructure team built a next‑generation log storage and query platform to improve data quality, timeliness, flexibility, and anomaly detection, outlining the system architecture, key requirements, five improvement areas, and the resulting benefits.

Airbnb · Monitoring · data pipeline
7 min read
Suning Technology
Jul 3, 2017 · Operations

Inside Suning’s Intelligent Ops Forum: How Tech Leaders Automate and AI‑Boost Operations

The Suning Cloud Commerce IT headquarters hosted a comprehensive Intelligent Operations forum featuring experts from Alibaba, Weibo, Meituan, 360, Meizu and PPD, who shared practical insights on automation, platformization, AI‑driven big‑data analytics, network automation, security, and monitoring across modern IT operations.

Intelligent Operations · Monitoring
8 min read
Efficient Ops
Jun 11, 2017 · Operations

How Bilibili Scaled Its Ops: From DIY Deployments to Prometheus Monitoring

From early manual deployments to a sophisticated, multi-layered monitoring stack—including ELK, Zabbix, Statsd, Grafana, and Prometheus—Bilibili’s ops team shares the evolution, challenges, and lessons learned in building scalable, automated infrastructure for massive internet traffic.

DevOps · ELK · Grafana
8 min read
ITPUB
Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional ops from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Metrics · Monitoring · Operations
10 min read
Baidu Waimai Technology Team
Jun 6, 2017 · Backend Development

Design and Optimization of Baidu Waimai Activity Module Architecture

This article presents a comprehensive redesign of Baidu Waimai’s client‑side activity module, detailing background challenges, design goals, functional and performance specifications, trade‑off analyses of three architectural alternatives, and the chosen parallel HTTP‑request solution with monitoring, degradation, and phased rollout plans.

Monitoring · Performance Optimization · Redis
8 min read
ITPUB
May 31, 2017 · Operations

Automate Bulk Host Addition for Cacti and Nagios with Simple Scripts

The article explains how to automate the tedious process of adding multiple hosts to Cacti and Nagios by using shell‑wrapped PHP scripts and custom templates, provides download links, and shares practical tips to avoid common installation pitfalls.

Batch · Cacti · Monitoring
5 min read
Qunar Tech Salon
May 19, 2017 · Mobile Development

Zero‑Instrumentation Interaction and Performance Monitoring for Large‑Scale Mobile Apps

The article presents a comprehensive approach to solving crash and performance issues in large‑scale mobile applications by reconstructing user interaction traces through a no‑track analytics platform, compile‑time AOP instrumentation, and unified data aggregation, ultimately improving debugging efficiency and reducing operational overhead.

AOP · Analytics · Monitoring
9 min read
ITPUB
May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Incident Management · Monitoring · Operations
18 min read
Qunar Tech Salon
May 11, 2017 · Operations

Designing Performance Test Scenarios: Models, Metrics, and Strategies

This article explains how to design performance testing scenarios, covering test models, metrics, script preparation, concurrency calculations, pressure strategies, run times, delay settings, user termination, monitoring methods, and various typical scenario types such as baseline, load, mixed, capacity, large‑concurrency, stability and scalability tests.

Monitoring · TPS · concurrency
24 min read
MaGe Linux Operations
May 10, 2017 · Operations

Step‑by‑Step: Monitor Nginx and PHP‑FPM Status with Zabbix

This guide walks through configuring Zabbix to monitor Nginx and PHP‑FPM status, covering software installation paths, enabling status modules, creating extraction scripts, setting up Zabbix agent userparameters, restarting services, testing data retrieval, and adding server‑side templates for items, triggers, and graphs.

Linux · Monitoring · Scripting
9 min read
Efficient Ops
May 9, 2017 · Backend Development

How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons

This article details how Tencent analyzed its AMS system, estimated the expected traffic, and redesigned the system for high availability during the QQ Spring Festival Red Packet event, covering architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.

Monitoring · Scaling · backend
25 min read
DevOps
May 9, 2017 · Operations

A Clear and Concise DevOps Implementation Framework: 11 Core Service Capabilities

This article introduces a straightforward DevOps implementation framework that maps eleven essential service capabilities across the software development lifecycle, explains why adopting DevOps is a multi‑year journey, and uses a fitness analogy to illustrate how enterprises can progressively build these capabilities.

Continuous Delivery · DevOps · Monitoring
4 min read
Efficient Ops
May 3, 2017 · Operations

How Tencent Scales NBA Live Streams to Millions: Behind the Tech and Operations

This article details Tencent's large‑scale live streaming architecture for NBA games, covering the rapid growth of live video, key technical features, network transmission challenges, multi‑angle production, CDN deployment, monitoring, big‑data processing, and strategies for ensuring low latency and high reliability for millions of concurrent viewers.

Big Data · CDN · Monitoring
25 min read
DevOps
Apr 25, 2017 · Operations

Analyzing and Visualizing Docker Logs with the ELK Stack (Part Two)

This article explains how to analyze and visualize Docker container logs using the ELK stack, covering preparation, parsing tips, Kibana query techniques, and example visualizations to help monitor Dockerized environments effectively in production.

Docker · ELK · Kibana
7 min read
MaGe Linux Operations
Apr 17, 2017 · Operations

Essential Linux & Server Commands: From Log Cleanup to RAID and Monitoring

This guide presents practical Linux and server administration commands, covering log cleanup, nginx IP analysis, tcpdump capture, Python date formatting and string reversal, subprocess execution, multiprocessing, iptables port forwarding, cron scheduling, file relocation, RAID concepts, Oracle backup strategies, port checking, Apache MPM modes, and monitoring tool comparisons.

Database · Linux · Monitoring
10 min read
Efficient Ops
Apr 16, 2017 · Operations

How China Life Built a Self‑Developed Automated Ops Platform from Scratch

China Life’s Shanghai Data Center team transformed chaotic, multi‑system operations into a unified, automated platform by standardizing hardware, processes, and tools, leveraging OpenStack, Docker, Zabbix, and custom scripts, ultimately achieving efficient monitoring, change management, and a mobile‑enabled DevOps workflow.

Monitoring · Ops Platform · automation
17 min read
dbaplus Community
Apr 13, 2017 · Backend Development

Scalable Small .NET E‑Commerce Architecture: Monitoring, DB Master‑Slave & Capacity Planning

This guide walks through the evolution of a small .NET‑based e‑commerce system, covering its initial LAMP‑style setup, detailed backend architecture, logging and monitoring solutions, master‑slave database design, shared‑storage image server, mobile M‑site construction, capacity estimation methods, and caching strategies.

Database · Monitoring · architecture
22 min read
Efficient Ops
Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Monitoring · Operations · alerting
21 min read
Tongcheng Travel Technology Center
Apr 10, 2017 · Operations

Sentinel Monitoring System: Real‑Time Business Log Monitoring and Incident Detection for an Airline Ticket Platform

The Sentinel system was built to provide real‑time, zero‑modification monitoring of airline ticket business services by consuming Tianwang logs through a Storm cluster, offering flexible rule configuration, addressing performance pitfalls, and planning future enhancements such as custom monitoring scripts and visual dashboards.

Kafka · Log Processing · Monitoring
6 min read
Efficient Ops
Apr 9, 2017 · Cloud Native

How Ctrip Achieved Seconds‑Level Scaling with a Container Cloud

Ctrip built a private container cloud to handle massive seasonal traffic spikes, enabling rapid, automated scaling and shrinking of resources, improving deployment speed, resource utilization, and operational intelligence across more than 20 business units.

Containerization · Ctrip · Monitoring
16 min read
Alibaba Cloud Developer
Apr 6, 2017 · Backend Development

How Alibaba Scaled GitLab to Support Millions of Users with Sharding and High‑Availability

This article details Alibaba Group's journey of transforming its GitLab deployment from a single‑node setup to a distributed, sharded architecture that handles tens of millions of daily requests, achieves near‑perfect reliability, and incorporates performance, monitoring, and disaster‑recovery innovations.

GitLab · Monitoring · Performance Optimization
15 min read
360 Zhihui Cloud Developer
Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Monitoring · Reliability · SRE
7 min read
Efficient Ops
Mar 30, 2017 · Backend Development

Designing a Scalable, Configurable Distributed Web Crawler

This article outlines the motivation, requirements, modular decomposition, and architecture of a distributed web crawling platform that emphasizes reusability, lightweight modules, real‑time monitoring, and easy configuration for diverse data‑collection tasks.

Backend Architecture · Monitoring · configuration
10 min read
Baidu Waimai Technology Team
Mar 30, 2017 · Backend Development

Design and Implementation of a Unified Voucher Issuance Platform for Baidu Waimai

This article describes the design, architecture, and operational features of Baidu Waimai's unified voucher issuance platform, detailing its four‑layer backend structure, permission and strategy configurations, flow‑control mechanisms, service isolation, monitoring visualizations, and re‑entrancy safeguards to support large‑scale marketing distribution.

Backend Architecture · Flow Control · Monitoring
7 min read
Qunar Tech Salon
Mar 23, 2017 · Cloud Native

Ctrip Container Cloud: Architecture, Scaling, and Operational Practices

The article explains how Ctrip's rapid business growth drove the need for elastic scaling, how container technology delivered second‑level provisioning, and how the container cloud platform was designed (deployment principles, network choices, orchestration evaluations, monitoring solutions, and the CDOS overview), offering practical insights for large‑scale cloud‑native operations.

Containerization · DevOps · Monitoring
16 min read
Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Metrics · Monitoring · Operations
12 min read
ITPUB
Mar 20, 2017 · Operations

How to Diagnose Linux Performance in the First 60 Seconds

Learn the essential Linux command-line tools and step-by-step commands you need to run within the first minute of logging into a server to quickly assess process activity, resource usage, and potential bottlenecks, enabling effective performance troubleshooting in production environments.

Command-line · Monitoring · Performance
12 min read
Architecture Digest
Mar 18, 2017 · Backend Development

Technical Strategies for Startup Engineering Teams: Simplicity, Cloud Servers, Databases, Caching, and DevOps

The article outlines practical engineering guidelines for internet startups, emphasizing simplicity, rapid development, resource efficiency, and the use of cloud servers, MySQL, caching, asynchronous processing, logging, monitoring, documentation, and integrated build‑deploy pipelines to build stable, low‑cost backend systems.

Backend Development · Caching · Monitoring
16 min read
Ctrip Technology
Mar 17, 2017 · Cloud Computing

Ctrip Container Cloud: Architecture, Elastic Scaling, and Monitoring Practices

This article details Ctrip's journey in building a private container cloud to support rapid business growth, covering elasticity challenges, container deployment principles, orchestration platform choices, network design, operational issues, custom executors, monitoring solutions, and the overarching CDOS system.

Cloud Computing · Docker · Mesos
16 min read
High Availability Architecture
Mar 16, 2017 · Operations

Stormcrow: Dropbox’s Scalable Feature‑Flag Platform for Rapid Deployment and A/B Testing

The article describes Dropbox’s Stormcrow system, a configurable feature‑gate platform that enables fast, safe rollout of new functionality across web, desktop, and mobile clients, supports granular A/B testing, leverages custom data fields, and integrates deployment, monitoring, and audit tooling for large‑scale operations.

A/B testing · Deployment · Monitoring
15 min read
Efficient Ops
Mar 1, 2017 · Operations

How Metrics-Driven Development Transforms Software Iteration and Ops

Metrics‑Driven Development (MDD) extends test‑driven principles by embedding real‑time monitoring into design, enabling rapid, precise, and granular software iterations, improving early problem detection, decision support, and aligning development with DevOps culture.

Metrics · Monitoring · Observability
13 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Incident Response · Monitoring · Operations
18 min read
Meituan Technology Team
Feb 24, 2017 · Operations

Improvements and Architecture of Mt-Falcon Monitoring System

Mt‑Falcon, Meituan’s re‑engineered successor to Zabbix, introduces a modular architecture—Agent, Transfer, HBS, Judge, Graph, Alarm, Portal—and extensive refactorings that boost memory efficiency, asynchronous data handling, multi‑condition alerts, and API exposure, enabling over one million QPS, 200 million metrics, and robust, scalable monitoring across the company.

Monitoring · alerting · architecture
24 min read
Efficient Ops
Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Gray Release · Mobile · Monitoring
21 min read
转转QA
Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Monitoring · Operations
4 min read
dbaplus Community
Feb 9, 2017 · Operations

Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights

This article shares JD’s large‑scale monitoring system (MDC) design, covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.

JD · MDC · Monitoring
10 min read
dbaplus Community
Feb 6, 2017 · Operations

How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations

CallGraph, JD.com’s in‑house distributed tracing platform, provides low‑intrusion, high‑performance monitoring for micro‑service ecosystems, enabling real‑time call‑graph analysis, TP metrics, flexible configuration, and future extensions such as deep‑learning‑driven insights.

Log Processing · Monitoring · distributed tracing
15 min read
Node Underground
Jan 24, 2017 · Operations

11 Essential Practices to Master Node.js Application Monitoring

Effective Node.js monitoring boosts competitiveness, user experience, and cost efficiency, and this guide outlines eleven key recommendations—from tracking downtime and response thresholds to linking performance with business metrics and leveraging third‑party APM tools—ensuring robust, noise‑free alerts and secure, scalable applications.

APM · DevOps · Monitoring
3 min read
Efficient Ops
Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Monitoring · Operations · alerts
5 min read

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Grafana · InfluxDB · Monitoring
12 min read
Liulishuo Tech Team
Dec 31, 2016 · Cloud Native

Designing Scalable and Reliable Backend Services at English Fluently: Architecture, Service Discovery, Monitoring, and Autoscaling

This article shares the engineering team’s experience of building a high‑growth, reliable backend for English Fluently, covering inter‑service communication with gRPC, service discovery, Docker‑based deployment, health‑checking, monitoring, autoscaling, Kubernetes orchestration, and multi‑cell availability strategies.

Autoscaling · Docker · Kubernetes
10 min read
dbaplus Community
Dec 26, 2016 · Databases

How to Build a Scalable, Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL management at scale, covering dedicated instance deployment, configuration consistency, multi‑instance creation, metadata collection, backup, monitoring, high‑availability with Zookeeper, and task orchestration using DBTask to achieve rapid, reliable database services.

DBTask · Database operations · Monitoring
0 likes · 12 min read
Alibaba Cloud Developer
Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Monitoring · Operations
0 likes · 18 min read
Weidian Tech Team
Dec 15, 2016 · Databases

How to Build a Scalable Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL operations—including multi‑instance deployment, metadata collection, monitoring, backup, and high‑availability using Zookeeper—so that large‑scale database services can be provisioned, managed, and scaled with minimal human intervention.

Database operations · Monitoring · Scaling
0 likes · 11 min read
Alibaba Cloud Developer
Dec 12, 2016 · Cloud Native

How Alibaba Built a Decade-Long Microservice Architecture: Challenges and Lessons

This article chronicles Alibaba's ten‑year journey from monolithic Java EE deployments to a cloud‑native microservice ecosystem, detailing the technical challenges, the evolution of its EDAS RPC frameworks, comprehensive monitoring, capacity planning, and the strategies that enabled resilient large‑scale services during massive traffic events.

Cloud Native · Monitoring · capacity planning
0 likes · 11 min read
Ctrip Technology
Dec 2, 2016 · Backend Development

Challenges and Practices in Service‑Oriented Splitting of Qunar Payment System

The article details the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform, covering Dubbo and HTTP service conventions, database sharding and read/write separation, asynchronous processing, multi‑system management, and comprehensive monitoring and alerting solutions.

Monitoring · asynchronous processing · database sharding
0 likes · 10 min read
ITFLY8 Architecture Home
Nov 21, 2016 · Operations

Taobao’s Scaling Secrets: Stateless Sessions, Caching, Service Splitting & Sharding

This article explains how Taobao achieves horizontal scalability by adopting stateless session handling, efficient client‑side cookie storage, multi‑level caching, service splitting with HSF, database sharding via TDDL, asynchronous messaging, unstructured data storage, and comprehensive monitoring and configuration management.

Caching · Monitoring · Service Splitting
0 likes · 18 min read
Efficient Ops
Nov 20, 2016 · Operations

Why Most Log‑Analysis Features Are Overrated and What Really Matters

The article critiques popular but unnecessary log‑analysis features—such as sub‑second alerts, endless pagination, flashy maps, full SQL support, bulk downloads, and live tail—arguing that focusing on practical alert content, efficient querying, and proper architecture yields far more value for IT operations.

DSL · Data visualization · Monitoring
0 likes · 10 min read
ITFLY8 Architecture Home
Nov 20, 2016 · Backend Development

How Meizu Scales Real‑Time Push to 600 M Messages/min: Architecture, Pitfalls & Solutions

The article details Meizu’s real‑time push system that supports 25 million online users and 600 million messages per minute, describing its four‑layer architecture, power‑saving strategies, network‑instability fixes, massive‑connection handling, monitoring practices, and gray‑release deployment techniques.

Distributed Systems · Monitoring · high concurrency
0 likes · 12 min read
Efficient Ops
Nov 17, 2016 · Operations

How Qunar Built an Automated Network Device Operations Platform to Boost Efficiency

This article explains how Qunar tackled growing network device management workload, low‑efficiency manual processes, and operational risk by designing an integrated platform that automates common tasks, enforces permission‑based controls, records audits, and provides real‑time monitoring and scalable data collection.

Monitoring · Platform · automation
0 likes · 8 min read
Efficient Ops
Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Monitoring · Operations Management · Service Orchestration
0 likes · 13 min read
Qunar Tech Salon
Nov 12, 2016 · Backend Development

Challenges and Solutions in Service‑Oriented Splitting of Qunar Payment System

The article examines the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform—including development efficiency, interface conventions, concurrency, security, monitoring, database sharding, read‑write separation, and asynchronous processing—and presents concrete solutions and best‑practice recommendations.

Backend Development · Monitoring · asynchronous processing
0 likes · 10 min read
ITPUB
Nov 11, 2016 · Databases

Essential Oracle SQL Queries for Performance Monitoring and Troubleshooting

This guide compiles a comprehensive set of Oracle SQL statements and explanations for detecting fragmented tables, index fragmentation, high clustering factor tables, session and process mapping, DML lock analysis, DDL lock inspection, active SQL tracking, resource usage statistics, and various performance‑related metrics, helping DBAs diagnose and tune database behavior efficiently.

Administration · Database · Monitoring
0 likes · 26 min read
Architecture Digest
Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · Distributed Systems · Monitoring
0 likes · 9 min read
Nightwalker Tech
Nov 9, 2016 · Operations

Best Practices for Service Monitoring and Alerting in E‑commerce Systems

The discussion outlines essential service‑monitoring techniques, including health checks, JVM metrics, period‑over‑period (ring‑ratio) comparison of traffic and payment volumes, client‑side exception tracking, third‑party CDN monitoring, alert thresholds, instrumentation via AOP or SDKs, and tooling such as Datadog, Zabbix, and the Elastic stack, to reliably detect and respond to incidents in e‑commerce environments.

Incident Response · Monitoring · alerting
0 likes · 10 min read
Efficient Ops
Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Monitoring · Operations · Performance
0 likes · 8 min read
Meituan Technology Team
Oct 28, 2016 · Big Data

Design and Architecture of the CAT Real-Time Monitoring System

The CAT real‑time monitoring system, open‑sourced in 2014 for Java applications, combines a lightweight ThreadLocal‑based client SDK, Netty‑driven asynchronous transport, and a highly scalable backend that processes ~100 TB of logs daily across 70 machines, using custom binary serialization, in‑memory modeling, segmented storage with 48‑bit indexing, and hourly aggregation to provide near‑full‑volume fault detection, localization, and performance analysis.

Distributed Systems · Java · Monitoring
0 likes · 18 min read
360 Zhihui Cloud Developer
Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Infrastructure · Monitoring · Open-Falcon
0 likes · 7 min read
Efficient Ops
Oct 17, 2016 · Operations

How Shanda Games Built a Scalable Automated Operations System

This article details Shanda Games' journey in designing and implementing a comprehensive automated operations platform—including installation, deployment, security, client and server updates, data analysis, backup, and monitoring—to efficiently manage hundreds of games across diverse hardware and operating systems.

Deployment · Monitoring · Operations
0 likes · 22 min read
360 Zhihui Cloud Developer
Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Incident Response · Monitoring · Operations
0 likes · 5 min read
360 Zhihui Cloud Developer
Sep 18, 2016 · Artificial Intelligence

How Linear Regression Can Tame Your Nighttime Alert Fatigue

This article explores how historical monitoring alerts can be analyzed and predicted using linear regression, guiding operations engineers to preprocess data, build regression models, and forecast future alert trends to reduce manual alarm handling and improve system stability.

Machine Learning · Monitoring · Operations
0 likes · 8 min read