Tagged articles
Architecture Digest
Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 24/7 operation.

backend operations, fault tolerance, high availability
0 likes · 23 min read
Programmer DD
Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Operations, PromQL, Prometheus
0 likes · 9 min read
HomeTech
Sep 19, 2019 · Industry Insights

How Autohome Scaled Its 818 Global Car Night to Millions of QPS: A Technical Deep Dive

The article details how Autohome tackled a severe market downturn by launching the 818 Global Car Night, describing the background, massive technical challenges, infrastructure scaling, high‑availability architecture, full‑link stress testing, monitoring, security measures, and the lessons learned for future large‑scale online events.

Cloud Computing, Scalability, high availability
0 likes · 30 min read
Java Captain
Sep 19, 2019 · Backend Development

A Comprehensive Overview of Microservice Architecture and Its Evolution

This article presents a detailed, step‑by‑step illustration of microservice architecture, covering its motivations, component breakdown, migration from monoliths, common pitfalls, monitoring, tracing, logging, gateway, service discovery, resilience patterns, testing strategies, frameworks, and the emerging service‑mesh approach.

Service Mesh, Tracing, fault tolerance
0 likes · 23 min read
Architects' Tech Alliance
Sep 17, 2019 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh and Best Practices

This article walks through the transition of an online supermarket from a simple monolithic web application to a fully fledged microservice architecture, highlighting the challenges, design decisions, component choices, monitoring, tracing, testing, and operational practices needed for a robust, scalable system.

Deployment, architecture, microservices
0 likes · 24 min read
dbaplus Community
Sep 16, 2019 · Operations

How to Build Effective Monitoring for Microservices: Logs, Tracing, and Metrics Explained

This article explains the three main monitoring approaches—log collection, distributed tracing, and metric gathering—in microservice architectures, outlines the layered monitoring model, lists key system, application, and user metrics, and reviews popular open‑source time‑series monitoring tools such as Prometheus, OpenTSDB, and InfluxDB.

Metrics, Observability, Prometheus
0 likes · 10 min read
FunTester
Sep 8, 2019 · Backend Development

How to Add Real‑Time Alert Notifications for API Test Failures in Java

This article explains how to detect server‑induced empty JSON responses during API automation, integrate the free AlertOver service for instant failure alerts, and provides complete Java code for a robust getHttpResponse method and an AlertOver utility class to send system, function, business, and reminder messages.

API testing, alert notification, backend
0 likes · 9 min read
360 Tech Engineering
Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps, Operations Automation, StackStorm
0 likes · 6 min read
Aotu Lab
Sep 6, 2019 · Frontend Development

How We Revamped Our Homepage with TypeScript, Webpack, and Accessibility Enhancements

The article details a comprehensive homepage redesign that introduced strict TypeScript type checking, migrated to a customized Webpack build, added Nightwatch.js automated tests, upgraded monitoring with BadJS and performance metrics, implemented skeleton screens, and improved accessibility for visually impaired users.

Frontend Optimization, TypeScript, accessibility
0 likes · 16 min read
DevOps Cloud Academy
Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus, an open‑source monitoring and alerting toolkit originally developed by SoundCloud and now a CNCF project, offers multidimensional data models, flexible queries, pull‑based data collection, various metric types (counter, gauge, summary, histogram), local and remote storage, service discovery, and integrates with Grafana for visualization.

Cloud Native, Metrics, Observability
0 likes · 8 min read
Liangxu Linux
Sep 4, 2019 · Operations

Automate Linux Memory & Swap Monitoring with Email Alerts

This guide walks through installing the msmtp email client, configuring mutt, using the free command to capture memory and swap statistics, writing Bash scripts to log and email the data, and scheduling the tasks with cron so alerts are sent when swap usage exceeds 80%.

Email, System Administration, monitoring
0 likes · 8 min read
MaGe Linux Operations
Sep 4, 2019 · Operations

Essential Linux Ops Tools: From Nethogs to Fail2ban with Installation Guides

This article presents a curated collection of practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—along with download links, installation commands, usage tips, and illustrative screenshots to help system administrators enhance monitoring, performance testing, and security.

monitoring
0 likes · 13 min read
Youzan Coder
Sep 4, 2019 · Cloud Native

How Youzan Built a Highly Available Kubernetes Platform for Massive E‑commerce

This article explains why Youzan chose Kubernetes, describes their multi‑IDC, multi‑cluster architecture with high‑availability master components, logging and monitoring solutions, custom service exposure, image building process, lifecycle hooks, continuous delivery pipeline, operational challenges faced, and future plans such as operators and auto‑scaling.

Kubernetes, Logging, Multi-Cluster
0 likes · 11 min read
dbaplus Community
Sep 2, 2019 · Operations

How Qunar Leverages AI‑Driven Fault Prediction and Health Management to Boost System Reliability

This article summarizes Zhang Yan's presentation at the 2019 Gdevops Global Agile Operations Summit, detailing Qunar's OPS goals, evolution of its automation platform, the adoption of PHM concepts from aerospace to internet services, and practical fault‑prediction workflows, metrics, and challenges for achieving higher availability.

PHM, Qunar, aiops
0 likes · 24 min read
macrozheng
Aug 30, 2019 · Backend Development

How to Build and Secure a Spring Boot Admin Dashboard with Eureka Integration

This tutorial walks through setting up Spring Boot Admin as a monitoring server and client, integrating it with Eureka for service discovery, adding Spring Security for authentication, and configuring email and custom notifications, complete with Maven and YAML configurations and Java code examples.

Java, Spring Boot, Spring Security
0 likes · 23 min read
转转QA
Aug 28, 2019 · Frontend Development

Using Puppeteer for UI Automation: Challenges, Solutions, and a Monitoring System

This article examines the difficulties of UI automation such as high script costs, instability, and rapid UI changes, and presents practical solutions using Puppeteer—including device emulation, robust test architecture with Mocha, error handling, dynamic selector strategies, and a monitoring system that captures screenshots and reports failures.

monitoring, nodejs, testing
0 likes · 11 min read
Youzan Coder
Aug 23, 2019 · Big Data

How to Build a Robust Event Logging Quality System with Real‑Time Validation

This article outlines common event‑logging quality problems, a systematic registration and real‑time validation framework built on Flink, configurable rule syntax, explainable results, continuous monitoring, targeted optimizations, and an evaluation model that together form a comprehensive quality‑center for big‑data platforms.

Big Data, Flink, data quality
0 likes · 11 min read
58 Tech
Aug 20, 2019 · Frontend Development

Architecture Design of a Front-End Monitoring Platform

This article describes the design and architecture of a front‑end monitoring platform, detailing its JS SDK, data analyzer, web UI, reference log‑collection architectures, use of Kafka, MySQL, Hive and HBase, scaling considerations, storage conventions, and operational best practices.

Frontend, JavaScript, Logging
0 likes · 8 min read
Programmer DD
Aug 13, 2019 · Operations

Mastering Prometheus Histograms: How Cumulative Buckets Simplify Metrics

This article explains the fundamentals of Prometheus histogram metrics, illustrates why they are cumulative, shows how to drop unwanted buckets with relabeling, and demonstrates quantile calculations using the histogram_quantile function, providing practical examples and code snippets for effective monitoring.

Histogram, Metrics, Observability
0 likes · 7 min read
DevOps
Aug 13, 2019 · Operations

Comprehensive DevOps Toolset Overview

This article presents a detailed, categorized list of DevOps tools—including version control, automated build and testing, CI/CD, container platforms, configuration management, micro‑service platforms, logging, and monitoring solutions—providing concise descriptions for each to help teams select appropriate utilities for modern software delivery pipelines.

Automation, Configuration Management, DevOps
0 likes · 14 min read
dbaplus Community
Jul 29, 2019 · Operations

How to Build a Cost‑Effective, Multi‑Layer Monitoring System for Distributed Applications

This article explains why comprehensive, multi‑layer monitoring is essential for distributed systems, outlines environment, program, and business metrics, recommends practical tools such as Zabbix, Open-Falcon, Prometheus, and Grafana, and provides a step‑by‑step evolution plan and alerting strategy.

Distributed Systems, Metrics, Observability
0 likes · 10 min read
58 Tech
Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Incident Management, Operations, Scalability
0 likes · 9 min read
Suning Technology
Jul 17, 2019 · Artificial Intelligence

What the 2019 International AIOps Challenge Reveals About AI‑Driven Operations

The 2019 International AIOps Challenge, co‑hosted by Suning Technology, the China Computer Federation, Tsinghua, Nankai and Huawei, showcased AI‑powered solutions for KPI anomaly detection, highlighted academic‑industry collaboration, and underscored the growing impact of intelligent monitoring on modern IT operations.

AI, Intelligent Operations, aiops
0 likes · 6 min read
21CTO
Jul 13, 2019 · Operations

How to Set Up Automated Linux Memory & Swap Monitoring with Email Alerts

Learn step‑by‑step how to install the msmtp email client, configure mutt, use the free command to monitor Linux memory and swap usage, write Bash scripts that log and email the results, and schedule these checks with cron for continuous system health alerts.

Bash, Email, Linux
0 likes · 7 min read
360 Tech Engineering
Jul 12, 2019 · Operations

StackStorm‑Based Monitoring Alert Auto‑Remediation Solution

This article introduces a StackStorm‑driven monitoring and alert auto‑remediation architecture that converges alarms, performs root‑cause analysis, and executes self‑healing actions, detailing its components, workflow, configuration examples, and real‑world deployment outcomes.

Auto‑Remediation, Operations Automation, StackStorm
0 likes · 7 min read
Meitu Technology
Jul 9, 2019 · Backend Development

Performance Optimization Practices in Meitu XiuXiu Community

The Meitu XiuXiu community tackled performance bottlenecks caused by rapid user growth by deploying end‑to‑end monitoring (client Hubble and RED‑based server metrics), full‑link load testing, DNS and image‑delivery optimizations, and server‑side tuning such as disabling biased locking and JIT warm‑up, emphasizing user experience and cross‑team collaboration.

DNS Optimization, Performance Optimization, backend
0 likes · 25 min read
dbaplus Community
Jul 8, 2019 · Big Data

How to Use ClickHouse Sampling and Materialized Views for Real‑Time Monitoring of Billion‑Scale Ad Traffic

This article explains how to handle high‑volume advertising monitoring by storing raw request logs in ClickHouse, enabling sampling and materialized views, and using TP999 metrics, aggregating tables, and Grafana queries to achieve fast, flexible, and low‑impact real‑time analytics on billions of events.

ClickHouse, Sampling, big-data
0 likes · 10 min read
Architecture Digest
Jul 8, 2019 · Backend Development

Evolution and Architecture of MaFengWo Payment Center (Version 1.0 → 2.0)

The article details the evolution of MaFengWo's payment center from a basic payment‑refund module (1.0) to a comprehensive, modular platform (2.0), describing its core capabilities, layered architecture, customizable checkout, routing management, monitoring system, and future micro‑service roadmap.

Backend Architecture, Scalability, monitoring
0 likes · 15 min read
System Architect Go
Jul 5, 2019 · Backend Development

Key Monitoring Metrics for Node.js Applications and Open‑Source Tools

This article explains why monitoring is essential for Node.js applications, outlines the most important performance metrics such as CPU usage, memory usage, garbage collection, event‑loop latency, clustering, and request/response latency, and introduces several ready‑to‑use open‑source monitoring tools.

Node.js, Open-source, monitoring
0 likes · 6 min read
Architecture Digest
Jul 5, 2019 · Operations

The Story of Elasticsearch and the Elastic Stack: From Origins to ELK

This article narrates the origin and evolution of Elasticsearch, its underlying Lucene technology, the surrounding Elastic Stack components such as Logstash, Kibana, and Beats, and illustrates how they together provide powerful search, logging, monitoring, and analytics solutions for modern applications.

Beats, Elastic Stack, Kibana
0 likes · 11 min read
Tencent IMWeb Frontend Team
Jul 4, 2019 · Cloud Computing

Migrating a Lightweight Web App to Serverless on Tencent Cloud: A Step‑by‑Step Guide

This article explains the fundamentals of Serverless architecture, its pros and cons, and provides a detailed, practical walkthrough for migrating a lightweight web application to Tencent Cloud's Serverless Cloud Function platform, covering architecture redesign, data storage, performance tuning, debugging, deployment, logging, and monitoring.

Debugging, Deployment, monitoring
0 likes · 22 min read
ITPUB
Jul 2, 2019 · Databases

How ClickHouse Powers Ctrip’s Hotel Data Platform for Billions of Daily Updates

This article explains how Ctrip’s hotel data intelligence platform handles over ten billion daily data updates and nearly a million queries by adopting ClickHouse, detailing the system's background, the reasons for choosing ClickHouse over other solutions, the data ingestion pipelines, monitoring strategies, operational practices, and performance outcomes.

Big Data, ClickHouse, data pipeline
0 likes · 13 min read
Java High-Performance Architecture
Jul 2, 2019 · Operations

How to Build Highly Available Systems: 8 Essential Strategies

This article outlines eight practical high‑availability techniques—multiple replicas, isolation, rate limiting, circuit breaking, degradation, gray releases with rollback, comprehensive monitoring, and proactive log alerting—to help engineers design systems that are both efficient and reliable under heavy load.

Circuit Breaker, Gray Release, Rate Limiting
0 likes · 7 min read
Architecture Digest
Jul 2, 2019 · Fundamentals

Key Practices for High Availability, Isolation, and Data Consistency in Large‑Scale Internet Systems

The article outlines essential techniques for building highly available internet services, covering system availability metrics, multi‑level caching, database and service isolation, concurrency control, gray‑release deployment, comprehensive monitoring, graceful degradation, asynchronous design, and data‑consistency scenarios for both real‑time and offline big‑data workloads.

Data Consistency, System architecture, high availability
0 likes · 8 min read
Tencent Cloud Developer
Jul 1, 2019 · Information Security

How to Detect and Prevent Cloud Data Leaks: Practical Strategies and Rule Configurations

This guide explains recent cloud‑based data‑leak incidents, categorizes common leak vectors, analyzes technical and managerial root causes, and provides actionable monitoring techniques, rule‑configuration examples, and incident‑response steps using Tencent Cloud Security Operations Center.

GitHub, Security Operations, Tencent Cloud
0 likes · 19 min read
dbaplus Community
Jun 27, 2019 · Artificial Intelligence

How AI Powers Intelligent Multi-Modal Financial Data Quality Monitoring

This article presents the design, implementation, and evaluation of X‑monitor, an AI‑driven, adaptive, multi‑modal financial data quality monitoring platform that combines rule‑based and self‑learning strategies to improve detection efficiency, accuracy, and flexibility for large‑scale securities‑firm data streams.

AI, Machine Learning, big-data
0 likes · 24 min read
Sohu Tech Products
Jun 26, 2019 · Operations

Distributed Tracing and Observability: Principles, OpenTracing Standard, and Open‑Source Solutions Comparison

This article explains how microservice complexity drives the need for observability, outlines its three pillars—logging, metrics, and tracing—describes OpenTracing concepts and APIs, and compares major open‑source distributed tracing systems to help engineers choose the right solution for fault localization, performance analysis, and capacity planning.

OpenTracing, monitoring
0 likes · 11 min read
Architecture Digest
Jun 25, 2019 · Operations

Design and Implementation of a Unified Monitoring and Alert System for MaFengWo Large Transportation Business

This article describes the motivation, architecture, key components, rule engine, alert actions, and practical lessons learned while building a unified monitoring and alarm system for MaFengWo's large‑scale transportation platform, highlighting data collection, Elasticsearch storage, scheduling, and future enhancements.

Elasticsearch, Java, alerting
0 likes · 13 min read
DevOps Cloud Academy
Jun 20, 2019 · Operations

Step-by-Step Installation and Configuration of Node Exporter, Alertmanager, Prometheus, and Grafana for Monitoring and Alerting

This guide walks through downloading, extracting, and setting up Node Exporter, Alertmanager, Prometheus, and Grafana on a Linux server, configuring their systemd services, customizing alert rules, and verifying the monitoring and alerting pipeline with screenshots of each verification step.

Alertmanager, Grafana, Operations
0 likes · 7 min read
ITPUB
Jun 20, 2019 · Operations

Essential Ops Lessons: Avoid Disasters with Backups, Permissions, and Monitoring

This article shares hard‑earned operational guidelines for Linux servers, covering safe testing, cautious use of rm ‑rf, the importance of backups, strict access control, SSH hardening, firewall rules, intrusion detection, systematic monitoring, performance tuning, and maintaining a calm mindset to prevent costly incidents.

Operations, Server Administration, monitoring
0 likes · 12 min read
Architecture Digest
Jun 19, 2019 · Big Data

Design and Optimization of Large‑Scale Log Systems for High‑Volume Data

This article examines the challenges of handling massive log data in large‑scale e‑commerce platforms, outlines a baseline ELK‑based architecture, discusses real‑time versus near‑real‑time requirements, and presents four optimization strategies—including basic tuning, platform scaling, data partitioning, and system degradation—to improve performance, resource utilization, and reliability.

ELK, Log Management, System Optimization
0 likes · 17 min read
Java Backend Technology
Jun 19, 2019 · Backend Development

Enterprise Redis: Scaling, Monitoring, and Business Isolation

This article explores how enterprises can effectively use Redis by partitioning clusters for independent or shared use, addressing key naming conflicts, implementing graceful scaling with Zookeeper, monitoring performance via Open-Falcon, and quickly isolating problematic business traffic to maintain system stability.

Business Isolation, Cluster, Redis
0 likes · 10 min read
Alibaba Cloud Developer
Jun 18, 2019 · Operations

Why Designing for Failure Is the Key to Resilient Systems

The article explains how anticipating and engineering for diverse failure scenarios—from hardware faults and software bugs to traffic spikes and external attacks—can dramatically improve system reliability, reduce downtime, and protect business continuity in modern distributed and cloud environments.

disaster recovery, failure design, monitoring
0 likes · 12 min read
Meitu Technology
Jun 12, 2019 · Cloud Computing

Meitu's Cloud-Based Image Beautification and Large-Scale Video Processing Architecture

Meitu replaced on-device beautification and video processing with a cloud-native architecture that routes requests by region, uses a dedicated upload SDK for detailed monitoring, employs edge-computing, a configuration-driven plug-in framework and Kubernetes-based elastic scaling, enabling fast, reliable, globally-distributed image and video services.

Cloud Computing, Meitu, Video processing
0 likes · 12 min read
Architecture Digest
Jun 12, 2019 · Fundamentals

Comprehensive Guide to Distributed System Theory – Curated Article Collection

This resource compiles a complete series of articles on distributed system theory covering consistency, consensus, high availability, scalability, performance, testing, and operations, offering both quick overviews for newcomers and in‑depth readings for practitioners seeking to master modern distributed architectures.

Consistency, Scalability, architecture
0 likes · 8 min read
DevOps Cloud Academy
Jun 9, 2019 · Operations

Prometheus Metric Definitions, Types, and Data Samples

This article explains Prometheus metric naming conventions, label usage, metric types such as Counter, Gauge, Summary, and Histogram, and describes the structure of data samples, providing examples and best‑practice guidelines for defining and classifying metrics in monitoring systems.

Metrics, Observability, Operations
0 likes · 5 min read
dbaplus Community
Jun 3, 2019 · Operations

Top 5 Open‑Source Log Analysis Tools Every Ops Team Should Try

Monitoring network activity and ensuring compliance requires effective log analysis, and this article reviews five open‑source tools—Graylog, Nagios, Elastic Stack, LOGalyze, and Fluentd—detailing their features, strengths, and use cases for operations and security teams.

log analysis, monitoring
0 likes · 11 min read
MaGe Linux Operations
May 28, 2019 · Operations

What Skills and Knowledge Do You Need to Master Large‑Scale Website Operations?

This article explains what large‑scale website operations entail, outlines the product lifecycle and the crucial role of operations engineers, lists essential technical skills and personal qualities, and discusses current challenges, future prospects, and key technical topics such as cluster management, monitoring, fault handling, and automation.

Automation, Cluster Management, DevOps
0 likes · 18 min read
Big Data Technology & Architecture
May 23, 2019 · Backend Development

Error Handling Strategies for Kafka Connectors: Immediate Stop, Silent Ignoring, and Dead‑Letter Queue

This article explains how to configure Kafka Connect error handling options—including stopping on failure, silently ignoring malformed messages, and routing failed records to a dead‑letter queue—while providing practical examples, monitoring techniques, and code snippets for robust data pipelines.

configuration, dead letter queue, error-handling
0 likes · 21 min read
dbaplus Community
May 22, 2019 · Operations

Designing a Scalable Monitoring System: From Data Collection to Alerting

This article explains how to build a comprehensive monitoring system for distributed applications by classifying monitoring functions, describing data quadrants, outlining core modules such as collection, processing, feature extraction, and visualization, and reviewing typical implementations for metrics, logs, tracing, alerting, and the key open‑source components involved.

Distributed Systems, Metrics, Tracing
0 likes · 18 min read
Efficient Ops
May 21, 2019 · Operations

Essential Linux Ops Tools: Nethogs, IOZone, IOTop, and More

This guide introduces a dozen practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, Tmux, and others—providing concise descriptions, download links, and ready‑to‑run installation commands to help system administrators boost monitoring, performance testing, and security on their servers.

Linux, Operations, Security
0 likes · 12 min read
Architects' Tech Alliance
May 13, 2019 · Operations

Comprehensive Guide to System Monitoring: Objectives, Methods, Tools, Processes, and Best Practices

This article provides a thorough overview of system monitoring, covering its objectives, practical methods, core concepts, a comparison of popular open‑source and commercial tools, detailed monitoring processes (using Zabbix as an example), key metrics, alerting strategies, interview tips, and a summary of how organizations extend monitoring solutions.

alerting, monitoring, zabbix
0 likes · 17 min read
Qu Tech
May 7, 2019 · Frontend Development

How to Pinpoint JavaScript Errors in Production Using Source Maps

This article explains how to use SourceMap files to trace minified JavaScript errors back to their original source lines, covering overall design, code examples, error reporting workflow, CI integration, storage strategies, and future monitoring enhancements.

error tracking, frontend debugging, monitoring
0 likes · 7 min read
Efficient Ops
May 6, 2019 · Operations

How Live Streaming Ops Ensure Real-Time Reliability at Scale

Zhang Guanshi, the operations director at Huya Live, shares how his team designs a hybrid‑cloud architecture, implements a six‑pillar reliability framework, and leverages real‑time monitoring, AIOps, and rapid‑recovery tools to maintain stable, low‑latency live video streams for millions of viewers.

Operations · cloud architecture · live streaming
0 likes · 22 min read
Efficient Ops
May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

Operations · PHM · aiops
0 likes · 18 min read
dbaplus Community
Apr 24, 2019 · Operations

Choosing and Tuning Open‑Source Monitoring Stacks for Large‑Scale Operations

This article reviews common open‑source monitoring tools, shares the evolution of China Unicom's big‑data platform monitoring, and provides practical guidance on selecting collectors, databases, and visualization components, with detailed configurations for Prometheus, Alertmanager, and Grafana, plus automated recovery techniques.

Alertmanager · Grafana · InfluxDB
0 likes · 19 min read
21CTO
Apr 19, 2019 · Operations

From Junior to Senior Ops Engineer: Master the Skills to Level Up

This guide walks you through the career ladder from junior to senior operations engineer, covering essential Linux, networking, monitoring, container, automation, and security skills, while offering practical advice on job roles, learning paths, and professional growth.

Containerization · DevOps · Operations
0 likes · 13 min read
ITPUB
Apr 19, 2019 · Operations

How to Level Up from Junior to Senior DevOps Engineer: A Complete Roadmap

This guide outlines the career stages, skill sets, and practical tasks for DevOps engineers—from entry‑level troubleshooting to senior‑level architecture, automation, and performance optimization—providing concrete learning paths, tools, and personal development advice to help engineers advance their operations careers.

Automation · Containerization · DevOps
0 likes · 12 min read
Efficient Ops
Apr 18, 2019 · Operations

Choosing the Right Monitoring Stack: From Nagios to Prometheus & Grafana

This article reviews common open‑source monitoring combinations, compares their strengths and weaknesses, and shares practical guidance on selecting collectors, storage back‑ends, and visualization tools such as Telegraf, InfluxDB, Prometheus, Grafana, and Alertmanager for large‑scale data platform operations.

Grafana · InfluxDB · Nagios
0 likes · 12 min read
Mafengwo Technology
Apr 18, 2019 · Frontend Development

How to Build an Efficient Front‑End Monitoring Data Collection System

This article explains why front‑end monitoring is essential for user experience, outlines the key data types to collect, and provides practical AOP‑based implementations for route changes, JavaScript errors, performance metrics, resource failures, API calls, and reliable log reporting.

AOP · Frontend · JavaScript
0 likes · 14 min read
ITPUB
Apr 15, 2019 · Operations

Essential Practices to Prevent Operational Failures and Boost System Availability

This guide outlines six practical strategies—rollback testing, cautious destructive actions, clear command prompts, verified backups, careful handovers, and proactive monitoring—to help operations teams minimize outages and maintain high system availability.

Availability · Incident Prevention · Operations
0 likes · 6 min read
ITFLY8 Architecture Home
Apr 14, 2019 · Operations

8 Essential DevOps Skills Every Engineer Should Master

Shane Boulden, a Red Hat DevOps certification expert, outlines the eight most valuable DevOps skills—from mastering Kubernetes and micro‑service scaling to automation, container optimization, multi‑runtime interaction, identity management, OS expertise, and effective learning strategies—providing a practical roadmap for 2019 and beyond.

Containers · Kubernetes · ci/cd
0 likes · 7 min read
Efficient Ops
Apr 1, 2019 · Operations

Beyond Linux: Mastering Modern Operations – From Deployment to Cloud

This article explores the full spectrum of modern operations, covering environment deployment, troubleshooting, backup, high availability, monitoring, security, automation, virtualization, and cloud services, while highlighting essential tools and best practices for both Linux and Windows environments.

Automation · Deployment · Operations
0 likes · 8 min read
Efficient Ops
Mar 31, 2019 · Operations

How to Design Actionable Alerts and Effective Monitoring Strategies

This article explains why most alerts are poorly designed, defines actionable alerts, outlines monitoring objectives, discusses metric selection, and presents simple yet powerful algorithms for anomaly detection to improve system reliability and operational efficiency.

Anomaly Detection · Metrics · Observability
0 likes · 21 min read
Architecture Digest
Mar 29, 2019 · Backend Development

Building Large-Scale Go Microservices at Toutiao: Architecture, Concurrency, Performance, and Monitoring

This article describes how Toutiao migrated its backend to Go, detailing the reasons for choosing Go, the design of a five‑tuple microservice architecture, concurrency models, timeout and performance optimizations, monitoring techniques, and engineering practices for large‑scale cloud‑native services.

cloud-native · monitoring · performance
0 likes · 16 min read
Ctrip Technology
Mar 28, 2019 · Operations

Comprehensive Guide to Enterprise WiFi Planning, Deployment, and Operations – Practices from Ctrip

This article presents a detailed, practice‑driven guide for enterprise WiFi, covering network planning, full‑coverage design, channel optimization, security, KPI‑based monitoring, probe‑based measurement, troubleshooting techniques, and real‑world case studies from Ctrip, highlighting how systematic operations can ensure high‑quality wireless service.

Operations · WiFi · case study
0 likes · 16 min read
58 Tech
Mar 25, 2019 · Artificial Intelligence

Machine Learning‑Based Threshold‑Free Monitoring for Business Metrics

This article describes a monitoring system that leverages machine learning to perform threshold‑free, real‑time anomaly detection on macro business indicators such as network traffic and access volume, detailing its architecture, sample labeling, model training, and multi‑level alarm strategies.

AI · Anomaly Detection · Machine Learning
0 likes · 7 min read
58 Tech
Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Operations · alarm convergence · alert merging
0 likes · 9 min read
Efficient Ops
Mar 23, 2019 · Operations

How to Build a Bank Ops SWAT Team for 5‑Minute Incident Recovery

This article explains how a bank can create a specialized Operations SWAT team, define its role, adopt seven essential “weapons” such as layered monitoring, intelligent alerts, communication protocols, automation, and disaster‑recovery tactics, and continuously train the team to meet strict five‑minute recovery targets.

Automation · Incident Response · SWAT team
0 likes · 21 min read
Tencent Music Tech Team
Mar 22, 2019 · Frontend Development

How to Build a Frontend User‑Behavior Tracing System for Debugging External Network Issues

This article analyzes the challenges of reproducing external‑network bugs, outlines common failure causes, and presents a complete design for a JavaScript SDK that records environment data, AJAX calls, errors, and user actions, stores them in IndexedDB, and visualizes the timeline for efficient troubleshooting.

Debugging · Frontend · IndexedDB
0 likes · 15 min read
转转QA
Mar 20, 2019 · Operations

Real-time Monitoring of H5 Pages Using Headless Browser and Puppeteer

This article describes a real‑time monitoring solution for large numbers of H5 pages that combines Python's Requests library for data crawling with a headless Chrome browser driven by Puppeteer to detect resource errors, API failures, and DOM anomalies, automatically alerting stakeholders.

Automation · Headless Browser · Node.js
0 likes · 8 min read
Efficient Ops
Mar 18, 2019 · Operations

How to Build a Bank Ops SWAT Team for Rapid Incident Recovery

This article explains how a bank can create a specialized SWAT‑style operations team, define its roles, adopt seven essential "weapons" such as monitoring and intelligent alerts, and apply ten tactical processes—from communication to automation—to meet strict five‑minute recovery and regulatory requirements.

Automation · Incident Response · SWAT team
0 likes · 21 min read
Alibaba Cloud Developer
Mar 18, 2019 · Operations

Alibaba Hema’s 7‑Layer Funnel & 23 Tactics for Ultra‑Fast Delivery Stability

The article outlines Alibaba’s Hema delivery platform’s end‑to‑end stability strategy, detailing a 7‑layer funnel review process, three core norms (development, architecture, stability), and 23 practical tactics—including core‑noncore isolation, proactive monitoring, fault prevention, rapid recovery, and service‑level controls—to ensure reliable 30‑minute deliveries despite complex logistics and external disruptions.

Operations · architecture · delivery
0 likes · 13 min read
QQ Music Frontend Team
Mar 17, 2019 · Frontend Development

How to Build a Front‑End User Behavior Tracing System for Faster Issue Diagnosis

This article explains the design and implementation of a front‑end user behavior tracing system, covering common external network problems, the importance of collecting runtime environment, data, JS errors, and interaction logs, and detailing SDK data collection, reporting strategies, server processing, and query platform visualization.

IndexedDB · User Behavior Tracking · ajax
0 likes · 14 min read
iQIYI Technical Product Team
Mar 15, 2019 · Cloud Computing

Design and Architecture of QLive Large‑Scale Live Streaming Service

The QLive service powers iQIYI’s massive live‑streaming events, such as the Spring Festival Gala, by combining vertical and horizontal scaling, a three‑layer architecture with dual data‑center isolation, multi‑level caching, circuit‑breaker/degradation controls, and a Flume‑Kafka‑Hive monitoring pipeline to sustain over 400k QPS and 99.9999% availability.

Caching · Vertical Scaling · fault tolerance
0 likes · 9 min read
Xianyu Technology
Mar 14, 2019 · Operations

Ensuring High Availability of Search Engine Services: A Case Study of Xianyu's Search System

The article explains how Xianyu guarantees high‑availability of its core Ha3‑based search engine through independent gateway deployment, multi‑datacenter disaster recovery, traffic isolation, comprehensive monitoring, pressure testing, gray releases, and automated/manual failover, enabling rapid issue detection, recovery, and continuous service stability.

Gray Release · System architecture · disaster recovery
0 likes · 19 min read
JD Tech
Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data · Operations · System architecture
0 likes · 12 min read
Efficient Ops
Mar 10, 2019 · Operations

Essential Linux and Java Debugging Tools for Rapid Issue Diagnosis

This guide compiles a practical toolbox of Linux commands and Java utilities—including tail, grep, awk, find, tsar, jstack, jmap, jstat, btrace, Greys, JProfiler, and RateLimiter—to help engineers quickly locate, analyze, and resolve performance and stability problems in production environments.

Debugging · monitoring · tools
0 likes · 12 min read
dbaplus Community
Mar 10, 2019 · Operations

How Alibaba’s Table Store Auto‑Solves Hotspot Issues with Real‑Time Load Balancing

This article explains the architecture and mechanisms of Alibaba Cloud's Table Store load‑balancing system, detailing how it collects metrics, detects user‑access and machine hotspots, and automatically applies actions such as partition moves, splits, merges, and isolation to maintain high availability and performance.

Alibaba Cloud · NoSQL · hotspot mitigation
0 likes · 17 min read
Efficient Ops
Mar 6, 2019 · Databases

How NetEase Built an Automated DBA Platform with AIOps for Massive Scale

This article details NetEase's journey in designing and implementing a large‑scale database automation platform, covering its requirements, tool‑based operations, architecture, AIOps integration, and the practical lessons learned for managing thousands of database clusters efficiently.

Operations · Scalability · aiops
0 likes · 20 min read
HomeTech
Feb 28, 2019 · Artificial Intelligence

How to Systematically Test and Monitor AI Models in Large‑Scale Production

This article presents a comprehensive approach to testing, automating, and monitoring AI prediction models in a high‑traffic environment, covering background, challenges, evaluation metrics, data sampling methods, automated test scripts, and online monitoring to ensure model accuracy, performance, and reliability.

AI testing · Automation · Big Data
0 likes · 13 min read