Tagged articles
2195 articles
DevOps
Apr 8, 2020 · Operations

Bilibili DevOps Case Study: Culture, Community, User‑Driven Demand Management, High‑Performance Microservices, and Data Operations

This article presents a comprehensive DevOps case study of Bilibili, covering its cultural background, community ecosystem, user‑centric demand management, migration to high‑performance microservices, and the implementation of logging, monitoring, and real‑time data platforms to support rapid, reliable delivery.

Bilibili · Data Platform · DevOps
0 likes · 17 min read
Efficient Ops
Apr 6, 2020 · Databases

How to Build a MySQL Monitoring Platform with Prometheus and Grafana

This article walks through setting up a production‑grade MySQL monitoring solution using Prometheus and Grafana, covering exporter installation, MySQL user configuration, systemd service setup, Prometheus job definition, key MySQL performance metrics, and basic alerting rules.

Grafana · Metrics · MySQL
0 likes · 15 min read
ITFLY8 Architecture Home
Apr 5, 2020 · Backend Development

Master Spring Boot Actuator: Quick Start, Key Endpoints, and Security

This tutorial walks through what Spring Boot Actuator is, how to quickly create a demo project, configure endpoint exposure, explore essential endpoints such as health, metrics, loggers, and shutdown, and secure them with Spring Security, providing code snippets and configuration examples.

Actuator · Backend Development · Endpoints
0 likes · 14 min read
360 Quality & Efficiency
Apr 3, 2020 · Operations

Prometheus Monitoring System: Concepts, Architecture, and Hands‑On Deployment with Node Exporter and Grafana

This article introduces the core concepts and architecture of the open‑source Prometheus monitoring system, explains its data model and metric types, and provides a step‑by‑step guide to install a Prometheus server, collect host metrics with Node Exporter, and visualize them using Grafana.

Grafana · Metrics · Observability
0 likes · 10 min read
Efficient Ops
Apr 1, 2020 · Operations

How to Use Nagios for Business-Level Service Monitoring: A Step-by-Step Guide

This article explains why traditional server and service monitoring (e.g., Zabbix) may miss business outages, then walks through setting up Nagios on Debian to monitor web page URLs, API health checks, and related services, including configuration files, plugins, and a desktop alert tool, Nagstamon.

Linux · Nagios · business availability
0 likes · 18 min read
Alibaba Terminal Technology
Apr 1, 2020 · Frontend Development

How to Build a Robust Frontend Safety Production System for High‑Reliability Web Apps

This article explains the concept of frontend safety production, outlines its evolution from basic monitoring to a systematic, cloud‑enabled framework, and details the core capabilities—pre‑change CI checks, gray‑release gating, and real‑time monitoring—required to ensure high‑quality, risk‑free frontend deployments.

CI · Frontend · Risk Assessment
0 likes · 12 min read
Java Captain
Apr 1, 2020 · Operations

Comprehensive Guide to Online Environment Deployment and Operations Practices

This article provides a thorough overview of planning, provisioning, and managing online production environments—including user sizing, bandwidth estimation, database design, OS versus container deployment, middleware selection, security, monitoring, SSH shortcuts, file transfer tools, automation scripts, Docker setup, and log viewing techniques—aimed at giving developers a complete operational perspective.

Deployment · Docker · Operations
0 likes · 16 min read
FunTester
Mar 31, 2020 · Operations

Interface Performance Testing – Tools, Scripts, and Guides

This article compiles a comprehensive list of resources—including tools, scripts, and tutorials—for conducting interface performance testing on Linux and other platforms, covering topics such as netdata localization, timewatch utility, load testing strategies, JVM heap dumps, and visualizing test data.

API · Linux · monitoring
0 likes · 6 min read
Continuous Delivery 2.0
Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management
0 likes · 8 min read
Efficient Ops
Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities and required skill set, and shows how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring
0 likes · 13 min read
Ops Development Stories
Mar 26, 2020 · Operations

How to Auto‑Discover and Monitor Redis Ports with Zabbix

This guide explains how to use Zabbix's auto‑discovery feature to automatically find Redis instances on a server, create shell or Python scripts for port detection, configure Zabbix agent keys, set up server‑side templates, discovery rules, item prototypes, graphs, and triggers, and finally apply the template to monitored hosts.

Auto-discovery · Python · Redis
0 likes · 9 min read
Efficient Ops
Mar 25, 2020 · Operations

How JD Logistics Built a 300‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime

This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.

Anomaly Detection · CMDB · Kafka
0 likes · 23 min read
Didi Tech
Mar 21, 2020 · Operations

Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring

Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon but adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, delivering scalable, cloud‑native observability for both container and bare‑metal workloads.

Cloud Native · Observability · Open-Falcon
0 likes · 10 min read
Open Source Linux
Mar 19, 2020 · Operations

Essential Ops Playbook: Avoid Costly Mistakes in Server Management

This guide shares practical Linux server operation rules, emphasizing thorough testing, careful use of destructive commands, strict access control, regular backups, security hardening, continuous monitoring, and disciplined performance tuning to prevent costly outages and data loss.

Performance tuning · backup · monitoring
0 likes · 13 min read
Efficient Ops
Mar 8, 2020 · Operations

Prometheus vs Zabbix: Install, Configure & Visualize with Grafana

This article compares Prometheus with Zabbix, walks through downloading and installing Prometheus, explains the key sections of prometheus.yml, shows how to add a node_exporter for machine metrics, and demonstrates integrating Grafana to create rich monitoring dashboards.

Grafana · Linux · Prometheus
0 likes · 11 min read
Didi Tech
Mar 5, 2020 · R&D Management

Lean Development Practices and DevOps Implementation at Didi: Coding, Testing, Monitoring, and Ecosystem

At Didi, lean‑production ideas are woven into DevOps by establishing coding standards with SemVer and the NUWA framework, introducing traffic‑recording replay and a sim‑sidecar for realistic testing, extending monitoring with fine‑grained metrics, and unifying these practices into an ecosystem that cuts waste, speeds releases, and boosts overall software quality.

Framework · lean development · monitoring
0 likes · 7 min read
Efficient Ops
Mar 4, 2020 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring

This guide explains why monitoring is essential, describes the concept of availability "X nines," walks through Zabbix installation, web interface setup, host and template configuration, custom monitoring, alerting with OneAlert, visualization, distributed monitoring, SNMP integration, and provides practical command examples for managing large server fleets.

Linux · automation · monitoring
0 likes · 20 min read
Tencent IMWeb Frontend Team
Mar 4, 2020 · Frontend Development

How Tencent Classroom’s Front‑End Team Survived Pandemic Traffic Surges

During the COVID‑19 pandemic, Tencent Classroom’s front‑end team faced unprecedented traffic spikes, forcing rapid decisions on domain stability, video streaming, data platforms, messaging, monitoring, and deployment pipelines, while sharing lessons on scaling, resilience, and collaborative development under extreme pressure.

Deployment · Frontend · Scaling
0 likes · 13 min read
Programmer DD
Mar 4, 2020 · Frontend Development

Customize Grafana Themes Without Rebuilding the Source Code

This guide walks you through a step‑by‑step method to add and switch custom Grafana themes using the Boom Theme panel plugin and ready‑made theme packs from GitHub, enabling theme changes across dashboards without modifying Grafana's source code.

Frontend Development · Grafana · Theme Customization
0 likes · 5 min read
Qunar Tech Salon
Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveliness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud · Observability · Operations
0 likes · 12 min read
Product Technology Team
Feb 19, 2020 · Frontend Development

How Zhenkun Built a Unified Frontend Tech Stack for Rapid Scaling

This article details how Zhenkun's frontend team responded to fast business growth by unifying their tech stack—introducing a private npm registry, a custom CLI scaffolding tool, Node.js backend, mock services, standardized webpack builds, DevOps automation, static resource delivery, monitoring, visual editors, UI component libraries, and automated testing—to boost development efficiency and maintainability across multiple locations.

DevOps · Frontend · automation
0 likes · 15 min read
Didi Tech
Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on December 3, 2019 attracted 3.1 million passengers. This article describes the six pillars that kept the platform stable: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations
0 likes · 13 min read
Alibaba Cloud Developer
Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud · capacity planning · chaos engineering
0 likes · 12 min read
Efficient Ops
Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · Incident Response · Performance Optimization
0 likes · 15 min read
Alibaba Cloud Developer
Feb 17, 2020 · Operations

How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Backend Architecture · Database Optimization · microservices
0 likes · 24 min read
ITPUB
Feb 10, 2020 · Operations

Essential Linux and Java Debugging Commands for Rapid Issue Diagnosis

This guide compiles a practical collection of Linux command‑line tricks and Java troubleshooting tools—such as tail, grep, awk, find, tsar, btrace, Greys, jstack, jmap and more—complete with usage examples, code snippets and visual outputs to help engineers quickly diagnose and resolve production problems.

Debugging · Troubleshooting · monitoring
0 likes · 17 min read
Architects' Tech Alliance
Feb 4, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the transformation of an online supermarket from a simple monolithic website to a fully fledged microservice architecture, highlighting the motivations, design decisions, common pitfalls, and essential components such as monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing strategies, and service mesh adoption.

Deployment · Service Mesh · architecture
0 likes · 22 min read
Big Data Technology Architecture
Jan 31, 2020 · Big Data

Practical Experience with HBase at NetEase: Architecture, Core Use Cases, HBCK & RIT Troubleshooting, and Diagnosis Strategies

This article summarizes NetEase Hangzhou Research Institute expert Fan Xinxin's presentation on HBase, covering its role in the big‑data ecosystem, core production scenarios, RIT and HBCK troubleshooting techniques, and systematic monitoring and log‑analysis methods for diagnosing HBase issues.

HBCK · HBase · RIT
0 likes · 11 min read
Java Backend Technology
Jan 23, 2020 · Backend Development

Master Spring Boot Actuator: Real‑Time Monitoring, Metrics, and Dynamic Log Levels

This tutorial walks you through using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, and shutdown, customizing health indicators, dynamically changing log levels at runtime, and securing actuator endpoints with Spring Security.

Actuator · Metrics · Security
0 likes · 14 min read
dbaplus Community
Jan 22, 2020 · Backend Development

How to Simulate 10 Billion WeChat Red‑Packet Requests on a Single Server

This article details a practical experiment that reproduces the load of 10 billion WeChat red‑packet (shake‑and‑grab) requests by simulating 1 million concurrent users on a single machine, reaching a peak of 60k QPS, and describes the architectural choices, hardware setup, and monitoring techniques required for such high‑throughput backend systems.

Go · QPS · high‑throughput
0 likes · 18 min read
Alibaba Cloud Native
Jan 22, 2020 · Backend Development

Mastering Microservices: RPC, Service Discovery, Config, Scheduling & More

This comprehensive guide explains the benefits of microservices and walks through core building blocks such as RPC, service discovery, configuration management, task scheduling, distributed locking, unified monitoring, caching strategies, message queues, distributed transactions, CAP theory, seckill handling, Docker isolation, and modern CI/CD deployment pipelines.

Configuration Management · Task Scheduling · backend
0 likes · 24 min read
JD Retail Technology
Jan 16, 2020 · Backend Development

Architecture and Key Technologies of a Scalable Message Push Platform

The document outlines the design, key components, data flow, and operational strategies of a large‑scale message push platform, detailing its architecture, request handling, long‑connection management, retry mechanisms, data statistics, monitoring, and future expansion plans.

Backend Architecture · Data Analytics · Long Connections
0 likes · 15 min read
Architecture Digest
Jan 14, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully split microservice system, highlighting the motivations, architectural changes, common pitfalls, and practical solutions such as monitoring, tracing, service discovery, circuit breaking, testing, and the eventual adoption of a service mesh.

Service Mesh · architecture · microservices
0 likes · 22 min read
Architecture Digest
Jan 12, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Core Components

This article explains the fundamentals of microservices architecture, detailing its definition, core principles such as small independent services and lightweight communication, the advantages and drawbacks, suitable organizational contexts, and the essential technical components like service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

architecture · gateway · microservices
0 likes · 15 min read
JD Retail Technology
Jan 8, 2020 · Operations

Comprehensive Guide to E‑commerce Promotion Traffic Management and System Preparation

This article explains how e‑commerce promotions differ from offline sales by offering lower participation thresholds and flexible discount tactics, outlines methods for estimating and handling traffic spikes, and provides detailed strategies for system capacity planning, load testing, monitoring, and incident response to ensure stable large‑scale promotional events.

Scaling · capacity planning · e‑commerce
0 likes · 23 min read
360 Tech Engineering
Jan 7, 2020 · Operations

Introduction to Prometheus and Grafana for Monitoring and Alerting

This article provides a comprehensive overview of using Prometheus and Grafana for metric collection, storage, querying with PromQL, visualization, and alerting, including exporter integration, metric types, high‑availability setups, and practical examples for modern microservice architectures.

Grafana · Metrics · Prometheus
0 likes · 10 min read
Efficient Ops
Dec 29, 2019 · Operations

Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis

This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.

CPU profiling · Linux · System Optimization
0 likes · 19 min read
Tencent Cloud Developer
Dec 27, 2019 · Cloud Computing

Tencent Classroom Video Migration to Tencent Cloud: Architecture, Implementation, and Lessons Learned

Tencent Classroom migrated roughly four million videos (about 1,500 TB) to Tencent Cloud in a two‑phase rollout that integrated cloud upload, transcoding, encrypted HLS playback with anti‑leech and DRM, added AI‑based content moderation, resolved SDK and multi‑region issues, and built a custom mini‑program player, ultimately boosting upload success rates, playback reliability, and security.

DRM · HLS encryption · Tencent Cloud
0 likes · 13 min read
Qunar Tech Salon
Dec 27, 2019 · Operations

Qunar Ticket Test‑Environment Governance and Automated Monitoring Framework

This article describes Qunar Ticket’s comprehensive test‑environment governance framework, including the “Mirror‑Inspect” monitoring service, configuration and data synchronization strategies, and automated allocation management, highlighting how these practices reduced environment‑related project delays from up to 20% to below 8%.

Configuration Management · Operations · monitoring
0 likes · 11 min read
Aikesheng Open Source Community
Dec 25, 2019 · Operations

Deploying Thanos for Unified Prometheus Monitoring and Long‑Term Storage

This guide explains the background, key features, architecture, and step‑by‑step deployment of Thanos—including Sidecar, Store, Query, Compact, Bucket, Rule, and Check components—to provide a unified, high‑availability Prometheus monitoring view with unlimited historical data storage using object storage.

Cloud Native · Deployment · Long‑term Storage
0 likes · 9 min read
Efficient Ops
Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

Alarm Management · Operations · aiops
0 likes · 15 min read
Efficient Ops
Dec 19, 2019 · Operations

AIOps in Banking: Veteran’s Secrets to Smarter Operations

In this interview, veteran Bank of China software center analyst Yuan Chunliang shares two decades of experience, detailing how the bank’s shift to distributed core banking systems sparked the development of AIOps practices such as no‑threshold intelligent monitoring, multi‑indicator analytics, and AI‑driven ticket automation to boost operational efficiency and reduce risk.

Banking Technology · IT Operations · aiops
0 likes · 14 min read
Programmer DD
Dec 19, 2019 · Backend Development

Why Microservices Matter: Core Principles, Benefits, and Architecture Explained

This article introduces the fundamental concepts of microservices, covering their definition, advantages, design principles, core components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration, while also discussing suitable organizational structures and practical implementation details.

container orchestration · gateway · microservices
0 likes · 21 min read
MaGe Linux Operations
Dec 18, 2019 · Operations

Mastering Modern IT Operations: Roles, Practices, and Evolution

This article outlines the comprehensive responsibilities and evolution of IT operations, covering system, application, database, security, and platform management, detailing tasks such as infrastructure building, monitoring, optimization, automation, and the shift from manual processes to self‑scheduling systems.

IT Operations · Infrastructure · System Administration
0 likes · 20 min read
dbaplus Community
Dec 17, 2019 · Artificial Intelligence

How to Build a Scalable Intelligent Dispatch System for 400K Daily Orders

This article walks through the evolution of a ride‑hailing platform’s dispatch system—from a single‑database prototype to a data‑driven, AI‑powered architecture—detailing architectural choices, big‑data pipelines, model training, real‑time scheduling strategies, and monitoring practices for handling 400,000 daily orders.

AI · Dispatch · Ride Hailing
0 likes · 11 min read
360 Tech Engineering
Dec 17, 2019 · Backend Development

Diagnosing Java Memory Leaks: JVM GC Roots, Monitoring, and Code Fixes

This article explains how Java memory leaks can occur despite automatic garbage collection, describes JVM GC‑Root analysis, outlines practical monitoring with Spring Boot Actuator, Prometheus, and Grafana, and provides step‑by‑step debugging commands and code adjustments to locate and fix the leak.

Garbage Collection · JVM · Java
0 likes · 10 min read
WecTeam
Dec 17, 2019 · Frontend Development

How JD Optimized Its WeChat Shopping Homepage for Lightning‑Fast Performance

By combining server‑side rendering, critical‑render‑path tuning, resource minification, image format upgrades, and RAIL‑based multi‑dimensional monitoring, JD dramatically reduced its WeChat shopping homepage’s first‑screen load time, offering a practical roadmap for front‑end performance optimization.

Frontend · Image Optimization · RAIL model
0 likes · 17 min read
360 Quality & Efficiency
Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification · Operations · Service Port
0 likes · 8 min read
Ctrip Technology
Dec 5, 2019 · Backend Development

Node.js Engineering Practices at Ctrip: From Zero to One, Best Practices and Operations

This article details how Ctrip builds, deploys, tests, releases, and operates Node.js applications—including engineering processes, core middleware, Docker-based deployment, multi‑process communication, monitoring, and full‑link tracing—while sharing practical lessons learned from real‑world production use.

Backend Development · Best Practices · DevOps
0 likes · 14 min read
360 Tech Engineering
Dec 5, 2019 · Databases

Design and Implementation of a High‑Availability InfluxDB Cluster at 360

This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.

Cluster Architecture · InfluxDB · monitoring
0 likes · 8 min read
Meitu Technology
Dec 4, 2019 · Backend Development

Design and Implementation of lmstfy: A Redis‑Based Task Queue Service

lmstfy is a stateless, Redis‑backed task‑queue service from Meitu that provides delayed execution, automatic retries, priority handling, expiration, and a RESTful HTTP API, while supporting horizontal scaling via namespace‑based token routing, rich Prometheus metrics, and future disk‑based storage extensions.

Distributed Systems · Redis · Task Queue
0 likes · 15 min read
Java High-Performance Architecture
Dec 2, 2019 · Databases

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides an automated high‑availability solution for Redis by monitoring master health, broadcasting SDOWN/ODOWN messages, electing a new master based on priority, offset and runid, and allowing clients to discover the current master via sentinel commands, all explained with configuration examples and diagrams.

Redis · configuration · high availability
0 likes · 6 min read
MaGe Linux Operations
Nov 26, 2019 · Operations

Master Prometheus: From Basics to Advanced Configuration and Alerts

This article introduces Prometheus, an open‑source monitoring system, explains its core components such as server, exporters, and Alertmanager, provides step‑by‑step installation and configuration instructions, demonstrates alert rule setup, and shows integration with tools like Grafana, Telegraf, Spring Boot and Canal.

Alertmanager · DevOps · Grafana
0 likes · 10 min read
Huajiao Technology
Nov 26, 2019 · Backend Development

How Pepperbus Unifies Asynchronous Task Management Across Diverse Tech Stacks

This article details the design, requirements, architecture, and operational dashboard of Pepperbus, a unified bus system that standardizes asynchronous task handling for PHP, Java, and Go services at Huajiao, highlighting its storage plug‑in model, Redis‑based protocol, and monitoring capabilities.

Asynchronous · PHP · Queue
0 likes · 8 min read
dbaplus Community
Nov 25, 2019 · Operations

From Manual Ops to AI‑Powered Monitoring: Scaling Weibo Ads Infrastructure

This article outlines how the Weibo advertising team evolved its operations from hand‑crafted scripts to a fully automated, AI‑enhanced platform, covering service governance, multi‑datacenter deployment, a custom automation system (Kunkka), effective alerting, full‑link tracing, and a massive metric monitoring solution built on big‑data technologies.

DevOps · Trace · aiops
0 likes · 15 min read
DevOps Coach
Nov 24, 2019 · Cloud Native

Mastering Observability in Cloud‑Native Apps with Elastic Stack: A Four‑Step Guide

This article explains how cloud‑native applications can achieve full observability using the Elastic Stack by outlining the four essential steps—health checks, metrics, logs, and tracing—while discussing the underlying challenges, implementation patterns, and practical recommendations for reliable monitoring.

APM · cloud-native · elastic-stack
0 likes · 14 min read
Programmer DD
Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux · Operations · monitoring
0 likes · 11 min read
21CTO
Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance
0 likes · 13 min read
UCloud Tech
Nov 14, 2019 · Cloud Native

How LeXin Medical Streamlined Kubernetes with UCloud UK8S: A Migration Case Study

This article details LeXin Medical's journey from a manually built Kubernetes cluster to the UCloud UK8S platform, covering the challenges of self‑hosting, the tools and processes used for migration, and the resulting improvements in logging, monitoring, CI/CD, and overall operational efficiency.

Cloud Native · DevOps · Kubernetes
0 likes · 10 min read
Huajiao Technology
Nov 12, 2019 · Operations

How to Build a Scalable API Automation Framework for Search Services

This article explains the design, core features, implementation details, and real‑world deployment of the Auto_ApiTest tool for automating API testing in a large‑scale search platform, covering data management, configuration, code examples, CI integration, monitoring, and measurable outcomes.

API testing · Python · automation
0 likes · 17 min read
DataFunTalk
Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka
0 likes · 14 min read
360 Quality & Efficiency
Nov 1, 2019 · Mobile Development

Using uiautomator1.0 for Android Automation: Shell Context, PackageManager, Database, Activity & Process Monitoring, and Chinese Input Support

This article demonstrates how to leverage uiautomator1.0 for Android automation by creating a shell‑based Context, accessing PackageManager, managing SQLite databases, monitoring app activities and processes, and implementing Chinese text input through AccessibilityNodeInfo.

Android · Database · automation
0 likes · 4 min read
System Architect Go
Oct 30, 2019 · Databases

InfluxDB Monitoring, Backup, and Restore Guide

This article explains InfluxDB's built‑in monitoring system, internal measurements, useful commands, HTTP endpoints, and provides detailed instructions for performing full backups and restores, including configuration tweaks, command syntax, and important considerations about formats and data ranges.

InfluxDB · Restore · TimeSeriesDB
0 likes · 5 min read
Tencent Cloud Developer
Oct 25, 2019 · Backend Development

High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Availability · Node.js · high concurrency
0 likes · 11 min read
Ctrip Technology
Oct 17, 2019 · Backend Development

CDubbo: Ctrip’s Customized Dubbo Framework – Architecture, Governance, Monitoring, and Extensions

This article describes how Ctrip introduced a customized Dubbo framework called CDubbo, covering the motivations for adopting Dubbo, the initial implementation of service governance and monitoring, and subsequent extensions such as callback enhancement, serialization support, circuit‑breaking, testing tools, and a bastion testing gateway.

Backend Development · Dubbo · RPC
0 likes · 13 min read
dbaplus Community
Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

Incident Management · On-Call · SRE
0 likes · 15 min read
Alibaba Cloud Infrastructure
Oct 16, 2019 · Operations

Intelligent Operations for Large-Scale Cloud Infrastructure: Insights from Alibaba and Intel at the 2019 Hangzhou Cloud Expo

At the 2019 Hangzhou Cloud Expo, Alibaba and Intel experts presented a series of intelligent operation solutions for large‑scale cloud infrastructure—including automated server repair, network change verification, application operation brain, monitoring advancements, power‑optimization, and data‑center management—demonstrating how AI‑driven techniques improve stability, cost, and efficiency.

Cloud Computing · Intelligent Operations · automation
0 likes · 7 min read
dbaplus Community
Oct 15, 2019 · Big Data

How to Build Real‑Time Data Pipelines for E‑Commerce Promotions

This article examines the surge in real‑time data demands for e‑commerce promotions, outlines how to collect, compute, and deliver streaming data, compares batch and stream processing, lists typical use cases, and discusses the challenges of building scalable, low‑latency pipelines.

Data Streaming · monitoring · real-time
0 likes · 11 min read
Efficient Ops
Oct 14, 2019 · Operations

How AIOps Transforms IT Operations: Real-World Architecture and Lessons

This article shares a practical case study of implementing AIOps in an online‑education company, covering the background pain points of massive monitoring data, the designed architecture with real‑time processing and machine‑learning pipelines, and the challenges and opportunities of intelligent operations.

Big Data · IT Operations · aiops
0 likes · 14 min read
37 Interactive Technology Team
Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Log Management · Operations · Python
0 likes · 8 min read
GF Securities FinTech
Sep 23, 2019 · Backend Development

Why Our Team Switched from Node.js to Go: Lessons in Backend Engineering

This article details how a high‑traffic trading app migrated from Node.js to Go, outlining Go's advantages, drawbacks, and the team's engineering practices—including environment management, dependency handling, efficiency tools, standardized libraries, testing, monitoring, and distributed tracing—to achieve robust, high‑performance backend services.

Backend Engineering · Go · ci/cd
0 likes · 16 min read