Tagged articles

2195 articles

Apr 8, 2020 · Operations

Bilibili DevOps Case Study: Culture, Community, User‑Driven Demand Management, High‑Performance Microservices, and Data Operations

This article presents a comprehensive DevOps case study of Bilibili, covering its cultural background, community ecosystem, user‑centric demand management, migration to high‑performance microservices, and the implementation of logging, monitoring, and real‑time data platforms to support rapid, reliable delivery.

Bilibili · Data Platform · DevOps

0 likes · 17 min read

Efficient Ops

Apr 6, 2020 · Databases

How to Build a MySQL Monitoring Platform with Prometheus and Grafana

This article walks through setting up a production‑grade MySQL monitoring solution using Prometheus and Grafana, covering exporter installation, MySQL user configuration, systemd service setup, Prometheus job definition, key MySQL performance metrics, and basic alerting rules.

Grafana · Metrics · MySQL

0 likes · 15 min read

ITFLY8 Architecture Home

Apr 5, 2020 · Backend Development

Master Spring Boot Actuator: Quick Start, Key Endpoints, and Security

This tutorial explains what Spring Boot Actuator is, then shows how to create a demo project, configure endpoint exposure, explore essential endpoints such as health, metrics, loggers, and shutdown, and secure them with Spring Security, with code snippets and configuration examples.

Actuator · Backend Development · Endpoints

0 likes · 14 min read

Java Backend Technology

Apr 5, 2020 · Backend Development

Mastering Micrometer: From Counters to Grafana Dashboards in Spring Boot

This tutorial walks through Micrometer's metric types, how to register them with MeterRegistry, apply tags and naming conventions, and integrate the framework into Spring Boot applications with Actuator, Prometheus scraping, and Grafana visualization for comprehensive backend monitoring.

Grafana · Java · Metrics

0 likes · 27 min read

360 Quality & Efficiency

Apr 3, 2020 · Operations

Prometheus Monitoring System: Concepts, Architecture, and Hands‑On Deployment with Node Exporter and Grafana

This article introduces the core concepts and architecture of the open‑source Prometheus monitoring system, explains its data model and metric types, and provides a step‑by‑step guide to install a Prometheus server, collect host metrics with Node Exporter, and visualize them using Grafana.

Grafana · Metrics · Observability

0 likes · 10 min read

Efficient Ops

Apr 1, 2020 · Operations

How to Use Nagios for Business-Level Service Monitoring: A Step-by-Step Guide

This article explains why traditional server and service monitoring (e.g., Zabbix) may miss business outages, then walks through setting up Nagios on Debian to monitor web page URLs, API health checks, and related services, covering configuration files, plugins, and the Nagstamon desktop alert tool.

Linux · Nagios · business availability

0 likes · 18 min read

Alibaba Terminal Technology

Apr 1, 2020 · Frontend Development

How to Build a Robust Frontend Safety Production System for High‑Reliability Web Apps

This article explains the concept of frontend safety production, outlines its evolution from basic monitoring to a systematic, cloud‑enabled framework, and details the core capabilities—pre‑change CI checks, gray‑release gating, and real‑time monitoring—required to ensure high‑quality, risk‑free frontend deployments.

CI · Frontend · Risk Assessment

0 likes · 12 min read

Java Captain

Apr 1, 2020 · Operations

Comprehensive Guide to Online Environment Deployment and Operations Practices

This article provides a thorough overview of planning, provisioning, and managing online production environments—including user sizing, bandwidth estimation, database design, OS versus container deployment, middleware selection, security, monitoring, SSH shortcuts, file transfer tools, automation scripts, Docker setup, and log viewing techniques—aimed at giving developers a complete operational perspective.

Deployment · Docker · Operations

0 likes · 16 min read

FunTester

Mar 31, 2020 · Operations

Interface Performance Testing – Tools, Scripts, and Guides

This article compiles a comprehensive list of resources—including tools, scripts, and tutorials—for conducting interface performance testing on Linux and other platforms, covering topics such as netdata localization, timewatch utility, load testing strategies, JVM heap dumps, and visualizing test data.

API · Linux · monitoring

0 likes · 6 min read

Continuous Delivery 2.0

Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management

0 likes · 8 min read

Efficient Ops

Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring

0 likes · 13 min read

Ops Development Stories

Mar 26, 2020 · Operations

How to Auto‑Discover and Monitor Redis Ports with Zabbix

This guide explains how to use Zabbix's auto‑discovery feature to automatically find Redis instances on a server, create shell or Python scripts for port detection, configure Zabbix agent keys, set up server‑side templates, discovery rules, item prototypes, graphs, and triggers, and finally apply the template to monitored hosts.

Auto-discovery · Python · Redis

0 likes · 9 min read

Efficient Ops

Mar 25, 2020 · Operations

How JD Logistics Built a 300‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime

This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.

Anomaly Detection · CMDB · Kafka

0 likes · 23 min read

Didi Tech

Mar 21, 2020 · Operations

Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring

Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon but adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, delivering scalable, cloud‑native observability for both container and bare‑metal workloads.

Cloud Native · Observability · Open-Falcon

0 likes · 10 min read

Open Source Linux

Mar 19, 2020 · Operations

Essential Ops Playbook: Avoid Costly Mistakes in Server Management

This guide shares practical Linux server operation rules, emphasizing thorough testing, careful use of destructive commands, strict access control, regular backups, security hardening, continuous monitoring, and disciplined performance tuning to prevent costly outages and data loss.

Performance tuning · backup · monitoring

0 likes · 13 min read

Efficient Ops

Mar 16, 2020 · Cloud Native

Designing a Scalable, High‑Availability Kubernetes Monitoring Solution at Xiaomi

This article details Xiaomi's implementation of a highly available, persistent, and dynamically scalable Kubernetes monitoring system, covering challenges, architecture choices, Prometheus federation, performance testing, and future enhancements for cloud‑native observability.

Kubernetes · Prometheus · monitoring

0 likes · 18 min read

Efficient Ops

Mar 8, 2020 · Operations

Prometheus vs Zabbix: Install, Configure & Visualize with Grafana

This article compares Prometheus with Zabbix, walks through downloading and installing Prometheus, explains the key sections of prometheus.yml, shows how to add a node_exporter for machine metrics, and demonstrates integrating Grafana to create rich monitoring dashboards.

Grafana · Linux · Prometheus

0 likes · 11 min read

Didi Tech

Mar 5, 2020 · R&D Management

Lean Development Practices and DevOps Implementation at Didi: Coding, Testing, Monitoring, and Ecosystem

At Didi, lean‑production ideas are woven into DevOps by establishing coding standards with SemVer and the NUWA framework, introducing traffic‑recording replay and a sim‑sidecar for realistic testing, extending monitoring with fine‑grained metrics, and unifying these practices into an ecosystem that cuts waste, speeds releases, and boosts overall software quality.

Framework · lean development · monitoring

0 likes · 7 min read

Efficient Ops

Mar 4, 2020 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring

This guide explains why monitoring is essential, describes the concept of availability "X nines," walks through Zabbix installation, web interface setup, host and template configuration, custom monitoring, alerting with OneAlert, visualization, distributed monitoring, SNMP integration, and provides practical command examples for managing large server fleets.

Linux · automation · monitoring

0 likes · 20 min read

Tencent IMWeb Frontend Team

Mar 4, 2020 · Frontend Development

How Tencent Classroom’s Front‑End Team Survived Pandemic Traffic Surges

During the COVID‑19 pandemic, Tencent Classroom’s front‑end team faced unprecedented traffic spikes, forcing rapid decisions on domain stability, video streaming, data platforms, messaging, monitoring, and deployment pipelines, while sharing lessons on scaling, resilience, and collaborative development under extreme pressure.

Deployment · Frontend · Scaling

0 likes · 13 min read

Programmer DD

Mar 4, 2020 · Frontend Development

Customize Grafana Themes Without Rebuilding the Source Code

This guide walks you through a step‑by‑step method to add and switch custom Grafana themes using the Boom Theme panel plugin and ready‑made theme packs from GitHub, enabling theme changes across dashboards without modifying Grafana's source code.

Frontend Development · Grafana · Theme Customization

0 likes · 5 min read

Wukong Talks Architecture

Mar 3, 2020 · Databases

Using Druid DataSource in Spring Boot: Configuration, Monitoring, and Troubleshooting

This article explains what Druid is, how to add the Druid dependency, configure it in Spring Boot's application.yml, set up monitoring with a custom DruidConfig class, and resolve common errors such as property binding failures and login issues.

Database Connection Pool · Druid · Java

0 likes · 7 min read

Beike Product & Technology

Feb 27, 2020 · Big Data

Real‑Time Computing with Apache Flink at Beike Zhaofang: Hermes Platform Overview and Future Plans

This article presents the evolution, architecture, and operational metrics of Beike Zhaofang's Hermes real‑time computing platform built on Apache Flink, detailing its business scale, SQL editors, task growth, monitoring, use cases, and future development directions.

Apache Flink · Big Data · Data Engineering

0 likes · 10 min read

Ops Development Stories

Feb 20, 2020 · Operations

Monitor OPNsense with Zabbix: Complete Template Installation Guide

This guide walks through downloading the pfSense Zabbix template, installing the os‑zabbix‑agent plugin on OPNsense, configuring custom agent parameters, testing connectivity, and setting up the host and template in Zabbix Server to monitor OPNsense metrics.

OPNsense · agent · monitoring

0 likes · 4 min read

Qunar Tech Salon

Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveliness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud · Observability · Operations

0 likes · 12 min read

Product Technology Team

Feb 19, 2020 · Frontend Development

How Zhenkun Built a Unified Frontend Tech Stack for Rapid Scaling

This article details how Zhenkun's frontend team responded to fast business growth by unifying their tech stack—introducing a private npm registry, a custom CLI scaffolding tool, Node.js backend, mock services, standardized webpack builds, DevOps automation, static resource delivery, monitoring, visual editors, UI component libraries, and automated testing—to boost development efficiency and maintainability across multiple locations.

DevOps · Frontend · automation

0 likes · 15 min read

Didi Tech

Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1 million passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations

0 likes · 13 min read

Alibaba Cloud Developer

Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud · capacity planning · chaos engineering

0 likes · 12 min read

Efficient Ops

Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · Incident Response · Performance Optimization

0 likes · 15 min read

Alibaba Cloud Developer

Feb 17, 2020 · Operations

How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Backend Architecture · Database Optimization · microservices

0 likes · 24 min read

ITPUB

Feb 10, 2020 · Operations

Essential Linux and Java Debugging Commands for Rapid Issue Diagnosis

This guide compiles a practical collection of Linux command‑line tricks and Java troubleshooting tools—such as tail, grep, awk, find, tsar, btrace, Greys, jstack, jmap and more—complete with usage examples, code snippets and visual outputs to help engineers quickly diagnose and resolve production problems.

Debugging · Troubleshooting · monitoring

0 likes · 17 min read

Architects' Tech Alliance

Feb 4, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the transformation of an online supermarket from a simple monolithic website to a fully fledged microservice architecture, highlighting the motivations, design decisions, common pitfalls, and essential components such as monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing strategies, and service mesh adoption.

Deployment · Service Mesh · architecture

0 likes · 22 min read

Big Data Technology Architecture

Jan 31, 2020 · Big Data

Practical Experience with HBase at NetEase: Architecture, Core Use Cases, HBCK & RIT Troubleshooting, and Diagnosis Strategies

This article summarizes NetEase Hangzhou Research Institute expert Fan Xinxin's presentation on HBase, covering its role in the big‑data ecosystem, core production scenarios, RIT and HBCK troubleshooting techniques, and systematic monitoring and log‑analysis methods for diagnosing HBase issues.

HBCK · HBase · RIT

0 likes · 11 min read

Java Backend Technology

Jan 23, 2020 · Backend Development

Master Spring Boot Actuator: Real‑Time Monitoring, Metrics, and Dynamic Log Levels

This tutorial walks you through using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, and shutdown, customizing health indicators, dynamically changing log levels at runtime, and securing actuator endpoints with Spring Security.

Actuator · Metrics · Security

0 likes · 14 min read

dbaplus Community

Jan 22, 2020 · Backend Development

How to Simulate 100 Billion WeChat Red‑Packet Requests on a Single Server

This article details a practical experiment that reproduces the load of 100 billion WeChat red‑packet (shake‑and‑grab) requests by simulating 1 million concurrent users on a single machine, achieving peak QPS of 60 k and demonstrating the architectural choices, hardware setup, and monitoring techniques required for such high‑throughput backend systems.

Go · QPS · high‑throughput

0 likes · 18 min read

Alibaba Cloud Native

Jan 22, 2020 · Backend Development

Mastering Microservices: RPC, Service Discovery, Config, Scheduling & More

This comprehensive guide explains the benefits of microservices and walks through core building blocks such as RPC, service discovery, configuration management, task scheduling, distributed locking, unified monitoring, caching strategies, message queues, distributed transactions, CAP theory, seckill handling, Docker isolation, and modern CI/CD deployment pipelines.

Configuration Management · Task Scheduling · backend

0 likes · 24 min read

Top Architect

Jan 21, 2020 · Operations

Comprehensive Guide to Java Application Performance Optimization and Troubleshooting

This article provides a detailed, step‑by‑step guide for diagnosing and fixing performance problems in Java applications, covering code‑level pitfalls, CPU and memory analysis, disk and network I/O bottlenecks, and a collection of practical command‑line tools for rapid troubleshooting.

JVM · Java · Performance Optimization

0 likes · 21 min read

Architect's Tech Stack

Jan 17, 2020 · Backend Development

Spring Boot Actuator: Quick Start, Key Endpoints, Monitoring and Security Integration

This article walks through using Spring Boot Actuator to monitor micro‑service applications, covering quick project setup, essential endpoints such as health, metrics, loggers and shutdown, custom health indicator implementation, dynamic log level changes, and securing actuator endpoints with Spring Security.

Endpoints · Java · Security

0 likes · 13 min read

JD Retail Technology

Jan 16, 2020 · Backend Development

Architecture and Key Technologies of a Scalable Message Push Platform

The document outlines the design, key components, data flow, and operational strategies of a large‑scale message push platform, detailing its architecture, request handling, long‑connection management, retry mechanisms, data statistics, monitoring, and future expansion plans.

Backend Architecture · Data Analytics · Long Connections

0 likes · 15 min read

DevOps Cloud Academy

Jan 16, 2020 · Cloud Native

Deploying Prometheus, Grafana, and Node Exporter on Kubernetes Using YAML Manifests

This guide walks through deploying node‑exporter, Prometheus, and Grafana on a Kubernetes cluster with YAML manifests, configuring services, RBAC, and Grafana dashboards to monitor cluster metrics, and includes verification steps and code examples.

Cloud Native · DevOps · Grafana

0 likes · 7 min read

Architecture Digest

Jan 14, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully split microservice system, highlighting the motivations, architectural changes, common pitfalls, and practical solutions such as monitoring, tracing, service discovery, circuit breaking, testing, and the eventual adoption of a service mesh.

Service Mesh · architecture · microservices

0 likes · 22 min read

Architecture Digest

Jan 12, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Core Components

This article explains the fundamentals of microservices architecture, detailing its definition, core principles such as small independent services and lightweight communication, the advantages and drawbacks, suitable organizational contexts, and the essential technical components like service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

architecture · gateway · microservices

0 likes · 15 min read

JD Retail Technology

Jan 8, 2020 · Operations

Comprehensive Guide to E‑commerce Promotion Traffic Management and System Preparation

This article explains how e‑commerce promotions differ from offline sales by offering lower participation thresholds and flexible discount tactics, outlines methods for estimating and handling traffic spikes, and provides detailed strategies for system capacity planning, load testing, monitoring, and incident response to ensure stable large‑scale promotional events.

Scaling · capacity planning · e‑commerce

0 likes · 23 min read

360 Tech Engineering

Jan 7, 2020 · Operations

Introduction to Prometheus and Grafana for Monitoring and Alerting

This article provides a comprehensive overview of using Prometheus and Grafana for metric collection, storage, querying with PromQL, visualization, and alerting, including exporter integration, metric types, high‑availability setups, and practical examples for modern microservice architectures.

Grafana · Metrics · Prometheus

0 likes · 10 min read

Aikesheng Open Source Community

Jan 6, 2020 · Databases

Introduction to the DBLE Management Console and Reload Command

This article introduces the DBLE management console, explains its dual role in administration and monitoring, demonstrates how the reload command hot‑applies configuration changes, and provides guidance on using select/show commands for database inspection.

DBLE · Database Management · Reload Command

0 likes · 3 min read

Aikesheng Open Source Community

Jan 2, 2020 · Operations

Monitoring Alibaba Cloud RDS with Prometheus, Grafana, and Custom Exporters

This guide explains how to monitor Alibaba Cloud RDS instances by deploying Prometheus and Grafana, using the official mysqld_exporter, a custom aliyun-exporter, rebuilding Docker images, configuring supervisor and Prometheus service discovery, and automating the entire workflow while noting limitations.

Alibaba Cloud · Docker · Exporter

0 likes · 8 min read

Efficient Ops

Dec 29, 2019 · Operations

Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis

This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.

CPU profiling · Linux · System Optimization

0 likes · 19 min read

Tencent Cloud Developer

Dec 27, 2019 · Cloud Computing

Tencent Classroom Video Migration to Tencent Cloud: Architecture, Implementation, and Lessons Learned

Tencent Classroom migrated roughly four million videos (about 1,500 TB) to Tencent Cloud in a two‑phase rollout that integrated cloud upload, transcoding, encrypted HLS playback with anti‑leech and DRM, added AI‑based content moderation, resolved SDK and multi‑region issues, and built a custom mini‑program player, ultimately boosting upload success rates, playback reliability, and security.

DRM · HLS encryption · Tencent Cloud

0 likes · 13 min read

Qunar Tech Salon

Dec 27, 2019 · Operations

Qunar Ticket Test‑Environment Governance and Automated Monitoring Framework

This article describes Qunar Ticket’s comprehensive test‑environment governance framework, including the “Mirror‑Inspect” monitoring service, configuration and data synchronization strategies, and automated allocation management, highlighting how these practices reduced environment‑related project delays from up to 20% to below 8%.

Configuration Management · Operations · monitoring

0 likes · 11 min read

Ops Development Stories

Dec 26, 2019 · Operations

How to Integrate ELK with Zabbix for Real‑Time Log Alerting

This guide explains how to combine ELK (Elasticsearch, Logstash, Kibana) with Zabbix using the logstash-output-zabbix plugin, configure Logstash pipelines to filter error keywords, and set up Zabbix templates and triggers for instant log‑based alerts.

ELK · Log Management · Logstash

0 likes · 15 min read

dbaplus Community

Dec 25, 2019 · Backend Development

How NetEase Cloud Music Built a Custom High‑Availability Message Queue on RocketMQ

This article details NetEase Cloud Music's journey from evaluating RabbitMQ, Kafka, and RocketMQ to designing a fully controllable, high‑availability message queue with failover, tracing, monitoring, and numerous custom extensions that now serve hundreds of services and billions of messages daily.

Distributed Systems · Message Queue · RocketMQ

0 likes · 15 min read

Aikesheng Open Source Community

Dec 25, 2019 · Operations

Deploying Thanos for Unified Prometheus Monitoring and Long‑Term Storage

This guide explains the background, key features, architecture, and step‑by‑step deployment of Thanos—including Sidecar, Store, Query, Compact, Bucket, Rule, and Check components—to provide a unified, high‑availability Prometheus monitoring view with unlimited historical data storage using object storage.

Cloud Native · Deployment · Long‑term Storage

0 likes · 9 min read

HomeTech

Dec 25, 2019 · Operations

Automation in Brand Advertising Testing and Monitoring to Enhance Efficiency and Quality

This project addresses challenges in brand advertising testing by implementing automated testing, monitoring, and data construction solutions, significantly improving efficiency, reducing manual effort, and enhancing product quality through real-time issue detection and resolution.

Operations · automation · data construction

0 likes · 5 min read

360 Tech Engineering

Dec 23, 2019 · Cloud Native

Using Thanos and Prometheus for Scalable Monitoring in OpenStack and Ceph Clusters

The article explains how Thanos combined with Prometheus provides a cloud‑native, highly available solution for long‑term metric storage and fast querying to address the exponential growth of monitoring data in large OpenStack and Ceph deployments.

Cloud Native · OpenStack · Prometheus

0 likes · 7 min read

Ops Development Stories

Dec 23, 2019 · Operations

How to Send Zabbix Alerts with Images to DingTalk via Python

This guide explains how to extract an item ID from Zabbix alerts, capture the corresponding chart image, upload it to a public server, format the alert as markdown, and deliver it through a DingTalk robot webhook using a Python script.

Alert · DingTalk · Python

0 likes · 8 min read

Efficient Ops

Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

Alarm Management · Operations · aiops

0 likes · 15 min read

Alibaba Cloud Developer

Dec 20, 2019 · Operations

How We Traced a 48‑Hour Memory Leak in a Distributed Coordination Service

This article details a step‑by‑step investigation of repeated follower process alerts in a Paxos‑based distributed coordination service, revealing a Java GC pause‑induced memory leak in the front‑end Proxy and describing the rapid mitigation actions taken to restore system stability.

Distributed Systems · Incident Response · Memory Leak

0 likes · 12 min read

Efficient Ops

Dec 19, 2019 · Operations

AIOps in Banking: Veteran’s Secrets to Smarter Operations

In this interview, veteran Bank of China software center analyst Yuan Chunliang shares two decades of experience, detailing how the bank’s shift to distributed core banking systems sparked the development of AIOps practices such as no‑threshold intelligent monitoring, multi‑indicator analytics, and AI‑driven ticket automation to boost operational efficiency and reduce risk.

Banking Technology · IT Operations · aiops

0 likes · 14 min read

Programmer DD

Dec 19, 2019 · Backend Development

Why Microservices Matter: Core Principles, Benefits, and Architecture Explained

This article introduces the fundamental concepts of microservices, covering their definition, advantages, design principles, core components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration, while also discussing suitable organizational structures and practical implementation details.

container orchestration · gateway · microservices

0 likes · 21 min read

Sohu Tech Products

Dec 18, 2019 · Backend Development

Node.js Performance Optimization: Common Techniques, Key Metrics, and Bottlenecks

This article answers a developer's question about Node.js performance optimization by outlining major optimization areas, listing practical techniques such as using streams, clustering, and load balancing, and describing typical bottlenecks and essential performance metrics to monitor.

Optimization · backend · monitoring

0 likes · 3 min read

MaGe Linux Operations

Dec 18, 2019 · Operations

Mastering Modern IT Operations: Roles, Practices, and Evolution

This article outlines the comprehensive responsibilities and evolution of IT operations, covering system, application, database, security, and platform management, detailing tasks such as infrastructure building, monitoring, optimization, automation, and the shift from manual processes to self‑scheduling systems.

IT Operations · Infrastructure · System Administration

0 likes · 20 min read

dbaplus Community

Dec 17, 2019 · Artificial Intelligence

How to Build a Scalable Intelligent Dispatch System for 400K Daily Orders

This article walks through the evolution of a ride‑hailing platform’s dispatch system—from a single‑database prototype to a data‑driven, AI‑powered architecture—detailing architectural choices, big‑data pipelines, model training, real‑time scheduling strategies, and monitoring practices for handling 400,000 daily orders.

AI · Dispatch · Ride Hailing

0 likes · 11 min read

360 Tech Engineering

Dec 17, 2019 · Backend Development

Diagnosing Java Memory Leaks: JVM GC Roots, Monitoring, and Code Fixes

This article explains how Java memory leaks can occur despite automatic garbage collection, describes JVM GC‑Root analysis, outlines practical monitoring with Spring Boot Actuator, Prometheus, and Grafana, and provides step‑by‑step debugging commands and code adjustments to locate and fix the leak.

Garbage Collection · JVM · Java

0 likes · 10 min read

360 Zhihui Cloud Developer

Dec 17, 2019 · Operations

How Thanos + Prometheus Solve Large‑Scale OpenStack Monitoring Challenges

This article explains how the Thanos and Prometheus combination provides long‑term, highly available monitoring for massive OpenStack and Ceph clusters, detailing its features, architecture, key components, practical deployment issues, and the operational problems it resolves.

Ceph · Observability · OpenStack

0 likes · 8 min read

WecTeam

Dec 17, 2019 · Frontend Development

How JD Optimized Its WeChat Shopping Homepage for Lightning‑Fast Performance

By combining server‑side rendering, critical‑render‑path tuning, resource minification, image format upgrades, and RAIL‑based multi‑dimensional monitoring, JD dramatically reduced its WeChat shopping homepage’s first‑screen load time, offering a practical roadmap for front‑end performance optimization.

Frontend · Image Optimization · RAIL model

0 likes · 17 min read

360 Quality & Efficiency

Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification · Operations · Service Port

0 likes · 8 min read

Ops Development Stories

Dec 7, 2019 · Operations

Automate Zabbix Monitoring: Fetch Host Metrics and Export to CSV with Python

This guide demonstrates how to use Zabbix's API with Python to retrieve host information, item IDs, historical and trend data, process the metrics, and automatically write them into an Excel/CSV file, enabling scheduled monitoring reports.

API · CSV · Python

0 likes · 8 min read

Ctrip Technology

Dec 5, 2019 · Backend Development

Node.js Engineering Practices at Ctrip: From Zero to One, Best Practices and Operations

This article details how Ctrip builds, deploys, tests, releases, and operates Node.js applications—including engineering processes, core middleware, Docker-based deployment, multi‑process communication, monitoring, and full‑link tracing—while sharing practical lessons learned from real‑world production use.

Backend Development · Best Practices · DevOps

0 likes · 14 min read

360 Tech Engineering

Dec 5, 2019 · Databases

Design and Implementation of a High‑Availability InfluxDB Cluster at 360

This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.

Cluster Architecture · InfluxDB · monitoring

0 likes · 8 min read

Meitu Technology

Dec 4, 2019 · Backend Development

Design and Implementation of lmstfy: A Redis‑Based Task Queue Service

lmstfy is a stateless, Redis‑backed task‑queue service from Meitu that provides delayed execution, automatic retries, priority handling, expiration, and a RESTful HTTP API, while supporting horizontal scaling via namespace‑based token routing, rich Prometheus metrics, and future disk‑based storage extensions.

Distributed Systems · Redis · Task Queue

0 likes · 15 min read

Java High-Performance Architecture

Dec 2, 2019 · Databases

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides an automated high‑availability solution for Redis by monitoring master health, broadcasting SDOWN/ODOWN messages, electing a new master based on priority, offset and runid, and allowing clients to discover the current master via sentinel commands, all explained with configuration examples and diagrams.

Redis · configuration · high availability

0 likes · 6 min read

Ops Development Stories

Nov 30, 2019 · Databases

How to Deploy Zabbix 4.4 with TimescaleDB on CentOS 7 – Step‑by‑Step Guide

This guide walks through installing Zabbix 4.4.0 on CentOS 7, configuring PostgreSQL, adding the TimescaleDB time‑series extension, setting up the Zabbix database, and tuning Linux, Nginx, and PHP so the monitoring platform runs smoothly with high‑performance time‑series storage.

CentOS · Database · Linux

0 likes · 11 min read

Efficient Ops

Nov 28, 2019 · Operations

Master Modern IT Operations: Skill Maps, ELK Architectures & Big Data Monitoring

This article explores the evolving landscape of IT operations, detailing role specializations, comprehensive skill maps for system, web, big data, and container ops, and compares three ELK logging architectures while emphasizing a data‑driven approach to monitoring and incident response.

Big Data · ELK · IT Operations

0 likes · 11 min read

dbaplus Community

Nov 27, 2019 · Operations

Scaling Ele.me’s Monitoring: From StatsD to a Unified LinDB‑Powered Platform

This article recounts Huang Jie’s presentation on the evolution of Ele.me’s monitoring system, detailing its three development stages, the challenges faced, the layered monitoring architecture, the design of a unified platform supporting both PC and mobile, and the underlying LinDB time‑series database.

EMonitor · LinDB · Observability

0 likes · 10 min read

MaGe Linux Operations

Nov 26, 2019 · Operations

Master Prometheus: From Basics to Advanced Configuration and Alerts

This article introduces Prometheus, an open‑source monitoring system, explains its core components such as server, exporters, and Alertmanager, provides step‑by‑step installation and configuration instructions, demonstrates alert rule setup, and shows integration with tools like Grafana, Telegraf, Spring Boot and Canal.

Alertmanager · DevOps · Grafana

0 likes · 10 min read

Huajiao Technology

Nov 26, 2019 · Backend Development

How Pepperbus Unifies Asynchronous Task Management Across Diverse Tech Stacks

This article details the design, requirements, architecture, and operational dashboard of Pepperbus, a unified bus system that standardizes asynchronous task handling for PHP, Java, and Go services at Huajiao, highlighting its storage plug‑in model, Redis‑based protocol, and monitoring capabilities.

Asynchronous · PHP · Queue

0 likes · 8 min read

dbaplus Community

Nov 25, 2019 · Operations

From Manual Ops to AI‑Powered Monitoring: Scaling Weibo Ads Infrastructure

This article outlines how the Weibo advertising team evolved its operations from hand‑crafted scripts to a fully automated, AI‑enhanced platform, covering service governance, multi‑datacenter deployment, a custom automation system (Kunkka), effective alerting, full‑link tracing, and a massive metric monitoring solution built on big‑data technologies.

DevOps · Trace · aiops

0 likes · 15 min read

DevOps Coach

Nov 24, 2019 · Cloud Native

Mastering Observability in Cloud‑Native Apps with Elastic Stack: A Four‑Step Guide

This article explains how cloud‑native applications can achieve full observability using the Elastic Stack by outlining the four essential steps—health checks, metrics, logs, and tracing—while discussing the underlying challenges, implementation patterns, and practical recommendations for reliable monitoring.

APM · cloud-native · elastic-stack

0 likes · 14 min read

Programmer DD

Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux · Operations · monitoring

0 likes · 11 min read

21CTO

Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance

0 likes · 13 min read

UCloud Tech

Nov 14, 2019 · Cloud Native

How LeXin Medical Streamlined Kubernetes with UCloud UK8S: A Migration Case Study

This article details LeXin Medical's journey from a manually built Kubernetes cluster to the UCloud UK8S platform, covering the challenges of self‑hosting, the tools and processes used for migration, and the resulting improvements in logging, monitoring, CI/CD, and overall operational efficiency.

Cloud Native · DevOps · Kubernetes

0 likes · 10 min read

Huajiao Technology

Nov 12, 2019 · Operations

How to Build a Scalable API Automation Framework for Search Services

This article explains the design, core features, implementation details, and real‑world deployment of the Auto_ApiTest tool for automating API testing in a large‑scale search platform, covering data management, configuration, code examples, CI integration, monitoring, and measurable outcomes.

API testing · Python · automation

0 likes · 17 min read

dbaplus Community

Nov 11, 2019 · Operations

How EMonitor Outperforms CAT: A Deep Dive into Meituan’s Monitoring Evolution

This article compares Meituan’s in‑house EMonitor with the open‑source CAT platform, outlines their core monitoring models, sampling pipelines, custom metrics and integration capabilities, and traces the evolution of monitoring stages from log‑based to intelligent root‑cause analysis.

CAT · Distributed Systems · EMonitor

0 likes · 16 min read

MaGe Linux Operations

Nov 10, 2019 · Operations

100 Essential Linux Ops Articles Curated by a Tech‑First Education Hub

The "MaGe Linux Operations" public account, built on a technology‑first philosophy, shares high‑quality, non‑clickbait content and has compiled the 100 most‑read Linux operations articles from the past three years, offering a comprehensive resource for sysadmins and DevOps engineers.

DevOps · automation · monitoring

0 likes · 8 min read

Qunhe Technology Quality Tech

Nov 9, 2019 · Operations

How We Cut BIM Drawing Failures from 0.01% to 0.0005% with Automated Monitoring

The BIM construction‑drawing team built an automated monitoring and validation tool using Spring Boot, REST‑Assured and JIRA APIs, turning a tedious manual bug‑fix workflow into a streamlined process that reduced online drawing‑failure rates from 0.01% to virtually zero.

BIM · Jira · Operations

0 likes · 5 min read

DataFunTalk

Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka

0 likes · 14 min read

Ops Development Stories

Nov 6, 2019 · Operations

How to Send Zabbix 4.2 Alerts with Embedded Images via Email and WeChat Using Python

This guide shows how to use Python 2.7 to extend Zabbix 4.2 alerts by attaching the current graph image to email and WeChat notifications, covering environment setup, script details, Zabbix media type configuration, and testing the final result.

Python · automation · email alerts

0 likes · 16 min read

NetEase Game Operations Platform

Nov 2, 2019 · Operations

Understanding Linux CPU Usage, Scheduling, and Performance Monitoring

This article explains how Linux reports CPU usage with tools like top, the meaning of the fields in /proc/stat, how utilization percentages are calculated, the concepts of run queues, load average, context switching, multi‑core scheduling, and how to use perf and taskset for deeper performance analysis.

CPU · Linux · Scheduling

0 likes · 15 min read

360 Quality & Efficiency

Nov 1, 2019 · Mobile Development

Using uiautomator1.0 for Android Automation: Shell Context, PackageManager, Database, Activity & Process Monitoring, and Chinese Input Support

This article demonstrates how to leverage uiautomator1.0 for Android automation by creating a shell‑based Context, accessing PackageManager, managing SQLite databases, monitoring app activities and processes, and implementing Chinese text input through AccessibilityNodeInfo.

Android · Database · automation

0 likes · 4 min read

System Architect Go

Oct 30, 2019 · Databases

InfluxDB Monitoring, Backup, and Restore Guide

This article explains InfluxDB's built‑in monitoring system, internal measurements, useful commands, HTTP endpoints, and provides detailed instructions for performing full backups and restores, including configuration tweaks, command syntax, and important considerations about formats and data ranges.

InfluxDB · Restore · TimeSeriesDB

0 likes · 5 min read

Tencent Cloud Developer

Oct 25, 2019 · Backend Development

High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Availability · Node.js · high concurrency

0 likes · 11 min read

Ctrip Technology

Oct 17, 2019 · Backend Development

CDubbo: Ctrip’s Customized Dubbo Framework – Architecture, Governance, Monitoring, and Extensions

This article describes how Ctrip introduced a customized Dubbo framework called CDubbo, covering the motivations for adopting Dubbo, the initial implementation of service governance and monitoring, and subsequent extensions such as callback enhancement, serialization support, circuit‑breaking, testing tools, and a bastion testing gateway.

Backend Development · Dubbo · RPC

0 likes · 13 min read

dbaplus Community

Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

Incident Management · On-Call · SRE

0 likes · 15 min read

Alibaba Cloud Infrastructure

Oct 16, 2019 · Operations

Intelligent Operations for Large-Scale Cloud Infrastructure: Insights from Alibaba and Intel at the 2019 Hangzhou Cloud Expo

At the 2019 Hangzhou Cloud Expo, Alibaba and Intel experts presented a series of intelligent operation solutions for large‑scale cloud infrastructure—including automated server repair, network change verification, application operation brain, monitoring advancements, power‑optimization, and data‑center management—demonstrating how AI‑driven techniques improve stability, cost, and efficiency.

Cloud Computing · Intelligent Operations · automation

0 likes · 7 min read

dbaplus Community

Oct 15, 2019 · Big Data

How to Build Real‑Time Data Pipelines for E‑Commerce Promotions

This article examines the surge in real‑time data demands for e‑commerce promotions, outlines how to collect, compute, and deliver streaming data, compares batch and stream processing, lists typical use cases, and discusses the challenges of building scalable, low‑latency pipelines.

Data Streaming · monitoring · real-time

0 likes · 11 min read

Efficient Ops

Oct 14, 2019 · Operations

How AIOps Transforms IT Operations: Real-World Architecture and Lessons

This article shares a practical case study of implementing AIOps in an online‑education company, covering the background pain points of massive monitoring data, the designed architecture with real‑time processing and machine‑learning pipelines, and the challenges and opportunities of intelligent operations.

Big Data · IT Operations · aiops

0 likes · 14 min read

Ops Development Stories

Oct 11, 2019 · Cloud Native

Deploy a Complete Prometheus Monitoring Stack on Kubernetes (Step‑by‑Step)

This guide walks through the architecture of Prometheus, the key Kubernetes monitoring metrics, and step‑by‑step instructions to deploy Prometheus, Grafana, and Alertmanager on a K8s cluster, configure RBAC, set up ConfigMaps, expose services, import dashboards, and test alert notifications via email.

Alertmanager · DevOps · Grafana

0 likes · 27 min read

37 Interactive Technology Team

Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Log Management · Operations · Python

0 likes · 8 min read

DevOps Cloud Academy

Sep 27, 2019 · Cloud Native

Configuring Prometheus Operator ServiceMonitor on OpenShift after Migrating from Mesos+Marathon

This article explains how to migrate a Mesos+Marathon environment to OpenShift and configure Prometheus Operator ServiceMonitor resources, including service creation, ServiceMonitor definition, and verification steps, with full YAML examples and screenshots of the monitoring UI.

Cloud Native · Kubernetes · OpenShift

0 likes · 6 min read

GF Securities FinTech

Sep 23, 2019 · Backend Development

Why Our Team Switched from Node.js to Go: Lessons in Backend Engineering

This article details how a high‑traffic trading app migrated from Node.js to Go, outlining Go's advantages, drawbacks, and the team's engineering practices—including environment management, dependency handling, efficiency tools, standardized libraries, testing, monitoring, and distributed tracing—to achieve robust, high‑performance backend services.

Backend Engineering · Go · ci/cd

0 likes · 16 min read
