Tagged articles

2194 articles

Oct 3, 2025 · Big Data

How Qunar Travel Cut 2000 CPU Cores by Optimizing Kafka Production

This case study details how Qunar Travel's engineering team analyzed Kafka production bottlenecks during peak traffic, added targeted monitoring, tuned thread and batch parameters, and validated the changes through gray‑scale tests, ultimately saving about 2000 CPU cores across three clusters while reducing request volume and improving network and disk utilization.

Big Data · CPU Savings · Kafka

0 likes · 14 min read

Ops Community

Oct 2, 2025 · Operations

How to Fix Nginx 502 Bad Gateway Errors: A 90% Success Checklist

This article provides a comprehensive, step‑by‑step checklist for diagnosing and resolving Nginx 502 Bad Gateway errors, covering backend service verification, configuration checks, log analysis, resource monitoring, network troubleshooting, special scenarios, and long‑term preventive measures.

502 · Bad Gateway · Monitoring

0 likes · 25 min read

MaGe Linux Operations

Oct 1, 2025 · Operations

How Automated Ops Cut Service Restarts by 80% and Save Hours Daily

Discover a comprehensive automated operations framework that eliminates manual service restarts, reduces repetitive tasks by 80%, accelerates fault recovery from minutes to seconds, and boosts reliability through health checks, Kubernetes self‑healing, Systemd scripts, monitoring, and scalable deployment strategies.

Automation · Monitoring · Operations

0 likes · 37 min read

MaGe Linux Operations

Sep 30, 2025 · Cloud Native

How I Cut Kubernetes Troubleshooting Time from 30 Minutes to 3 Minutes

This article presents a complete, step‑by‑step method for reducing average Kubernetes fault‑diagnosis time from half an hour to under three minutes, covering the root causes of slow manual debugging, a one‑click diagnostic script, efficient kubectl shortcuts, visual tools, log aggregation, automated response workflows, and real‑world case studies.

Automation · DevOps · Monitoring

0 likes · 50 min read

Ops Community

Sep 29, 2025 · Cloud Native

Enterprise Docker Deployment: From Zero to Production – A Complete Guide

This comprehensive guide walks through the evolution of container technology, explains Docker's core mechanisms, and presents enterprise‑grade architecture, deployment strategies, monitoring, security hardening, and real‑world case studies, helping ops engineers build efficient, scalable, and secure production‑ready Docker environments.

Containerization · Docker · Enterprise Deployment

0 likes · 19 min read

Tech Freedom Circle

Sep 28, 2025 · Backend Development

Midnight TODO That Nearly Crashed the Whole Department: A JVM Performance Tuning Case Study

During a midnight promotion launch, a forgotten TODO caused thread‑pool exhaustion and frequent Full GC, bringing down an e‑commerce service; the article presents a five‑step end‑to‑end JVM tuning methodology, from data collection to root‑cause verification and code fix, showing how to diagnose and resolve such incidents.

Full GC · Heap Dump · JVM

0 likes · 24 min read

Linux Ops Smart Journey

Sep 28, 2025 · Operations

Why Nightingale Is the Next‑Gen Open‑Source Monitoring Solution for Cloud‑Native Ops

This article introduces Nightingale, an open‑source, high‑availability monitoring and alerting platform designed for cloud‑native environments, compares it with Grafana and Prometheus, and provides step‑by‑step deployment, configuration, and login instructions for rapid adoption.

Monitoring · cloud-native · nightingale

0 likes · 6 min read

Architecture Breakthrough

Sep 28, 2025 · Operations

How to Build an Organizational High‑Availability Mechanism for Banking IT Production Issues

This article outlines a comprehensive, step‑by‑step framework for establishing a high‑availability system in large‑scale banking IT, covering goal definition, logical architecture, service classification, key activity identification, capability upgrades, monitoring, emergency‑response asset creation, technical debt tracking, and periodic post‑mortem redesign.

Monitoring · Operations · Process Design

0 likes · 10 min read

Ray's Galactic Tech

Sep 26, 2025 · Operations

Master Spring Boot Admin: Real‑Time Monitoring for Microservices

Spring Boot Admin is an open‑source tool that provides real‑time health checks, JVM metrics, log management, environment inspection, JMX control, and customizable alerts for Spring Boot applications, and this guide explains its core features, architecture, quick setup, advanced security, notification, Actuator integration, and production best practices.

Admin · Java · Monitoring

0 likes · 7 min read

Ray's Galactic Tech

Sep 26, 2025 · Cloud Native

How to Deploy Production-Ready Spring Boot Apps on Kubernetes (V2 Guide)

Learn step-by-step how to prepare, containerize, and securely deploy a Spring Boot application on Kubernetes, covering health checks, metrics, logging, JVM tuning, multi-stage Docker builds, Helm-like resources, ConfigMaps, Secrets, Ingress, HPA, monitoring, CI/CD pipelines, and rollback strategies for production-grade reliability.

Docker · Kubernetes · Monitoring

0 likes · 9 min read

DevOps Operations Practice

Sep 24, 2025 · Cloud Native

How to Seamlessly Transition from Traditional Ops to Cloud Native: A Practical Guide

This article outlines the fundamental differences between traditional operations and cloud‑native practices, presents a four‑step migration strategy—including containerization, Kubernetes adoption, monitoring overhaul, and cultural shift—and highlights common pitfalls and measurable outcomes for a successful digital transformation.

Containerization · Monitoring · digital transformation

0 likes · 7 min read

Ops Community

Sep 24, 2025 · Operations

How Ops Engineers Can Stop Online Outages in Minutes: A Proven Emergency Playbook

This article outlines why a solid incident‑response plan is critical, describes typical failure scenarios, introduces the 3‑5‑10 rule for rapid diagnosis and mitigation, provides ready‑to‑run scripts for system checks, traffic throttling, service rollback, and showcases automation, AIOps and chaos‑engineering techniques to turn reactive firefighting into proactive resilience.

Incident Response · Monitoring · aiops

0 likes · 18 min read

MaGe Linux Operations

Sep 24, 2025 · Operations

How a 3 AM MySQL Crash Taught Me Essential Ops Lessons

This article recounts a 3 AM MySQL outage, analyzes its root causes, and shares comprehensive operational strategies—including index optimization, connection‑pool tuning, slow‑query fixing, replication lag handling, monitoring metrics, automation scripts, performance tuning, security hardening, and future trends—to help DBAs prevent and resolve similar incidents.

Automation · Database operations · Monitoring

0 likes · 15 min read

macrozheng

Sep 23, 2025 · Operations

How a Visual Bash Script Can Simplify SpringBoot Service Management and Deployment

Manual start‑stop, unclear status, scattered logs and risky rollbacks make SpringBoot production deployments error‑prone, while a visual, configuration‑driven Bash manager provides an intuitive UI, real‑time monitoring, intelligent start/stop, batch operations and automated deployment to dramatically improve efficiency and reliability.

Bash script · Deployment Automation · Monitoring

0 likes · 22 min read

Java One

Sep 21, 2025 · Operations

Mastering Prometheus rate, irate, and increase: When and How to Use Each

This article explains how Prometheus’s rate, irate, and increase functions calculate counter growth rates, handle counter resets, and differ in smoothing and responsiveness, guiding you to choose the appropriate function for monitoring request rates, CPU usage, and other metrics.

Metrics · Monitoring · Prometheus

0 likes · 7 min read

Ray's Galactic Tech

Sep 21, 2025 · Cloud Native

How to Deploy a High‑Availability RocketMQ Cluster on Kubernetes with Helm

Learn a step‑by‑step solution to deploy a production‑grade RocketMQ cluster on Kubernetes, covering architecture design with StatefulSets, Helm chart or native YAML configurations, persistent storage, external access, monitoring, security hardening, and one‑click installation commands.

CloudNative · Kubernetes · Monitoring

0 likes · 10 min read

IT Architects Alliance

Sep 20, 2025 · Operations

Mastering Microservice Governance: Tracing, Config, and Monitoring Strategies

This article explores the three core challenges of microservice governance—distributed tracing, centralized configuration management, and comprehensive monitoring—offering practical solutions, tool comparisons, and best‑practice guidelines to help architects build reliable, observable, and maintainable systems.

Cloud Native · Configuration Management · Monitoring

0 likes · 12 min read

Ops Community

Sep 19, 2025 · Operations

From Midnight Outage to Zero Downtime: Mastering NFS High‑Availability

This article recounts a critical NFS failure that caused massive loss, then walks through practical high‑availability designs—including Keepalived + DRBD, GlusterFS migration, and cloud‑native CSI storage—while sharing real‑world pitfalls, monitoring strategies, and forward‑looking recommendations for resilient file‑system operations.

Distributed File System · Monitoring · NFS

0 likes · 12 min read

MaGe Linux Operations

Sep 17, 2025 · Operations

Unlock 5 CI/CD Ops Secrets to Triple Deployment Speed

This comprehensive guide reveals essential CI/CD operational techniques—from pipeline bottleneck detection and Docker multi‑stage builds to parallel execution, smart testing, blue‑green and canary deployments, full‑stack monitoring, cost‑saving cloud strategies, and a real‑world e‑commerce case study—helping teams dramatically boost efficiency, reliability, and security.

Automation · Docker · Kubernetes

0 likes · 46 min read

Linux Tech Enthusiast

Sep 16, 2025 · Operations

A Comprehensive Guide to Linux Performance Optimization

This article provides an in‑depth, step‑by‑step walkthrough of Linux performance optimization, covering key metrics such as throughput and latency, how to interpret average load, CPU and memory usage, context‑switch analysis, common bottlenecks, and the most effective tools (vmstat, pidstat, perf, strace, dstat, etc.) with concrete command examples and real‑world case studies to help you diagnose and resolve performance issues.

Monitoring · Optimization · performance

0 likes · 36 min read

DevOps Coach

Sep 15, 2025 · Operations

10 Underrated Linux Tools Every Sysadmin Should Master

This guide presents ten lesser‑known but powerful Linux utilities—such as at, systemd‑run, tuned, lsof/ss, journalctl, chattr, MOTD/issue, watch/diff, strace/ltrace, and hidden cron checks—each with practical examples to boost daily sysadmin efficiency and confidence.

Automation · Linux · Monitoring

0 likes · 7 min read

IT Architects Alliance

Sep 14, 2025 · Operations

How to Build Truly High‑Availability Systems: Principles, Patterns & Code

This article explores the core concepts, design principles, and practical code examples for building high‑availability architectures, covering fault isolation, load balancing, data replication, monitoring, and cost‑benefit considerations to keep large‑scale services running reliably.

Cloud Native · Monitoring · System Design

0 likes · 11 min read

Raymond Ops

Sep 14, 2025 · Operations

Mastering Concurrency: Optimize Nginx, HAProxy & Keepalived for High‑Performance Servers

This article explains the fundamentals of concurrency, distinguishes connections from requests, shows how to calculate and tune maximum concurrent connections for Nginx and HAProxy, covers system resource limits, demonstrates real‑time monitoring with stub_status, and provides practical load‑testing and Prometheus monitoring guidance.

AB testing · Concurrency · HAProxy

0 likes · 15 min read

Ops Community

Sep 14, 2025 · Operations

Boost Linux Ops 10×: Master Systemd Service Management from Beginner to Pro

This comprehensive guide walks you through Systemd fundamentals, core architecture, unit types, practical service creation, socket activation, timer units, performance tuning, resource control, security hardening, debugging, and production best practices, empowering Linux administrators to dramatically improve service management efficiency and reliability.

Monitoring · Service Management · Systemd

0 likes · 28 min read

Java Tech Enthusiast

Sep 14, 2025 · Operations

How to Use Java Agent for Non‑Intrusive SpringBoot Monitoring

Learn how to implement a Java Agent that enables non‑intrusive monitoring of SpringBoot applications, covering agent basics, bytecode manipulation with Byte Buddy, metric collection via Micrometer, Prometheus/Grafana integration, and advanced extensions such as JVM metrics, HTTP client tracing, and distributed tracing.

Micrometer · Monitoring · Prometheus

0 likes · 16 min read

Rare Earth Juejin Tech Community

Sep 11, 2025 · Backend Development

How a Single Looped Serialization Turned a Major Promotion into a System Avalanche

A 2021 midnight promotion in Hangzhou crashed when a poorly placed loop serialized a massive object twenty times per request, overwhelming CPU, thread pools, and the Tair cache, leading to a full‑stack service avalanche that was only resolved after a half‑hour emergency rollback.

Caching · Incident Response · Monitoring

0 likes · 10 min read

Architect

Sep 10, 2025 · Operations

Building System Stability: A Backend Engineer’s Guide to Risk Management

This article explores system stability from a backend perspective, defining its academic and engineering meanings, quantifying metrics like SLA, MTBF and MTTR, analyzing why stability matters, outlining the challenges faced, and presenting practical steps—including resource consensus, goal setting, awareness cultivation, production standards, monitoring, emergency response, and regular inspections—to effectively build and maintain stable systems.

Monitoring · Operations · risk management

0 likes · 25 min read

NiuNiu MaTe

Sep 10, 2025 · Backend Development

How to Quickly Resolve Message Queue Backlog and Keep Your System Stable

This article explains what message queue backlog is, why it harms system latency, and provides practical, step‑by‑step strategies—including temporary consumer scaling, prioritizing core messages, queue splitting, root‑cause analysis, performance tuning, message design, dead‑letter handling, traffic control, capacity planning, and monitoring—to eliminate backlog and ensure reliable asynchronous processing.

Backlog · Message Queue · Monitoring

0 likes · 21 min read

Liangxu Linux

Sep 8, 2025 · Operations

Unlock 30‑50% Faster Linux Performance: A Complete CPU, Memory & Disk I/O Tuning Guide

This article provides a systematic, end‑to‑end guide for diagnosing and optimizing Linux system performance across CPU, memory, and disk I/O layers, offering concrete commands, metric thresholds, real‑world case studies, and advanced techniques such as NUMA and container tuning.

CPU · Linux · Monitoring

0 likes · 14 min read

Ops Community

Sep 8, 2025 · Operations

Mastering Distributed Log Architecture: From Flume to ELK and Beyond

This comprehensive guide walks you through the challenges of large‑scale log collection, real‑time processing, storage optimization, and visualization, detailing practical configurations for Flume, Logstash, Elasticsearch, Kibana, Filebeat, Kafka, Kubernetes, and future AIOps integrations to build a reliable, cost‑effective distributed logging system.

ELK · Flume · Kafka

0 likes · 24 min read

MaGe Linux Operations

Sep 7, 2025 · Databases

Master MySQL Slow Query Analysis: Proven SQL Optimization Techniques to Boost Performance

This comprehensive guide walks you through diagnosing MySQL slow queries, from identifying root causes and configuring slow‑query logs to applying advanced indexing, query‑rewriting, and monitoring techniques—complete with real‑world case studies that demonstrate how to cut query times from seconds to milliseconds.

Monitoring · MySQL · SQL optimization

0 likes · 28 min read

Selected Java Interview Questions

Sep 7, 2025 · Operations

How Tianji Unifies Website Analytics, Server Monitoring, and Alerts in One Lightweight Platform

Tianji is an open‑source all‑in‑one monitoring solution that combines website analytics, uptime monitoring, and server health checks with multi‑channel alerts, offering Docker‑based quick deployment, a responsive React dashboard, and extensible alert scripts for developers and small teams.

Docker · Monitoring · alerting

0 likes · 6 min read

Architect

Sep 6, 2025 · Operations

Master High-Concurrency Nginx: Core Configs, Advanced Tuning, and Real-World Checklist

This guide walks you through the common high‑traffic pain points of Nginx, explains why configuration and tuning matter more than hardware, and provides step‑by‑step core, advanced, OS‑level, monitoring, and troubleshooting configurations to reliably handle tens of thousands of concurrent connections.

Linux · Monitoring · Performance tuning

0 likes · 11 min read

Ops Community

Sep 4, 2025 · Databases

Avoid Redis Nightmares: Proven Deployment and Optimization Guide

This comprehensive guide walks you through Redis production deployment, persistence strategies, performance tuning, security hardening, real‑world case studies, and failure recovery, helping you prevent common pitfalls and keep your cache layer reliable and fast.

Monitoring · Optimization · Persistence

0 likes · 21 min read

dbaplus Community

Sep 3, 2025 · Operations

How to Build System Stability: Definitions, Challenges, and Practical Steps

This article explains what system stability means, why it matters, the difficulties of building it, and provides a detailed, step‑by‑step framework—including risk formulas, resource planning, monitoring, and emergency response—to help backend teams improve reliability and reduce business impact.

Incident Response · Monitoring · risk management

0 likes · 23 min read

Efficient Ops

Sep 3, 2025 · Operations

Master Zabbix: From Core Concepts to Full Installation on Debian/Ubuntu

This guide introduces Zabbix's monitoring architecture, key features, step‑by‑step installation on Debian/Ubuntu, configuration of server, database, and agents, plus essential troubleshooting commands for a reliable monitoring setup.

DevOps · Installation · Linux

0 likes · 7 min read

ITPUB

Sep 3, 2025 · Backend Development

How We Boosted Kafka Throughput by 35% with Filebeat Tuning and Compression Tricks

This case study details how a high‑traffic Kafka logging cluster was optimized by analyzing low compression ratios, tuning Filebeat parameters, adjusting memory queues and round‑robin settings, and validating the changes through gray‑scale tests, resulting in up to 35% higher throughput and significant resource savings.

Filebeat · Kafka · Monitoring

0 likes · 10 min read

dbaplus Community

Sep 1, 2025 · Operations

How to Keep VictoriaMetrics Stable During Sudden Metric Surges

This article outlines practical strategies for protecting VictoriaMetrics storage under bursty metric traffic, covering communication with business teams, splitting deployments, choosing single‑node versus cluster setups, key monitoring metrics, separate storage for self‑monitoring, the VMUI Explore UI, and techniques for discarding high‑cardinality metrics.

Metrics · Monitoring · VictoriaMetrics

0 likes · 10 min read

Java Architect Essentials

Aug 31, 2025 · Backend Development

How Global Exception Handling Can Slash Crash Rates by 90% in Java Services

This article explains why uncaught exceptions can cripple a Java backend, demonstrates a three‑layer global exception handling strategy with Spring Boot, shows how circuit‑breaker rules further protect services, and provides real‑world data proving crash rates can drop from over 4% to under 0.1%.

Backend Development · Circuit Breaker · Exception Handling

0 likes · 8 min read

Mingyi World Elasticsearch

Aug 30, 2025 · Operations

INFINI Console FAQ: Enterprise‑Grade Unified Elasticsearch Management

The article introduces INFINI Console, an open‑source, lightweight platform for unified, multi‑cluster and cross‑version Elasticsearch governance, compares it with Kibana, details deployment options, enterprise‑level features such as monitoring, alerting and security, and analyzes cost advantages and practical migration scenarios.

Cluster Management · Elasticsearch · INFINI Console

0 likes · 13 min read

Ops Community

Aug 30, 2025 · Information Security

Master Linux Server Hardening: From Manual Steps to Automated Scripts

This comprehensive guide walks you through Linux server security hardening, covering real-world incident analysis, a detailed checklist of system, SSH, firewall, kernel and logging configurations, plus ready-to-use Bash scripts, Ansible playbooks, Docker hardening, monitoring tools, and actionable steps to build an enterprise‑grade defense.

Docker · Hardening · Linux

0 likes · 17 min read

Code Mala Tang

Aug 30, 2025 · Backend Development

How to Log API Requests Without Slowing Down Your Server

Effective API logging is essential for debugging and compliance, but naive synchronous logging can block the event loop, exhaust disk I/O, and degrade performance; this guide explains why, and provides ten practical steps—including asynchronous loggers, buffering, offloading, sensitive data masking, and monitoring—to keep your server fast and reliable.

API logging · Asynchronous · Log Management

0 likes · 15 min read

MaGe Linux Operations

Aug 29, 2025 · Operations

How to Supercharge Nginx for Millions of QPS: A Complete Guide

Discover proven strategies to optimize Nginx under extreme traffic, covering benchmark testing, kernel tuning, configuration tweaks, caching, load balancing, SSL hardening, monitoring, and real-world case studies that demonstrate how to achieve stable high‑QPS performance while minimizing latency and resource usage.

Monitoring · Optimization · high-concurrency

0 likes · 22 min read

ITPUB

Aug 29, 2025 · Operations

Why Operations Engineers Are Anything But Low‑Skill: A Deep Dive into Their Real Technical Challenges

The article debunks the myth that operations work is low‑skill by detailing the extensive monitoring, Linux, networking, security, and firefighting expertise required, illustrating real‑world scenarios, tools, and best‑practice recommendations that highlight the critical, high‑level technical role of ops engineers.

DevOps · Linux · Monitoring

0 likes · 17 min read

Architecture Digest

Aug 28, 2025 · Operations

Step‑by‑Step Guide to Building a Full Grafana‑Prometheus Monitoring System with Alerts

This tutorial walks you through installing and configuring Grafana and Prometheus, adding exporters for system metrics, MySQL, RabbitMQ, Redis and TiDB, setting up dashboards, creating alert rules, and using Grafana's HTTP API for automation, providing a complete end‑to‑end monitoring solution.

Grafana · Monitoring · Prometheus

0 likes · 24 min read

Raymond Ops

Aug 28, 2025 · Operations

Step-by-Step Guide to Install, Configure, and Use Prometheus for Monitoring

This tutorial walks you through downloading Prometheus, setting up self‑monitoring, starting the server, opening firewall ports, exploring the built‑in UI, adding Node Exporter targets, configuring scrape jobs, creating recording rules, and visualizing metrics with queries and graphs.

Monitoring · Prometheus · Recording Rules

0 likes · 10 min read

Architect

Aug 27, 2025 · Operations

Build a Full Grafana‑Prometheus Monitoring Stack for MySQL, RabbitMQ, Redis & TiDB

This guide walks you through installing and configuring Prometheus and Grafana, comparing Prometheus with Zabbix, adding exporters for system metrics, MySQL, RabbitMQ, Redis and TiDB, setting up dashboards, plugins, and email alerts to create a comprehensive monitoring solution.

Grafana · Monitoring · MySQL

0 likes · 27 min read

Linux Ops Smart Journey

Aug 26, 2025 · Operations

Why the Grafana Table Panel Is the Ultimate Tool for Precise Monitoring

This article explains how the Grafana Table panel serves as a versatile, data‑driven Swiss‑army‑knife for deep troubleshooting, covering its advantages, typical use cases, step‑by‑step configuration, PromQL queries, JSON panel definition, and visual customization tips.

Grafana · Monitoring · PromQL

0 likes · 7 min read

MaGe Linux Operations

Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

Automation · Incident Response · Monitoring

0 likes · 19 min read

MaGe Linux Operations

Aug 21, 2025 · Operations

Master Docker Volume Management: From Basics to Enterprise‑Grade Persistence & Backup

This comprehensive guide walks you through Docker storage challenges, explains temporary, bind‑mount and named volumes, presents tiered storage architectures and dynamic scripts, and provides production‑grade backup, monitoring, and performance‑tuning strategies to ensure reliable data persistence in containerized environments.

Monitoring · backup · ops

0 likes · 13 min read

Linux Ops Smart Journey

Aug 20, 2025 · Operations

How to Turn Abstract Metrics into Intuitive Gauges with Grafana

This guide explains why Grafana's Gauge panel creates a powerful visual metaphor for system pressure, walks through creating the gauge, configuring PromQL queries, setting panel options, thresholds, and JSON definitions, and shows how to produce clear, boss‑friendly monitoring dashboards.

Gauge panel · Grafana · JSON configuration

0 likes · 5 min read

Tech Freedom Circle

Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Monitoring · capacity planning

0 likes · 34 min read

macrozheng

Aug 20, 2025 · Operations

Master Server Monitoring with Checkmate: Install, Docker Setup & Real‑Time Insights

This guide introduces Checkmate, a modern open‑source monitoring platform, and walks you through its key features, Docker‑based installation, and step‑by‑step usage for website, server, Docker container, and hardware monitoring, plus theme customization.

Monitoring · Operations · Server

0 likes · 7 min read

Wukong Talks Architecture

Aug 19, 2025 · Backend Development

From Monolith to Microservices: A Real‑World Online Supermarket Migration Story

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully‑featured microservice architecture, highlighting the challenges, design decisions, component choices, monitoring, tracing, testing, and the trade‑offs of service mesh versus custom frameworks.

Deployment · Monitoring · architecture

0 likes · 22 min read

MaGe Linux Operations

Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka · Monitoring · Operations

0 likes · 16 min read

Ops Community

Aug 19, 2025 · Information Security

Master Linux Security: Advanced firewalld Rules & SELinux Context Management

This guide walks you through hardening Linux servers by using firewalld's zone‑based advanced rules, rich rules, and IPSET collections, combined with precise SELinux context management, practical scripts, troubleshooting tips, and production‑grade best practices to build a multi‑layered defense.

Automation · Linux · Monitoring

0 likes · 11 min read

Linux Ops Smart Journey

Aug 19, 2025 · Operations

Mastering Grafana Pie Charts: When and How to Use Them Effectively

Learn when to choose a Pie Chart in Grafana, explore common use cases like browser market share and HTTP status codes, and follow step‑by‑step instructions—including panel options, legend, tooltip, and JSON configuration—to create clear, proportion‑focused visualizations.

Grafana · Monitoring · PromQL

0 likes · 5 min read

Cognitive Technology Team

Aug 19, 2025 · Operations

How Bilibili Scaled Server Fault Management with Automated Detection and Repair

This article details Bilibili's evolving server fault management architecture, covering fault classification, the shortcomings of manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerts, and end‑to‑end repair automation.

Monitoring · Operations · in‑band collection

0 likes · 18 min read

360 Zhihui Cloud Developer

Aug 19, 2025 · Big Data

How to Accurately Size Kafka Clusters: Real‑World Disk I/O Tests and Capacity Planning

This article shares 360 Group's systematic Kafka capacity‑planning methodology, covering hardware performance analysis, disk I/O benchmarking, cluster configuration, load‑testing procedures, observed write‑read dynamics, and practical recommendations for reliable Kafka deployments.

Kafka · Monitoring · big-data

0 likes · 11 min read

Mike Chen's Internet Architecture

Aug 16, 2025 · Big Data

Mastering ELK: A Complete Guide to Elasticsearch, Logstash, and Kibana

This article introduces the ELK stack—Elasticsearch, Logstash, and Kibana—explaining each component, their roles in large‑scale log processing, and the step‑by‑step workflow for collecting, storing, and visualizing log data in modern big‑data environments.

Big Data · ELK · Elasticsearch

0 likes · 4 min read

Linux Ops Smart Journey

Aug 14, 2025 · Operations

Master Grafana Time Series Panel: From Basics to Advanced Configuration

This guide explains why Grafana’s Time Series panel is essential for proactive monitoring, walks through browser selection, PromQL queries, panel options such as titles, tooltips, legends, axes, graph styles, and provides a ready‑to‑use JSON configuration to visualize trends and detect anomalies.

Grafana · Monitoring · Operations

0 likes · 8 min read

iQIYI Technical Product Team

Aug 14, 2025 · Operations

How Automated Inspection Boosts System Reliability and Prevents Decay

This article explains how a systematic, automated inspection platform can proactively identify hidden risks, avoid system decay, enforce unified standards, and enhance stability, security, and operational efficiency for high‑availability applications and middleware.

Monitoring · Operations Automation · aiops

0 likes · 9 min read

Linux Ops Smart Journey

Aug 12, 2025 · Operations

How to Add Interactive Variables to Grafana Dashboards for Dynamic Monitoring

This guide explains what Grafana variables are, why they act like a dashboard control knob, and provides step‑by‑step instructions with screenshots and JSON examples for creating data‑source, business‑tag, and JSON‑file variables to build interactive monitoring dashboards.

Grafana · Monitoring · Operations

0 likes · 6 min read

DevOps Operations Practice

Aug 11, 2025 · Operations

Zen Master’s Secrets to the Ultimate State of Operations

Through a series of dialogues with a Zen master, the article humorously explores the highest level of operations—automation that runs itself, balanced alerting, cloud migration, reliable backups, high‑availability, stability through chaos engineering, and the ultimate goal of making systems operate without human intervention.

Automation · Monitoring · Operations

0 likes · 5 min read

Liangxu Linux

Aug 10, 2025 · Databases

Master MySQL Backup & Recovery: Complete Guide for Reliable Data Protection

This comprehensive guide explains MySQL data backup and recovery strategies, covering backup types, planning principles, built‑in tools like mysqldump and mysqlpump, third‑party solutions such as Percona XtraBackup, scripting for automated schedules, storage options, encryption, monitoring, troubleshooting, and best‑practice recommendations to ensure data safety and business continuity.

Automation · Database · Monitoring

0 likes · 22 min read

Sohu Smart Platform Tech Team

Aug 9, 2025 · Backend Development

Diagnosing Java Performance Bottlenecks with Skywalking, Arthas and Java Agents

This article explains how Java developers can locate and resolve performance issues by using Skywalking and Arthas together, covering class loading mechanisms, Java Agent instrumentation, bytecode manipulation techniques, and practical command examples for monitoring, tracing, and hot‑spot analysis.

Arthas · Java · Java Agent

0 likes · 16 min read

MaGe Linux Operations

Aug 7, 2025 · Cloud Native

Mastering Kubernetes Networking: Choose the Right CNI Plugin and Boost Performance

This comprehensive guide walks you through Kubernetes' network model, explains why networking is its biggest pain point, compares major CNI plugins with real‑world performance data, and provides a step‑by‑step decision framework, tuning tips, troubleshooting methods, and monitoring best practices for production environments.

CNI · Calico · Cilium

0 likes · 24 min read

Volcano Engine Developer Services

Aug 7, 2025 · Operations

How to Collect and Analyze JuiceFS Access Logs with Volcengine TLS

This article explains how to gather JuiceFS access logs using the LogCollector agent, parse and structure them with TLS, design index fields, build analytical dashboards, run advanced SQL queries for write‑IO distribution, sequential‑read ratios, overwrite detection, file‑lifecycle analysis, and set up real‑time monitoring and alerting for performance anomalies.

JuiceFS · LogCollector · Monitoring

0 likes · 22 min read

Sohu Smart Platform Tech Team

Aug 7, 2025 · Backend Development

Boost Nginx Performance: Practical OpenResty Guide for Blacklists, Rate Limiting, A/B Testing & Monitoring

This article presents a hands‑on guide to using OpenResty—Lua‑enhanced Nginx—for implementing static and dynamic blacklists, fine‑grained rate limiting, A/B testing via upstream selection, and real‑time service quality monitoring, all with production‑ready code examples.

A/B testing · Blacklist · Lua

0 likes · 21 min read

DevOps Operations Practice

Aug 7, 2025 · Operations

Mastering Operations: Tools, Processes, and Architecture for Top‑Notch SRE

This article outlines how proactive monitoring, automation, disciplined processes, robust architecture, and chaos engineering empower operations engineers to prevent failures, manage changes, ensure reliable backups, and build self‑healing systems that balance stability, innovation, cost, and human decision‑making.

Automation · Monitoring · Operations

0 likes · 5 min read

dbaplus Community

Aug 5, 2025 · Backend Development

10 Logging Best Practices to Diagnose Production Issues Efficiently

This article presents ten practical rules for writing high‑quality logs—covering format consistency, stack traces, log levels, parameter completeness, asynchronous handling, traceability, dynamic configuration, structured storage, and intelligent monitoring—to help engineers quickly pinpoint problems in high‑traffic systems.

Logging · Monitoring · logback

0 likes · 9 min read

JakartaEE China Community

Aug 5, 2025 · Operations

How to Monitor Java Virtual Threads Effectively

This article explains the internal mechanics of Java virtual threads, the role of Continuation, pinned threads, and carrier threads, and provides concrete monitoring techniques using JVM flags, JFR events, and framework-specific considerations for Helidon and Quarkus.

ForkJoinPool · Helidon · JFR

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Aug 4, 2025 · Operations

Demystifying Linux Load: Calculation, Tools, and Advanced Monitoring

This article thoroughly explains the Linux load average concept, its kernel-level calculation, how to dissect load values using tools like load2process and load2pid, introduces the load5s kernel module for finer-grained monitoring, and provides scripts and techniques for effective load analysis and troubleshooting.

Linux · Load Average · Monitoring

0 likes · 20 min read

MaGe Linux Operations

Jul 28, 2025 · Information Security

How to Detect and Respond to Server Intrusions: A Complete 24‑Hour Incident Response Guide

This guide walks operations and security engineers through recognizing intrusion signs, executing a step‑by‑step 24‑hour response, collecting forensic evidence, cleaning and hardening the system, and building proactive monitoring to protect servers from future attacks.

Automation · Forensics · Incident Response

0 likes · 16 min read

Architecture Breakthrough

Jul 28, 2025 · Operations

Turn Point Fixes into Systemic Solutions: A Practical Optimization Framework

Effective technical optimization requires moving from isolated, point‑style ideas to a comprehensive, measurable framework that quantifies goals, assesses gaps, designs capacity, monitors key services and links, and establishes clear compensation and incident‑handling procedures, ensuring a complete, closed‑loop solution.

Monitoring · Operations · capacity planning

0 likes · 8 min read

MaGe Linux Operations

Jul 25, 2025 · Operations

5 Game‑Changing One‑Liner Shell Commands Every Ops Engineer Must Know

This article shares five battle‑tested one‑line Shell commands that instantly diagnose server health, analyze logs, rank process resources, troubleshoot network connections, and clean disk space, plus practical tips and mindset advice to help operations engineers solve critical incidents faster and more reliably.

Linux · Monitoring · One-liner

0 likes · 10 min read

Open Source Linux

Jul 25, 2025 · Operations

Why Does My Container Show 900% CPU? Uncovering JVM and Cgroup Mismatches

An experienced ops engineer investigates a night‑time Grafana alert showing 900% CPU usage, discovers a mismatch between JVM‑detected cores and container limits, explains the root cause, and presents a three‑step solution with code snippets, monitoring tweaks, and performance results.

CPU · JVM · Kubernetes

0 likes · 9 min read

dbaplus Community

Jul 24, 2025 · Operations

How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details Bilibili's approach to handling explosive growth in server count by classifying faults, identifying shortcomings of manual processes, and implementing an automated, end‑to‑end detection, rule‑based alerting, and repair workflow that combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.

Data Center · Monitoring · fault detection

0 likes · 17 min read

MaGe Linux Operations

Jul 24, 2025 · Operations

Mastering Production Backup Architecture: A Proven 3‑2‑1 Disaster Recovery Blueprint

This article presents a production‑validated, multi‑layer website backup architecture—including code, database, and file storage strategies, automation scripts, monitoring dashboards, performance tuning, and AI‑driven optimization—to ensure rapid recovery, cost efficiency, and business continuity.

Automation · Monitoring · backup

0 likes · 14 min read

Ops Community

Jul 24, 2025 · Operations

How a Small E‑commerce Site Scaled to 10 Million Daily Visits: Real‑World Architecture Lessons

This article details a small‑to‑mid‑size e‑commerce platform’s journey from a few thousand daily page views to ten million, covering business challenges, three architecture evolution stages, key technical solutions, performance optimizations, cost‑control strategies, and practical automation tips.

Monitoring · Operations · Performance Optimization

0 likes · 14 min read

Ops Community

Jul 23, 2025 · Operations

Why Did My JVM Show 900% CPU? Uncovering Container Limit Misconfigurations

An 8‑year ops veteran investigates a night‑time alert showing 900% CPU usage, discovers that a JVM inside a Kubernetes pod misreads host cores while the container is limited to two CPUs, and outlines how improper thread‑pool settings and monitoring metrics caused massive throttling before presenting concrete fixes.

CPU throttling · JVM · Kubernetes

0 likes · 10 min read

MaGe Linux Operations

Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Kubernetes · Monitoring

0 likes · 12 min read

Tech Freedom Circle

Jul 22, 2025 · Backend Development

How I Resolved an 8‑Million‑Message MQ Backlog at 2 AM: A Proven Generic Solution

At 2 AM an alert triggered when a RocketMQ queue surged from 500 K to 10 M messages, causing severe latency; the article walks through root‑cause analysis, a five‑step emergency fix, long‑term architectural upgrades, monitoring, and scripts to reliably eliminate such MQ backlogs.

Backlog · Message Queue · Monitoring

0 likes · 26 min read

High Availability Architecture

Jul 22, 2025 · Operations

How We Automated Server Fault Detection and Repair at Scale

This article explains the challenges of managing rapidly growing server fleets, outlines a systematic classification of hardware and software faults, and details an end‑to‑end automated solution that combines in‑band and out‑of‑band data collection, rule‑based detection, and fully automated repair workflows to improve fault coverage, accuracy, and recovery speed.

Monitoring · Operations · hardware detection

0 likes · 16 min read

Architect's Guide

Jul 21, 2025 · Operations

How to Achieve Five Nines: Practical High‑Availability Strategies for Modern Web Systems

This article explains key high‑availability concepts such as availability metrics, microservice modularization, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call processes, providing concrete design guidelines for building resilient internet services.

Circuit Breaking · Monitoring · Rate Limiting

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Jul 21, 2025 · Operations

Create an AI Ops Assistant Using Elasticsearch for Real‑Time Monitoring & NL Queries

This guide explains how to build an AI‑powered operations assistant with Elasticsearch that provides real‑time monitoring, natural‑language query translation, end‑to‑end automation, and lower technical barriers, covering architecture, one‑click deployment, validation steps, and resource cleanup.

AI Ops · Elasticsearch · Monitoring

0 likes · 7 min read

Code Mala Tang

Jul 18, 2025 · Backend Development

Unlock Lightning-Fast Node.js: 8 Proven Backend Performance Hacks

Discover why a sluggish API hurts user retention, SEO, and costs, and learn eight practical Node.js backend optimization techniques—including mastering the event loop, avoiding blocking code, leveraging async/await, offloading heavy tasks, efficient JSON handling, caching strategies, database tuning, clustering, and continuous monitoring—to boost performance and scalability.

Backend Performance · Caching · Monitoring

0 likes · 8 min read

Ops Development & AI Practice

Jul 18, 2025 · Operations

Mastering Modern Software Operations: The Six Essential Steps for Success

Modern software operations have shifted from a post‑launch checklist to an ongoing, automated discipline, and this article outlines the six core phases—requirement planning, CI/CD automation, comprehensive monitoring, incident response, performance tuning, and security compliance—providing concrete examples and practical advice for building a resilient DevOps culture.

DevOps · Incident Management · Monitoring

0 likes · 9 min read

MaGe Linux Operations

Jul 17, 2025 · Operations

Master Network Device Ops: Switches, Routers, and Firewalls Deep Dive

This comprehensive guide walks network engineers through the fundamentals and advanced techniques for operating switches, routers, and firewalls, covering configuration, performance monitoring, troubleshooting, automation, security hardening, and emerging trends like SDN and AI-driven operations.

Automation · Monitoring · Switch Configuration

0 likes · 26 min read

Efficient Ops

Jul 14, 2025 · Operations

Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide

After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, optimized database queries, containerized the service, and set up Prometheus‑Grafana alerts to prevent future incidents.

CPU troubleshooting · Docker · Java performance

0 likes · 7 min read

Efficient Ops

Jul 13, 2025 · Operations

Mastering Modern System Operations: 6 Essential Strategies for Stability and Efficiency

This comprehensive guide outlines six critical areas of modern system operations—including real‑time monitoring, security safeguards, automation, fault diagnosis, collaborative teamwork, and process optimization—offering practical strategies and tools such as Zabbix, Prometheus, ELK, Redis, Ansible, and capacity planning to ensure stable, efficient enterprise services.

Automation · Monitoring · Security

0 likes · 10 min read

MaGe Linux Operations

Jul 12, 2025 · Operations

Mastering EFK: The Complete Guide to Building a Scalable Log Management System

This comprehensive guide explains the EFK (Elasticsearch, Fluentd, Kibana) log management stack, covering its components, architecture, deployment steps, log collection strategies, index optimization, monitoring, security hardening, troubleshooting and best‑practice recommendations for building a reliable, scalable logging solution in modern cloud‑native environments.

Docker · EFK · Elasticsearch

0 likes · 17 min read

Code Ape Tech Column

Jul 11, 2025 · Operations

How to Monitor Spring Boot Applications with Prometheus and Grafana

This guide explains how to integrate Prometheus with Spring Boot using Actuator and Micrometer, configure Docker containers, set up Grafana for visualization, and create custom metrics, providing a complete monitoring solution for microservice applications.

Actuator · Grafana · Micrometer

0 likes · 9 min read

Linux Ops Smart Journey

Jul 10, 2025 · Operations

How to Monitor Libvirt with Prometheus, Nacos, and Grafana – A Step‑by‑Step Guide

This article walks you through deploying the libvirt‑exporter, registering it with Nacos for service discovery, exposing it to Prometheus, and adding a ready‑made Grafana dashboard, providing a complete monitoring solution for virtualized environments.

Grafana · Monitoring · Nacos

0 likes · 4 min read

Qunhe Technology Quality Tech

Jul 10, 2025 · Operations

Ensuring Elasticsearch Stability: Testing, Performance, and Disaster Recovery

This article outlines a comprehensive reliability framework for Elasticsearch, covering pre‑release performance evaluation, data accuracy checks, real‑time sync delay alerts, rapid recovery strategies, performance testing methods, and disaster‑recovery measures such as multi‑cluster backup and index alias switching.

Monitoring · data synchronization · disaster recovery

0 likes · 12 min read

Zhuanzhuan Tech

Jul 9, 2025 · Operations

How Apache HertzBeat Enables Agent‑Free Real‑Time Monitoring and Alerting

This guide introduces Apache HertzBeat, an open‑source real‑time monitoring and alerting platform that requires no agents, supports high‑performance clusters, offers customizable protocols, integrates with Grafana, provides plugin hot‑updates, and details its time‑wheel scheduling, cloud‑edge collaboration, and alert configuration.

Apache · Cluster · HertzBeat

0 likes · 22 min read

Java Architect Essentials

Jul 8, 2025 · Operations

Turn Noisy Alerts into Precise Signals: Dynamic Thresholds & AI‑Powered Monitoring with Spring Boot

This article shows how to replace static, error‑prone alert thresholds with dynamic baselines, root‑cause analysis chains, and AI‑driven predictions in a Spring Boot‑based monitoring stack, dramatically cutting false alarms and enabling proactive fault detection.

AI prediction · Alert Noise Reduction · Monitoring

0 likes · 9 min read

Linux Ops Smart Journey

Jul 8, 2025 · Operations

How to Build a Nacos‑Prometheus Adapter for Dynamic Service Discovery in Go

This article walks through the core code of a Nacos‑Prometheus adapter, explaining how it connects to Nacos, retrieves service and instance data, formats it into Prometheus http_sd JSON, and serves it via an HTTP endpoint, enabling dynamic service discovery for monitoring.

Go · Monitoring · Nacos

0 likes · 6 min read

Ops Community

Jul 6, 2025 · Operations

Master KVM Production Deployment: Real-World Ops Guide & Automation Scripts

This comprehensive guide walks you through KVM virtualization platform deployment in production, covering host preparation, VM creation, advanced networking, storage pool management, performance tuning, monitoring, and automated operational scripts to build a stable and efficient virtualized environment.

Deployment · KVM · Linux

0 likes · 37 min read

Liangxu Linux

Jul 5, 2025 · Operations

Step‑by‑Step Guide to Installing and Configuring Nagios on CentOS 7

This tutorial walks through preparing a CentOS 7 virtual machine, configuring networking, setting up required packages, compiling and installing Nagios Core, adding the Nagios user and Apache integration, configuring the firewall, and finally installing and enabling Nagios plugins for full monitoring capabilities.

Installation · Monitoring · Nagios

0 likes · 8 min read
