Tagged articles
2194 articles
ITPUB
Oct 3, 2025 · Big Data

How Qunar Travel Cut 2000 CPU Cores by Optimizing Kafka Production

This case study details how Qunar Travel's engineering team analyzed Kafka production bottlenecks during peak traffic, added targeted monitoring, tuned thread and batch parameters, and validated the changes through gray‑scale tests, ultimately saving about 2000 CPU cores across three clusters while reducing request volume and improving network and disk utilization.

Big Data · CPU Savings · Kafka
0 likes · 14 min read
Ops Community
Oct 2, 2025 · Operations

How to Fix Nginx 502 Bad Gateway Errors: A 90% Success Checklist

This article provides a comprehensive, step‑by‑step checklist for diagnosing and resolving Nginx 502 Bad Gateway errors, covering backend service verification, configuration checks, log analysis, resource monitoring, network troubleshooting, special scenarios, and long‑term preventive measures.

502 · Bad Gateway · Monitoring
0 likes · 25 min read
MaGe Linux Operations
Oct 1, 2025 · Operations

How Automated Ops Cut Service Restarts by 80% and Save Hours Daily

Discover a comprehensive automated operations framework that eliminates manual service restarts, reduces repetitive tasks by 80%, accelerates fault recovery from minutes to seconds, and boosts reliability through health checks, Kubernetes self‑healing, Systemd scripts, monitoring, and scalable deployment strategies.

Automation · Monitoring · Operations
0 likes · 37 min read
MaGe Linux Operations
Sep 30, 2025 · Cloud Native

How I Cut Kubernetes Troubleshooting Time from 30 Minutes to 3 Minutes

This article presents a complete, step‑by‑step method for reducing average Kubernetes fault‑diagnosis time from half an hour to under three minutes, covering the root causes of slow manual debugging, a one‑click diagnostic script, efficient kubectl shortcuts, visual tools, log aggregation, automated response workflows, and real‑world case studies.

Automation · DevOps · Monitoring
0 likes · 50 min read
Ops Community
Sep 29, 2025 · Cloud Native

Enterprise Docker Deployment: From Zero to Production – A Complete Guide

This comprehensive guide walks through the evolution of container technology, explains Docker's core mechanisms, and presents enterprise‑grade architecture, deployment strategies, monitoring, security hardening, and real‑world case studies, helping ops engineers build efficient, scalable, and secure production‑ready Docker environments.

Containerization · Docker · Enterprise Deployment
0 likes · 19 min read
Tech Freedom Circle
Sep 28, 2025 · Backend Development

Midnight TODO That Nearly Crashed the Whole Department: A JVM Performance Tuning Case Study

During a midnight promotion launch, a forgotten TODO caused thread‑pool exhaustion and frequent Full GC, bringing down an e‑commerce service; the article presents a five‑step end‑to‑end JVM tuning methodology, from data collection to root‑cause verification and code fix, showing how to diagnose and resolve such incidents.

Full GC · Heap Dump · JVM
0 likes · 24 min read
Architecture Breakthrough
Sep 28, 2025 · Operations

How to Build an Organizational High‑Availability Mechanism for Banking IT Production Issues

This article outlines a comprehensive, step‑by‑step framework for establishing a high‑availability system in large‑scale banking IT, covering goal definition, logical architecture, service classification, key activity identification, capability upgrades, monitoring, emergency‑response asset creation, technical debt tracking, and periodic post‑mortem redesign.

Monitoring · Operations · Process Design
0 likes · 10 min read
Ray's Galactic Tech
Sep 26, 2025 · Operations

Master Spring Boot Admin: Real‑Time Monitoring for Microservices

Spring Boot Admin is an open‑source tool that provides real‑time health checks, JVM metrics, log management, environment inspection, JMX control, and customizable alerts for Spring Boot applications, and this guide explains its core features, architecture, quick setup, advanced security, notification, Actuator integration, and production best practices.

Admin · Java · Monitoring
0 likes · 7 min read
Ray's Galactic Tech
Sep 26, 2025 · Cloud Native

How to Deploy Production-Ready Spring Boot Apps on Kubernetes (V2 Guide)

Learn step-by-step how to prepare, containerize, and securely deploy a Spring Boot application on Kubernetes, covering health checks, metrics, logging, JVM tuning, multi-stage Docker builds, Helm-like resources, ConfigMaps, Secrets, Ingress, HPA, monitoring, CI/CD pipelines, and rollback strategies for production-grade reliability.

Docker · Kubernetes · Monitoring
0 likes · 9 min read
DevOps Operations Practice
Sep 24, 2025 · Cloud Native

How to Seamlessly Transition from Traditional Ops to Cloud Native: A Practical Guide

This article outlines the fundamental differences between traditional operations and cloud‑native practices, presents a four‑step migration strategy—including containerization, Kubernetes adoption, monitoring overhaul, and cultural shift—and highlights common pitfalls and measurable outcomes for a successful digital transformation.

Containerization · Monitoring · digital transformation
0 likes · 7 min read
Ops Community
Sep 24, 2025 · Operations

How Ops Engineers Can Stop Online Outages in Minutes: A Proven Emergency Playbook

This article outlines why a solid incident‑response plan is critical, describes typical failure scenarios, introduces the 3‑5‑10 rule for rapid diagnosis and mitigation, provides ready‑to‑run scripts for system checks, traffic throttling, service rollback, and showcases automation, AIOps and chaos‑engineering techniques to turn reactive firefighting into proactive resilience.

Incident Response · Monitoring · aiops
0 likes · 18 min read
MaGe Linux Operations
Sep 24, 2025 · Operations

How a 3 AM MySQL Crash Taught Me Essential Ops Lessons

This article recounts a 3 AM MySQL outage, analyzes its root causes, and shares comprehensive operational strategies—including index optimization, connection‑pool tuning, slow‑query fixing, replication lag handling, monitoring metrics, automation scripts, performance tuning, security hardening, and future trends—to help DBAs prevent and resolve similar incidents.

Automation · Database operations · Monitoring
0 likes · 15 min read
macrozheng
Sep 23, 2025 · Operations

How a Visual Bash Script Can Simplify SpringBoot Service Management and Deployment

Manual start‑stop, unclear status, scattered logs and risky rollbacks make SpringBoot production deployments error‑prone, while a visual, configuration‑driven Bash manager provides an intuitive UI, real‑time monitoring, intelligent start/stop, batch operations and automated deployment to dramatically improve efficiency and reliability.

Bash script · Deployment Automation · Monitoring
0 likes · 22 min read
Java One
Sep 21, 2025 · Operations

Mastering Prometheus rate, irate, and increase: When and How to Use Each

This article explains how Prometheus’s rate, irate, and increase functions calculate counter growth rates, handle counter resets, and differ in smoothing and responsiveness, guiding you to choose the appropriate function for monitoring request rates, CPU usage, and other metrics.

Metrics · Monitoring · Prometheus
0 likes · 7 min read
IT Architects Alliance
Sep 20, 2025 · Operations

Mastering Microservice Governance: Tracing, Config, and Monitoring Strategies

This article explores the three core challenges of microservice governance—distributed tracing, centralized configuration management, and comprehensive monitoring—offering practical solutions, tool comparisons, and best‑practice guidelines to help architects build reliable, observable, and maintainable systems.

Cloud Native · Configuration Management · Monitoring
0 likes · 12 min read
Ops Community
Sep 19, 2025 · Operations

From Midnight Outage to Zero Downtime: Mastering NFS High‑Availability

This article recounts a critical NFS failure that caused massive loss, then walks through practical high‑availability designs—including Keepalived + DRBD, GlusterFS migration, and cloud‑native CSI storage—while sharing real‑world pitfalls, monitoring strategies, and forward‑looking recommendations for resilient file‑system operations.

Distributed File System · Monitoring · NFS
0 likes · 12 min read
MaGe Linux Operations
Sep 17, 2025 · Operations

Unlock 5 CI/CD Ops Secrets to Triple Deployment Speed

This comprehensive guide reveals essential CI/CD operational techniques—from pipeline bottleneck detection and Docker multi‑stage builds to parallel execution, smart testing, blue‑green and canary deployments, full‑stack monitoring, cost‑saving cloud strategies, and a real‑world e‑commerce case study—helping teams dramatically boost efficiency, reliability, and security.

Automation · Docker · Kubernetes
0 likes · 46 min read
Linux Tech Enthusiast
Sep 16, 2025 · Operations

A Comprehensive Guide to Linux Performance Optimization

This article provides an in‑depth, step‑by‑step walkthrough of Linux performance optimization, covering key metrics such as throughput and latency, how to interpret average load, CPU and memory usage, context‑switch analysis, common bottlenecks, and the most effective tools (vmstat, pidstat, perf, strace, dstat, etc.) with concrete command examples and real‑world case studies to help you diagnose and resolve performance issues.

Monitoring · Optimization · performance
0 likes · 36 min read
DevOps Coach
Sep 15, 2025 · Operations

10 Underrated Linux Tools Every Sysadmin Should Master

This guide presents ten lesser‑known but powerful Linux utilities—such as at, systemd‑run, tuned, lsof/ss, journalctl, chattr, MOTD/issue, watch/diff, strace/ltrace, and hidden cron checks—each with practical examples to boost daily sysadmin efficiency and confidence.

Automation · Linux · Monitoring
0 likes · 7 min read
Raymond Ops
Sep 14, 2025 · Operations

Mastering Concurrency: Optimize Nginx, HAProxy & Keepalived for High‑Performance Servers

This article explains the fundamentals of concurrency, distinguishes connections from requests, shows how to calculate and tune maximum concurrent connections for Nginx and HAProxy, covers system resource limits, demonstrates real‑time monitoring with stub_status, and provides practical load‑testing and Prometheus monitoring guidance.

AB testing · Concurrency · HAProxy
0 likes · 15 min read
Ops Community
Sep 14, 2025 · Operations

Boost Linux Ops 10×: Master Systemd Service Management from Beginner to Pro

This comprehensive guide walks you through Systemd fundamentals, core architecture, unit types, practical service creation, socket activation, timer units, performance tuning, resource control, security hardening, debugging, and production best practices, empowering Linux administrators to dramatically improve service management efficiency and reliability.

Monitoring · Service Management · Systemd
0 likes · 28 min read
Java Tech Enthusiast
Sep 14, 2025 · Operations

How to Use Java Agent for Non‑Intrusive SpringBoot Monitoring

Learn how to implement a Java Agent that enables non‑intrusive monitoring of SpringBoot applications, covering agent basics, bytecode manipulation with Byte Buddy, metric collection via Micrometer, Prometheus/Grafana integration, and advanced extensions such as JVM metrics, HTTP client tracing, and distributed tracing.

Micrometer · Monitoring · Prometheus
0 likes · 16 min read
Architect
Sep 10, 2025 · Operations

Building System Stability: A Backend Engineer’s Guide to Risk Management

This article explores system stability from a backend perspective, defining its academic and engineering meanings, quantifying metrics like SLA, MTBF and MTTR, analyzing why stability matters, outlining the challenges faced, and presenting practical steps—including resource consensus, goal setting, awareness cultivation, production standards, monitoring, emergency response, and regular inspections—to effectively build and maintain stable systems.

Monitoring · Operations · risk management
0 likes · 25 min read
NiuNiu MaTe
Sep 10, 2025 · Backend Development

How to Quickly Resolve Message Queue Backlog and Keep Your System Stable

This article explains what message queue backlog is, why it harms system latency, and provides practical, step‑by‑step strategies—including temporary consumer scaling, prioritizing core messages, queue splitting, root‑cause analysis, performance tuning, message design, dead‑letter handling, traffic control, capacity planning, and monitoring—to eliminate backlog and ensure reliable asynchronous processing.

Backlog · Message Queue · Monitoring
0 likes · 21 min read
Ops Community
Sep 8, 2025 · Operations

Mastering Distributed Log Architecture: From Flume to ELK and Beyond

This comprehensive guide walks you through the challenges of large‑scale log collection, real‑time processing, storage optimization, and visualization, detailing practical configurations for Flume, Logstash, Elasticsearch, Kibana, Filebeat, Kafka, Kubernetes, and future AIOps integrations to build a reliable, cost‑effective distributed logging system.

ELK · Flume · Kafka
0 likes · 24 min read
MaGe Linux Operations
Sep 7, 2025 · Databases

Master MySQL Slow Query Analysis: Proven SQL Optimization Techniques to Boost Performance

This comprehensive guide walks you through diagnosing MySQL slow queries, from identifying root causes and configuring slow‑query logs to applying advanced indexing, query‑rewriting, and monitoring techniques—complete with real‑world case studies that demonstrate how to cut query times from seconds to milliseconds.

Monitoring · MySQL · SQL optimization
0 likes · 28 min read
Architect
Sep 6, 2025 · Operations

Master High-Concurrency Nginx: Core Configs, Advanced Tuning, and Real-World Checklist

This guide walks you through the common high‑traffic pain points of Nginx, explains why configuration and tuning matter more than hardware, and provides step‑by‑step core, advanced, OS‑level, monitoring, and troubleshooting configurations to reliably handle tens of thousands of concurrent connections.

Linux · Monitoring · Performance tuning
0 likes · 11 min read
Ops Community
Sep 4, 2025 · Databases

Avoid Redis Nightmares: Proven Deployment and Optimization Guide

This comprehensive guide walks you through Redis production deployment, persistence strategies, performance tuning, security hardening, real‑world case studies, and failure recovery, helping you prevent common pitfalls and keep your cache layer reliable and fast.

Monitoring · Optimization · Persistence
0 likes · 21 min read
dbaplus Community
Sep 3, 2025 · Operations

How to Build System Stability: Definitions, Challenges, and Practical Steps

This article explains what system stability means, why it matters, the difficulties of building it, and provides a detailed, step‑by‑step framework—including risk formulas, resource planning, monitoring, and emergency response—to help backend teams improve reliability and reduce business impact.

Incident Response · Monitoring · risk management
0 likes · 23 min read
ITPUB
Sep 3, 2025 · Backend Development

How We Boosted Kafka Throughput by 35% with Filebeat Tuning and Compression Tricks

This case study details how a high‑traffic Kafka logging cluster was optimized by analyzing low compression ratios, tuning Filebeat parameters, adjusting memory queues and round‑robin settings, and validating the changes through gray‑scale tests, resulting in up to 35% higher throughput and significant resource savings.

Filebeat · Kafka · Monitoring
0 likes · 10 min read
dbaplus Community
Sep 1, 2025 · Operations

How to Keep VictoriaMetrics Stable During Sudden Metric Surges

This article outlines practical strategies for protecting VictoriaMetrics storage under bursty metric traffic, covering communication with business teams, splitting deployments, choosing single‑node versus cluster setups, key monitoring metrics, separate storage for self‑monitoring, the VMUI Explore UI, and techniques for discarding high‑cardinality metrics.

Metrics · Monitoring · VictoriaMetrics
0 likes · 10 min read
Java Architect Essentials
Aug 31, 2025 · Backend Development

How Global Exception Handling Can Slash Crash Rates by 90% in Java Services

This article explains why uncaught exceptions can cripple a Java backend, demonstrates a three‑layer global exception handling strategy with Spring Boot, shows how circuit‑breaker rules further protect services, and provides real‑world data proving crash rates can drop from over 4% to under 0.1%.

Backend Development · Circuit Breaker · Exception Handling
0 likes · 8 min read
Mingyi World Elasticsearch
Aug 30, 2025 · Operations

INFINI Console FAQ: Enterprise‑Grade Unified Elasticsearch Management

The article introduces INFINI Console, an open‑source, lightweight platform for unified, multi‑cluster and cross‑version Elasticsearch governance, compares it with Kibana, details deployment options, enterprise‑level features such as monitoring, alerting and security, and analyzes cost advantages and practical migration scenarios.

Cluster Management · Elasticsearch · INFINI Console
0 likes · 13 min read
Ops Community
Aug 30, 2025 · Information Security

Master Linux Server Hardening: From Manual Steps to Automated Scripts

This comprehensive guide walks you through Linux server security hardening, covering real-world incident analysis, a detailed checklist of system, SSH, firewall, kernel and logging configurations, plus ready-to-use Bash scripts, Ansible playbooks, Docker hardening, monitoring tools, and actionable steps to build an enterprise‑grade defense.

Docker · Hardening · Linux
0 likes · 17 min read
Code Mala Tang
Aug 30, 2025 · Backend Development

How to Log API Requests Without Slowing Down Your Server

Effective API logging is essential for debugging and compliance, but naive synchronous logging can block the event loop, exhaust disk I/O, and degrade performance; this guide explains why, and provides ten practical steps—including asynchronous loggers, buffering, offloading, sensitive data masking, and monitoring—to keep your server fast and reliable.

API logging · Asynchronous · Log Management
0 likes · 15 min read
MaGe Linux Operations
Aug 29, 2025 · Operations

How to Supercharge Nginx for Millions of QPS: A Complete Guide

Discover proven strategies to optimize Nginx under extreme traffic, covering benchmark testing, kernel tuning, configuration tweaks, caching, load balancing, SSL hardening, monitoring, and real-world case studies that demonstrate how to achieve stable high‑QPS performance while minimizing latency and resource usage.

Monitoring · Optimization · high-concurrency
0 likes · 22 min read
ITPUB
Aug 29, 2025 · Operations

Why Operations Engineers Are Anything But Low‑Skill: A Deep Dive into Their Real Technical Challenges

The article debunks the myth that operations work is low‑skill by detailing the extensive monitoring, Linux, networking, security, and firefighting expertise required, illustrating real‑world scenarios, tools, and best‑practice recommendations that highlight the critical, high‑level technical role of ops engineers.

DevOps · Linux · Monitoring
0 likes · 17 min read
Raymond Ops
Aug 28, 2025 · Operations

Step-by-Step Guide to Install, Configure, and Use Prometheus for Monitoring

This tutorial walks you through downloading Prometheus, setting up self‑monitoring, starting the server, opening firewall ports, exploring the built‑in UI, adding Node Exporter targets, configuring scrape jobs, creating recording rules, and visualizing metrics with queries and graphs.

Monitoring · Prometheus · Recording Rules
0 likes · 10 min read
MaGe Linux Operations
Aug 24, 2025 · Operations

Master Production Incident Troubleshooting: SEAL Methodology & Essential Ops Toolbox

This comprehensive guide shares a veteran ops engineer's real‑world troubleshooting mindset, the SEAL framework, a curated toolbox of monitoring, logging, performance, and network utilities, detailed case studies, incident‑response grading, automation scripts, and future‑ready AIOps practices for keeping production systems stable.

Automation · Incident Response · Monitoring
0 likes · 19 min read
MaGe Linux Operations
Aug 21, 2025 · Operations

Master Docker Volume Management: From Basics to Enterprise‑Grade Persistence & Backup

This comprehensive guide walks you through Docker storage challenges, explains temporary, bind‑mount and named volumes, presents tiered storage architectures and dynamic scripts, and provides production‑grade backup, monitoring, and performance‑tuning strategies to ensure reliable data persistence in containerized environments.

Monitoring · backup · ops
0 likes · 13 min read
Linux Ops Smart Journey
Aug 20, 2025 · Operations

How to Turn Abstract Metrics into Intuitive Gauges with Grafana

This guide explains why Grafana's Gauge panel creates a powerful visual metaphor for system pressure, walks through creating the gauge, configuring PromQL queries, setting panel options, thresholds, and JSON definitions, and shows how to produce clear, boss‑friendly monitoring dashboards.

Gauge panel · Grafana · JSON configuration
0 likes · 5 min read
Tech Freedom Circle
Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Monitoring · capacity planning
0 likes · 34 min read
Wukong Talks Architecture
Aug 19, 2025 · Backend Development

From Monolith to Microservices: A Real‑World Online Supermarket Migration Story

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully‑featured microservice architecture, highlighting the challenges, design decisions, component choices, monitoring, tracing, testing, and the trade‑offs of service mesh versus custom frameworks.

Deployment · Monitoring · architecture
0 likes · 22 min read
MaGe Linux Operations
Aug 19, 2025 · Big Data

Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.

Kafka · Monitoring · Operations
0 likes · 16 min read
Ops Community
Aug 19, 2025 · Information Security

Master Linux Security: Advanced firewalld Rules & SELinux Context Management

This guide walks you through hardening Linux servers by using firewalld's zone‑based advanced rules, rich rules, and IPSET collections, combined with precise SELinux context management, practical scripts, troubleshooting tips, and production‑grade best practices to build a multi‑layered defense.

Automation · Linux · Monitoring
0 likes · 11 min read
Cognitive Technology Team
Aug 19, 2025 · Operations

How Bilibili Scaled Server Fault Management with Automated Detection and Repair

This article details Bilibili's evolving server fault management architecture, covering fault classification, the shortcomings of manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerts, and end‑to‑end repair automation.

Monitoring · Operations · in‑band collection
0 likes · 18 min read
Linux Ops Smart Journey
Aug 14, 2025 · Operations

Master Grafana Time Series Panel: From Basics to Advanced Configuration

This guide explains why Grafana’s Time Series panel is essential for proactive monitoring, walks through browser selection, PromQL queries, panel options such as titles, tooltips, legends, axes, graph styles, and provides a ready‑to‑use JSON configuration to visualize trends and detect anomalies.

Grafana · Monitoring · Operations
0 likes · 8 min read
DevOps Operations Practice
Aug 11, 2025 · Operations

Zen Master’s Secrets to the Ultimate State of Operations

Through a series of dialogues with a Zen master, the article humorously explores the highest level of operations—automation that runs itself, balanced alerting, cloud migration, reliable backups, high‑availability, stability through chaos engineering, and the ultimate goal of making systems operate without human intervention.

Automation · Monitoring · Operations
0 likes · 5 min read
Liangxu Linux
Aug 10, 2025 · Databases

Master MySQL Backup & Recovery: Complete Guide for Reliable Data Protection

This comprehensive guide explains MySQL data backup and recovery strategies, covering backup types, planning principles, built‑in tools like mysqldump and mysqlpump, third‑party solutions such as Percona XtraBackup, scripting for automated schedules, storage options, encryption, monitoring, troubleshooting, and best‑practice recommendations to ensure data safety and business continuity.

Automation · Database · Monitoring
0 likes · 22 min read
Volcano Engine Developer Services
Aug 7, 2025 · Operations

How to Collect and Analyze JuiceFS Access Logs with Volcengine TLS

This article explains how to gather JuiceFS access logs using the LogCollector agent, parse and structure them with TLS, design index fields, build analytical dashboards, run advanced SQL queries for write‑IO distribution, sequential‑read ratios, overwrite detection, file‑lifecycle analysis, and set up real‑time monitoring and alerting for performance anomalies.

JuiceFS · LogCollector · Monitoring
0 likes · 22 min read
dbaplus Community
Aug 5, 2025 · Backend Development

10 Logging Best Practices to Diagnose Production Issues Efficiently

This article presents ten practical rules for writing high‑quality logs—covering format consistency, stack traces, log levels, parameter completeness, asynchronous handling, traceability, dynamic configuration, structured storage, and intelligent monitoring—to help engineers quickly pinpoint problems in high‑traffic systems.

Logging · Monitoring · logback
0 likes · 9 min read
JakartaEE China Community
Aug 5, 2025 · Operations

How to Monitor Java Virtual Threads Effectively

This article explains the internal mechanics of Java virtual threads, the role of Continuation, pinned threads, and carrier threads, and provides concrete monitoring techniques using JVM flags, JFR events, and framework-specific considerations for Helidon and Quarkus.

ForkJoinPool · Helidon · JFR
0 likes · 11 min read
Architecture Breakthrough
Jul 28, 2025 · Operations

Turn Point Fixes into Systemic Solutions: A Practical Optimization Framework

Effective technical optimization requires moving from isolated, point‑style ideas to a comprehensive, measurable framework that quantifies goals, assesses gaps, designs capacity, monitors key services and links, and establishes clear compensation and incident‑handling procedures, ensuring a complete, closed‑loop solution.

Monitoring · Operations · capacity planning
0 likes · 8 min read
MaGe Linux Operations
Jul 25, 2025 · Operations

5 Game‑Changing One‑Liner Shell Commands Every Ops Engineer Must Know

This article shares five battle‑tested one‑line Shell commands that instantly diagnose server health, analyze logs, rank process resources, troubleshoot network connections, and clean disk space, plus practical tips and mindset advice to help operations engineers solve critical incidents faster and more reliably.

Linux · Monitoring · One-liner
0 likes · 10 min read
dbaplus Community
Jul 24, 2025 · Operations

How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details Bilibili's approach to handling explosive growth in server count by classifying faults, identifying shortcomings of manual processes, and implementing an automated, end‑to‑end detection, rule‑based alerting, and repair workflow that combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.

Data Center · Monitoring · fault detection
0 likes · 17 min read
Ops Community
Jul 24, 2025 · Operations

How a Small E‑commerce Site Scaled to 10 Million Daily Visits: Real‑World Architecture Lessons

This article details a small‑to‑mid‑size e‑commerce platform’s journey from a few thousand daily page views to ten million, covering business challenges, three architecture evolution stages, key technical solutions, performance optimizations, cost‑control strategies, and practical automation tips.

Monitoring · Operations · Performance Optimization
0 likes · 14 min read
Ops Community
Jul 23, 2025 · Operations

Why Did My JVM Show 900% CPU? Uncovering Container Limit Misconfigurations

An ops veteran with eight years of experience investigates a night‑time alert showing 900% CPU usage, discovers that a JVM inside a Kubernetes pod detects the host's core count even though the container is limited to two CPUs, and explains how improper thread‑pool settings and misleading monitoring metrics led to massive throttling before presenting concrete fixes.

CPU throttling · JVM · Kubernetes
0 likes · 10 min read
MaGe Linux Operations
Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Kubernetes · Monitoring
0 likes · 12 min read
High Availability Architecture
Jul 22, 2025 · Operations

How We Automated Server Fault Detection and Repair at Scale

This article explains the challenges of managing rapidly growing server fleets, outlines a systematic classification of hardware and software faults, and details an end‑to‑end automated solution that combines in‑band and out‑of‑band data collection, rule‑based detection, and fully automated repair workflows to improve fault coverage, accuracy, and recovery speed.

Monitoring · Operations · hardware detection
0 likes · 16 min read
Architect's Guide
Jul 21, 2025 · Operations

How to Achieve Five Nines: Practical High‑Availability Strategies for Modern Web Systems

This article explains key high‑availability concepts such as availability metrics, microservice modularization, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call processes, providing concrete design guidelines for building resilient internet services.

Circuit Breaking · Monitoring · Rate Limiting
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Jul 21, 2025 · Operations

Create an AI Ops Assistant Using Elasticsearch for Real‑Time Monitoring & NL Queries

This guide explains how to build an AI‑powered operations assistant with Elasticsearch that provides real‑time monitoring, natural‑language query translation, end‑to‑end automation, and lower technical barriers, covering architecture, one‑click deployment, validation steps, and resource cleanup.

AI Ops · Elasticsearch · Monitoring
0 likes · 7 min read
Code Mala Tang
Jul 18, 2025 · Backend Development

Unlock Lightning-Fast Node.js: 8 Proven Backend Performance Hacks

Discover why a sluggish API hurts user retention, SEO, and costs, and learn eight practical Node.js backend optimization techniques—including mastering the event loop, avoiding blocking code, leveraging async/await, offloading heavy tasks, efficient JSON handling, caching strategies, database tuning, clustering, and continuous monitoring—to boost performance and scalability.

Backend Performance · Caching · Monitoring
0 likes · 8 min read
Ops Development & AI Practice
Jul 18, 2025 · Operations

Mastering Modern Software Operations: The Six Essential Steps for Success

Modern software operations have shifted from a post‑launch checklist to an ongoing, automated discipline, and this article outlines the six core phases—requirement planning, CI/CD automation, comprehensive monitoring, incident response, performance tuning, and security compliance—providing concrete examples and practical advice for building a resilient DevOps culture.

DevOps · Incident Management · Monitoring
0 likes · 9 min read
MaGe Linux Operations
Jul 17, 2025 · Operations

Master Network Device Ops: Switches, Routers, and Firewalls Deep Dive

This comprehensive guide walks network engineers through the fundamentals and advanced techniques for operating switches, routers, and firewalls, covering configuration, performance monitoring, troubleshooting, automation, security hardening, and emerging trends like SDN and AI-driven operations.

Automation · Monitoring · Switch Configuration
0 likes · 26 min read
Efficient Ops
Jul 14, 2025 · Operations

Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide

After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, optimized database queries, containerized the service, and set up Prometheus‑Grafana alerts to prevent future incidents.

CPU troubleshooting · Docker · Java performance
0 likes · 7 min read
Efficient Ops
Jul 13, 2025 · Operations

Mastering Modern System Operations: 6 Essential Strategies for Stability and Efficiency

This comprehensive guide outlines six critical areas of modern system operations—including real‑time monitoring, security safeguards, automation, fault diagnosis, collaborative teamwork, and process optimization—offering practical strategies and tools such as Zabbix, Prometheus, ELK, Redis, Ansible, and capacity planning to ensure stable, efficient enterprise services.

Automation · Monitoring · Security
0 likes · 10 min read
MaGe Linux Operations
Jul 12, 2025 · Operations

Mastering EFK: The Complete Guide to Building a Scalable Log Management System

This comprehensive guide explains the EFK (Elasticsearch, Fluentd, Kibana) log management stack, covering its components, architecture, deployment steps, log collection strategies, index optimization, monitoring, security hardening, troubleshooting and best‑practice recommendations for building a reliable, scalable logging solution in modern cloud‑native environments.

Docker · EFK · Elasticsearch
0 likes · 17 min read
Qunhe Technology Quality Tech
Jul 10, 2025 · Operations

Ensuring Elasticsearch Stability: Testing, Performance, and Disaster Recovery

This article outlines a comprehensive reliability framework for Elasticsearch, covering pre‑release performance evaluation, data accuracy checks, real‑time sync delay alerts, rapid recovery strategies, performance testing methods, and disaster‑recovery measures such as multi‑cluster backup and index alias switching.

Monitoring · data synchronization · disaster recovery
0 likes · 12 min read
Zhuanzhuan Tech
Jul 9, 2025 · Operations

How Apache HertzBeat Enables Agent‑Free Real‑Time Monitoring and Alerting

This guide introduces Apache HertzBeat, an open‑source real‑time monitoring and alerting platform that requires no agents, supports high‑performance clusters, offers customizable protocols, integrates with Grafana, provides plugin hot‑updates, and details its time‑wheel scheduling, cloud‑edge collaboration, and alert configuration.

Apache · Cluster · HertzBeat
0 likes · 22 min read
Ops Community
Jul 6, 2025 · Operations

Master KVM Production Deployment: Real-World Ops Guide & Automation Scripts

This comprehensive guide walks you through KVM virtualization platform deployment in production, covering host preparation, VM creation, advanced networking, storage pool management, performance tuning, monitoring, and automated operational scripts to build a stable and efficient virtualized environment.

Deployment · KVM · Linux
0 likes · 37 min read
Liangxu Linux
Jul 5, 2025 · Operations

Step‑by‑Step Guide to Installing and Configuring Nagios on CentOS 7

This tutorial walks through preparing a CentOS 7 virtual machine, configuring networking, setting up required packages, compiling and installing Nagios Core, adding the Nagios user and Apache integration, configuring the firewall, and finally installing and enabling Nagios plugins for full monitoring capabilities.

Installation · Monitoring · Nagios
0 likes · 8 min read