Tagged articles

2195 articles

Jul 5, 2025 · Operations

Step‑by‑Step Guide to Installing and Configuring Nagios on CentOS 7

This tutorial walks through preparing a CentOS 7 virtual machine, configuring networking, setting up required packages, compiling and installing Nagios Core, adding the Nagios user and Apache integration, configuring the firewall, and finally installing and enabling Nagios plugins for full monitoring capabilities.

Installation · Nagios · System Administration

0 likes · 8 min read

Java Architect Essentials

Jul 4, 2025 · Backend Development

Avoid Dependency Nightmares: Best Practices for Building Reusable Spring Boot Starters

The article shares real‑world experiences and step‑by‑step guidelines for creating robust, modular Spring Boot starters, especially for logging and monitoring. It covers dependency conflict detection, strict dependency scopes, SPI design, configuration conventions, and documentation standards that improve reuse and reduce integration headaches.

Custom Starter · Logging · Spring Boot

0 likes · 11 min read

37 Interactive Technology Team

Jul 4, 2025 · Operations

How Dynamic Thresholds with Prophet Transform Monitoring from Static Alerts to Intelligent Insights

Traditional fixed‑threshold monitoring often triggers noisy alerts during routine business rhythms, but by modeling time‑series patterns with Facebook Prophet to predict dynamic confidence intervals, teams can automatically adjust thresholds, reduce false positives, and accurately detect true anomalies across diverse services.

Anomaly Detection · Prophet · Time-series

0 likes · 7 min read

Big Data Tech Team

Jul 3, 2025 · Big Data

Master Kafka: A Complete Learning Roadmap from Basics to Advanced Projects

This guide presents a step‑by‑step Kafka learning roadmap covering core concepts, architecture, configuration, monitoring tools, practical project ideas, advanced components like Streams and KSQL, plus code samples and resource recommendations to help beginners become proficient in real‑time data streaming.

Code Examples · Kafka · Streaming

0 likes · 14 min read

Linux Ops Smart Journey

Jul 3, 2025 · Cloud Native

How to Visualize Kubernetes Namespace Resource Usage with Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus to collect CPU, memory and other resource metrics per Kubernetes namespace, setting up ResourceQuota and LimitRange visualizations, and verifying data collection with Helm, Docker, and curl commands, enabling comprehensive cluster health monitoring.

Kubernetes · Prometheus · ResourceQuota

0 likes · 7 min read

Efficient Ops

Jul 2, 2025 · Operations

Master Grafana: Key Features, Installation on Linux & Docker

This guide introduces Grafana, outlines its multi‑source monitoring features, and provides step‑by‑step installation instructions for Linux using systemd and for Docker Compose, including required commands, configuration files, and how to create and save a basic dashboard.

Docker · Grafana · Installation

0 likes · 4 min read

Ops Development & AI Practice

Jul 2, 2025 · Operations

Master Alertmanager: Grouping, Inhibition, and Silencing to Tame Alert Storms

In modern cloud‑native environments, Prometheus Alertmanager offers powerful grouping, inhibition, and silencing features that reduce alert noise, help pinpoint root causes, and provide scheduled quiet periods, enabling teams to transform chaotic alert storms into manageable, actionable notifications.

AlertGrouping · Alertmanager · Inhibition

0 likes · 8 min read

Raymond Ops

Jul 2, 2025 · Operations

Master Linux Process Management: From Basics to Advanced Monitoring

This comprehensive guide explains what a process is, how it differs from a program, its lifecycle, and provides detailed instructions for monitoring process status with ps and top, using tools like vmstat, iostat, dstat, managing processes with kill, killall, pkill, background jobs, screen, adjusting priorities, and interpreting system load averages.

Linux · System Administration · monitoring

0 likes · 29 min read

DeWu Technology

Jun 30, 2025 · Operations

How to Build an Effective Asset‑Loss Prevention System for E‑Commerce Platforms

This article explains why asset‑loss (资损) prevention is critical for high‑value e‑commerce finance, outlines a step‑by‑step methodology covering pre‑, in‑ and post‑incident stages, rule discovery, measurement, implementation options, and operational best practices, and shares concrete results and visual diagrams.

asset loss · e-commerce · financial operations

0 likes · 18 min read

Linux Ops Smart Journey

Jun 30, 2025 · Operations

Automate Service Discovery: Seamlessly Connect Prometheus with Consul

This tutorial explains how to integrate Prometheus with Consul for automatic service discovery in cloud‑native environments, covering ACL policy creation, token generation, adding static scrape configurations via the Prometheus Operator, and verification steps to ensure reliable monitoring.

Consul · Kubernetes · Prometheus

0 likes · 4 min read

Lin is Dream

Jun 24, 2025 · Backend Development

Master RocketMQ Console: From Zero to Full Monitoring in Minutes

This article walks you through installing and using the RocketMQ Dashboard to monitor topics, brokers, producers, consumers, and message details, explains common pitfalls such as client‑ID conflicts in Docker, and demonstrates how to troubleshoot consumption issues, TPS metrics, and dead‑letter handling.

Java · Message Queue · RocketMQ

0 likes · 9 min read

dbaplus Community

Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, getting teams to silence noisy alerts, measuring progress, and the tools that make the process sustainable.

Alert Management · Incident Response · backend operations

0 likes · 42 min read

Mingyi World Elasticsearch

Jun 18, 2025 · Operations

Comprehensively Manage Elasticsearch 9.X with INFINI Console

The article provides a detailed technical overview of INFINI Console, an open‑source, lightweight governance platform that enables multi‑cluster, cross‑version management, dynamic registration, monitoring, alerting, and developer tools for Elasticsearch 9.X, comparing it with Kibana and highlighting deployment simplicity across various OS and CPU architectures.

Cluster Management · Cross-Version Support · Deployment

0 likes · 11 min read

DevOps Operations Practice

Jun 16, 2025 · Cloud Native

Mastering Kubernetes: 6 Essential Tools for Cluster Management

This article introduces six indispensable tools—kubectl, Helm, Prometheus + Grafana, Istio, Velero, and K9s—that simplify Kubernetes cluster management by covering resource handling, monitoring, networking, security, backup, and interactive UI, helping readers efficiently operate production‑grade clusters.

Cloud Native · Cluster Management · DevOps

0 likes · 7 min read

Linux Ops Smart Journey

Jun 16, 2025 · Cloud Native

Mastering PrometheusRule: Streamline Kubernetes Alerting & Recording

This article explains how PrometheusRule, a Kubernetes custom resource, simplifies the management of alerting and recording rules by centralizing configurations, reducing restarts, avoiding conflicts, and enabling version‑controlled, modular monitoring for cloud‑native environments.

Cloud Native · Kubernetes · Prometheus

0 likes · 6 min read

Linux Ops Smart Journey

Jun 13, 2025 · Operations

Master ServiceMonitor: Build Reliable Prometheus Monitoring for Kubernetes

This article dives deep into ServiceMonitor, comparing it with traditional Prometheus configurations, detailing its core fields, and providing hands‑on examples for Harbor and GitLab metrics, enabling you to create stable, flexible, and maintainable monitoring setups for Kubernetes services.

Kubernetes · Operations · Prometheus

0 likes · 5 min read

Liangxu Linux

Jun 12, 2025 · Operations

Essential Shell Scripts for Linux Ops: Alerts, MySQL Backup, and Traffic Monitoring

This article provides ready‑to‑use Shell scripts for Linux system alerts, automated MySQL database backups, and real‑time network interface traffic monitoring, offering sysadmins practical code snippets and configuration steps to streamline routine operations.

Automation · backup · monitoring

0 likes · 4 min read

Efficient Ops

Jun 11, 2025 · Operations

Master cURL: Essential Commands for DevOps, Monitoring, and Automation

This guide presents essential cURL commands for service health checks, API testing, file transfer, debugging, Kubernetes interactions, monitoring, load balancing, and webhook triggering, demonstrating how the versatile tool can streamline automation, CI/CD pipelines, and daily DevOps tasks.

API testing · Automation · DevOps

0 likes · 5 min read

vivo Internet Technology

Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP (Kafka-on-Pulsar) lag reporting issues.

Big Data · Metrics · Observability

0 likes · 12 min read

Linux Ops Smart Journey

Jun 11, 2025 · Cloud Native

Master Cloud‑Native Monitoring: Deploy Prometheus Operator with Helm

This guide explains why traditional monitoring falls short in cloud‑native environments and shows step‑by‑step how to install and configure the Prometheus Operator on Kubernetes using Helm, including custom image settings, storage configuration, and verification of the deployed services.

Kubernetes · Operator · Prometheus

0 likes · 7 min read

Java Captain

Jun 10, 2025 · Backend Development

Why Spring Batch? Real‑World Scenarios, Core Architecture and Hands‑On Guide

This article explains the necessity of batch processing, presents typical use cases such as daily interest calculation, e‑commerce order archiving, log analysis and medical data migration, then dives deep into Spring Batch's core components, provides step‑by‑step code examples, performance‑tuning tips, production‑grade fault‑tolerance, monitoring solutions and a comprehensive FAQ.

Batch Processing · Java · Spring Batch

0 likes · 20 min read

Linux Ops Smart Journey

Jun 6, 2025 · Operations

How to Build a Complete Longhorn Monitoring System with Prometheus & Grafana

This guide explains how to monitor Longhorn storage in Kubernetes by collecting metrics with Prometheus, configuring scraping, verifying data collection, and visualizing everything in Grafana, enabling proactive performance tuning and reliable operations.

Grafana · Kubernetes · Longhorn

0 likes · 6 min read

Big Data Technology & Architecture

Jun 5, 2025 · Big Data

Flink Web UI Monitoring and End‑to‑End Latency Implementation Guide

This article explains the key monitoring items of the Flink Web UI, details task topology, operator and system metrics, checkpoint and log inspection, and provides two practical solutions—custom metrics and distributed tracing—to measure and visualize full‑chain latency in Flink jobs.

Big Data · Flink · Latency

0 likes · 10 min read

FunTester

Jun 5, 2025 · Cloud Native

Automating Thread Dump Generation and Retrieval in Kubernetes for Efficient Fault Diagnosis

The article explains how automating thread dump creation and download in Kubernetes using tools like Fabric8, Prometheus, and CI/CD pipelines dramatically improves fault‑diagnosis speed, data centralization, real‑time capture, and integration with testing frameworks, transforming manual, error‑prone processes into streamlined, intelligent operations.

Automation · Kubernetes · Thread Dump

0 likes · 6 min read

Raymond Ops

Jun 4, 2025 · Operations

Mastering SFTP: Complete Planning, Configuration, and High‑Availability Guide

This guide walks you through SFTP server planning, user naming conventions, directory structures, SSH configuration, account creation, permission setup, client usage, log auditing, rotation, connection limits, monitoring, and high‑availability deployment across multiple servers, providing ready‑to‑run commands and scripts.

Linux · SFTP · SSH

0 likes · 14 min read

Alibaba Cloud Observability

Jun 3, 2025 · Cloud Native

How PromQL Copilot Turns Natural Language into Precise Monitoring Queries

PromQL Copilot leverages Alibaba Cloud's observability platform and AI techniques to convert ambiguous natural‑language monitoring requests into accurate PromQL statements, addressing challenges of ambiguity, domain knowledge, and metric coverage while providing generation, explanation, diagnosis, and recommendation features for cloud‑native environments.

AI · Cloud Native · Metrics

0 likes · 12 min read

Liangxu Linux

Jun 2, 2025 · Operations

10 Must‑Know Ops Tools to Transform Reactive Firefighting into Proactive Management

This guide presents ten essential operations tools—including Zabbix, Prometheus, MySQL, Redis, Ansible, Jenkins, Docker, Kubernetes, LVS, and Kafka—covering monitoring, databases, automation, containerization, and load balancing, to help engineers shift from reactive firefighting to proactive, efficient system management.

Automation · Containers · Messaging

0 likes · 4 min read

Linux Ops Smart Journey

May 29, 2025 · Cloud Native

Master Kubernetes Monitoring with kube-state-metrics and Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus scrape jobs, verifying metric collection, and adding Grafana dashboards to achieve a visible, manageable, and reliable Kubernetes monitoring solution for large‑scale clusters.

Kubernetes · Observability · Prometheus

0 likes · 7 min read

Alibaba Cloud Developer

May 27, 2025 · Artificial Intelligence

How to Build AI-Powered Java Apps with Spring AI and DeepSeek

This guide walks Java developers through integrating Spring AI with large‑model services such as DeepSeek, covering setup, API key configuration, code examples for synchronous and streaming calls, reactive implementation, monitoring with Actuator, and compatibility with OpenAI‑style APIs.

AI integration · DeepSeek · Java

0 likes · 9 min read

Bilibili Tech

May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Infrastructure · Operations

0 likes · 17 min read

Java Architecture Diary

May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Micrometer · Observability · Spring AI

0 likes · 12 min read

MaGe Linux Operations

May 25, 2025 · Cloud Native

Master Docker Volume Management: From Basics to Advanced Ops

This comprehensive guide walks you through Docker volume creation, inspection, mounting, backup, restoration, cross‑host migration, labeling, driver configuration, security permissions, encryption, monitoring, troubleshooting, capacity planning, and automation scripts, providing practical commands and best‑practice recommendations for reliable container storage management.

Automation · Container · monitoring

0 likes · 8 min read

Su San Talks Tech

May 24, 2025 · Backend Development

12 Proven SpringBoot Performance Hacks to Boost Your API Speed

Discover twelve practical SpringBoot performance optimization techniques—from connection pool tuning and JVM memory settings to caching, async processing, and full‑stack monitoring—each illustrated with code snippets and actionable guidance to prevent full‑table scans, OOM errors, and latency spikes in high‑traffic applications.

JVM · Java · Performance Optimization

0 likes · 13 min read

DataFunSummit

May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps

0 likes · 12 min read

Cloud Native Technology Community

May 22, 2025 · Information Security

How to Prevent Common Kubernetes Security Mistakes and Harden Your Cluster

This article analyzes typical Kubernetes security pitfalls—from weak authentication and overly permissive network policies to missing real‑time monitoring, exposed services, outdated versions, and default component settings—and provides concrete, layered mitigation steps and tool recommendations.

Best Practices · Cloud Native · Kubernetes

0 likes · 13 min read

Big Data Technology & Architecture

May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data · Flink · Interview

0 likes · 7 min read

Architect's Tech Stack

May 20, 2025 · Operations

Visualizing Nginx Access Logs with Loki and Grafana

This guide explains how to collect Nginx access logs, convert them to JSON, store them in Loki using Promtail, and visualize the data with Grafana dashboards, including installation of required modules, Docker deployment, and world‑map panel configuration.

Grafana · JSON · Logging

0 likes · 8 min read

Java Tech Enthusiast

May 18, 2025 · Operations

Ten Rules for Writing High‑Quality Logs in Production Systems

This article presents ten practical rules for producing high‑quality, searchable logs—including unified formatting, stack‑trace inclusion, proper log levels, complete parameters, data masking, asynchronous writing, trace‑ID linking, dynamic level control, structured storage, and intelligent monitoring—to help developers quickly diagnose issues in high‑traffic applications.

Best Practices · Logging · logback

0 likes · 11 min read

Liangxu Linux

May 18, 2025 · Operations

How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Debugging Guide

When a midnight CPU alarm fired, I logged into the server, identified runaway Java processes, profiled the JVM, refactored a costly sorting algorithm, added database indexes, containerized the service, and set up Prometheus alerts, ultimately reducing CPU usage below 30% and restoring millisecond response times.

CPU · Docker · JVM

0 likes · 6 min read

Linux Ops Smart Journey

May 16, 2025 · Operations

Turn Jenkins into a Real‑Time Monitoring Hub with Prometheus & Grafana

This guide shows how to integrate Jenkins with Prometheus and Grafana, covering plugin installation, metric endpoint exposure, Prometheus scraping configuration, verification via curl, and importing a ready‑made Grafana dashboard to achieve proactive, visualized CI/CD monitoring.

DevOps · Grafana · Jenkins

0 likes · 4 min read

Liangxu Linux

May 15, 2025 · Operations

10 Critical Server Ops Mistakes to Avoid and Real-World Lessons

This article outlines ten common server operation pitfalls—such as forced power‑offs, reckless experiments in production, neglecting firewall rules, running unknown scripts as root, unbacked‑up database changes, weak SSH settings, poor log management, exposed ports, unmonitored changes, and delayed patching—each illustrated with real‑world cases and practical remediation advice.

Security · System Administration · backup

0 likes · 7 min read

Architect

May 15, 2025 · Operations

How I Rescued a Critical Service: A Step‑by‑Step CPU Overload Debugging Guide

When a midnight CPU alarm threatened service availability, I walked through rapid system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and Prometheus alerting to bring CPU usage back below 30% and restore millisecond‑level response times.

Docker · JVM · Prometheus

0 likes · 7 min read

Raymond Ops

May 11, 2025 · Cloud Native

How to Expose Ingress Metrics for Prometheus Monitoring in Kubernetes

This guide details how to expose the nginx‑ingress metrics port, configure static and ServiceMonitor‑based scraping in Prometheus Operator, create necessary secrets, and integrate the metrics into Grafana dashboards, providing a complete Kubernetes‑native solution for monitoring ingress traffic.

Cloud Native · Prometheus · ServiceMonitor

0 likes · 6 min read

MaGe Linux Operations

May 11, 2025 · Cloud Native

How to Build a High‑Performance, Highly‑Available Cloud‑Native Ingress Gateway

For an Ingress gateway facing more than 100,000 QPS, this guide outlines systematic performance optimizations, configuration tweaks, distributed architecture designs, traffic management, monitoring, and disaster‑recovery strategies, including hardware scaling, kernel tuning, DPDK, rate limiting, horizontal scaling, service mesh integration, and CDN offloading, to achieve high concurrency and high availability.

Scalability · cloud-native · high-availability

0 likes · 8 min read

Raymond Ops

May 9, 2025 · Operations

Build a Complete Prometheus Monitoring Stack with Docker

This tutorial explains Prometheus' core components, shows how to deploy Prometheus Server, Node Exporter, cAdvisor, and Grafana as Docker containers on two hosts, configures scraping and alerting, and demonstrates visualizing metrics with ready‑made Grafana dashboards.

Alertmanager · Docker · Exporter

0 likes · 8 min read

Qunar Tech Salon

May 9, 2025 · Operations

Kafka Production Optimization: Reducing Load and Improving Compression via Filebeat Tuning

This technical case study details how a high‑traffic Kafka logging cluster was optimized by adjusting Filebeat and Kafka parameters, increasing compression batch size, and tuning Kubernetes settings, resulting in significant reductions in request volume, network traffic, CPU usage, and overall resource consumption.

Filebeat · Kafka · Operations

0 likes · 10 min read

Top Architect

May 8, 2025 · Operations

Centralized Log Collection with Filebeat and Graylog: Configuration, Deployment, and Integration Guide

This article explains how to use Filebeat for log shipping, configure its YAML files, deploy Graylog with Docker and Elasticsearch, and integrate logging into Spring Boot applications, providing step‑by‑step commands, code examples, and best‑practice recommendations for centralized log management.

Docker · Elasticsearch · Filebeat

0 likes · 20 min read

DevOps Operations Practice

May 6, 2025 · Operations

Kubernetes Certificate Management: Common Pitfalls, Detection Methods, and Renewal Procedures

This article explains why Kubernetes certificates often become hidden "time bombs," describes the typical failures caused by expired certificates, and provides practical methods to detect upcoming expirations and safely renew or replace them to keep clusters running smoothly.

Kubernetes · Operations · Security

0 likes · 6 min read

dbaplus Community

Apr 24, 2025 · Operations

How Ctrip Built a Scalable Observability Platform and AIOps Engine for Millions of Metrics and Logs

This article details Ctrip's end‑to‑end observability platform—covering metrics, logging, and tracing—its architecture, data governance, AIOps capabilities, and practical case studies, while addressing challenges like data volume, alert noise, and metric explosion in a massive micro‑service environment.

Ctrip · Logging · Metrics

0 likes · 17 min read

Java Captain

Apr 22, 2025 · Operations

Improving Cron Job Stability and Monitoring with Best Practices and Healthchecks

The article analyzes common cron job failures such as accidental deletions, OOM crashes, and lack of monitoring, then proposes standardized Jenkins deployment, automatic server selection, lock mechanisms, queue-based processing, status awareness, and the use of the open‑source Healthchecks system to achieve proactive detection and alerting.

Automation · Operations · Task Scheduling

0 likes · 8 min read

DeWu Technology

Apr 21, 2025 · Backend Development

Design and Evolution of a Unified Exchange Mall Middleware Platform

The unified exchange mall middleware platform consolidates disparate points‑redemption and lottery flows into a four‑layer architecture—business, gameplay templates, domain models, and downstream services—offering standardized APIs, dynamic RPC routing, Redis‑based inventory control, anti‑fraud safeguards, and built‑in monitoring, thereby cutting development costs, enhancing maintainability, and ensuring system stability.

Golang · RPC · anti-fraud

0 likes · 18 min read

Linux Ops Smart Journey

Apr 20, 2025 · Operations

Visualize Kubernetes Events: Store in Elasticsearch and Dashboard with Grafana

This guide explains how to store Kubernetes event data in Elasticsearch, configure Logstash and Ruby filters for timestamp correction, and create a Grafana dashboard to visualize and analyze cluster events for improved monitoring and troubleshooting.

Elasticsearch · Grafana · K8s Events

0 likes · 4 min read

Linux Cloud Computing Practice

Apr 18, 2025 · Operations

Unlock the Full Zabbix 7.0 Manual: Features, Architecture & Installation Guide

This article introduces Zabbix as a powerful open‑source monitoring solution, outlines its 7.0 features, architecture, installation and configuration steps, highlights recent enhancements, and explains how to obtain a free 2000‑page Chinese manual via QR code.

Installation · Operations · configuration

0 likes · 4 min read

Efficient Ops

Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · DevOps · Infrastructure

0 likes · 9 min read

Linux Ops Smart Journey

Apr 16, 2025 · Operations

How to Build a Robust Elasticsearch Monitoring System with Prometheus & Grafana

Learn step‑by‑step how to deploy the Elasticsearch‑exporter via Helm, configure Prometheus to scrape its metrics, and visualize them in Grafana, enabling comprehensive monitoring of Elasticsearch clusters for performance, health, and early issue detection in Kubernetes environments.

Elasticsearch · Exporter · Grafana

0 likes · 7 min read

Su San Talks Tech

Apr 13, 2025 · Operations

Unlock Zabbix: Complete Guide to Features, Architecture, and Hands‑On Deployment on CentOS

This article introduces Zabbix’s core features, flexible data collection, custom alerting, visualization, high‑availability architecture, security auditing, and compares it with Prometheus, then walks through a step‑by‑step installation, configuration, and deployment on a CentOS server.

Installation · Linux · Operations

0 likes · 20 min read

Architecture and Beyond

Apr 12, 2025 · Backend Development

How to Keep Your AIGC Service Stable: Queueing and Rate‑Limiting Strategies

This article explains why AIGC services need queueing systems and rate‑limiting, describes the user‑facing behaviors of both mechanisms, outlines design goals, compares queue and limiter implementations, and provides practical guidance on selecting middleware, monitoring, and integrating them into a production workflow.

AIGC · Message Queue · Rate Limiting

0 likes · 28 min read

FunTester

Apr 12, 2025 · Operations

How to Design Effective Fault‑Testing Cases for Resilient Distributed Systems

This article explains why fault testing is essential for modern distributed and cloud environments, outlines core goals, design principles, common fault categories, practical implementation strategies such as chaos engineering and gray releases, and shows how to analyze results to continuously improve system reliability.

Distributed Systems · chaos engineering · fault testing

0 likes · 18 min read

Linux Ops Smart Journey

Apr 8, 2025 · Operations

How to Efficiently Monitor HAProxy with Prometheus and Grafana

This guide explains how to set up HAProxy monitoring by configuring a Prometheus exporter, adding HAProxy targets to Prometheus, verifying metric collection, and visualizing the data in Grafana with a ready-made dashboard, ensuring reliable and performant services.

Grafana · HAProxy · Kubernetes

0 likes · 4 min read

Raymond Ops

Apr 7, 2025 · Operations

How to Deploy Prometheus on Kubernetes and Resolve Alertmanager Port Issues

This guide explains what Prometheus monitoring is, walks through downloading the correct version for a Kubernetes cluster, customizing alert rules, deploying and cleaning up Prometheus, and troubleshooting common Alertmanager connection problems by checking DNS and network configurations.

Alertmanager · Prometheus · Troubleshooting

0 likes · 9 min read

Deepin Linux

Apr 2, 2025 · Operations

Comprehensive Guide to bpftrace: Features, Architecture, Installation, and Practical Use Cases

This article introduces bpftrace, an eBPF‑based dynamic tracing tool for Linux, explains its core concepts, technical architecture, installation methods, basic syntax, and demonstrates real‑world performance analysis, fault diagnosis, and security monitoring scenarios while comparing it with DTrace, SystemTap, and BCC.

Debugging · Linux performance · System Tracing

0 likes · 24 min read

Mingyi World Elasticsearch

Mar 25, 2025 · Operations

How to Consolidate Monitoring for Multiple Elasticsearch Clusters with INFINI Console

The article analyzes the pain points of managing several Elasticsearch clusters separately, compares native Kibana, custom scripts, and commercial tools, and then walks through a practical implementation using the lightweight INFINI Console to achieve unified, version‑agnostic monitoring and alerting.

Elasticsearch · INFINI Console · Kibana

0 likes · 9 min read

The Dominant Programmer

Mar 22, 2025 · Databases

Common Redis Performance Issues and How to Make Your Cache Fly

This article examines the most frequent Redis performance bottlenecks—including high memory usage, network latency, misconfiguration, poor data‑structure choices, and suboptimal persistence—explains why they occur, and provides concrete optimization techniques, monitoring commands, real‑world case studies, and emerging trends to keep your cache fast and stable.

Data Structures · Network Latency · Performance Optimization

0 likes · 8 min read

Tencent Cloud Developer

Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential to detect resource contention, component failures, and network issues; it involves tracking core component metrics such as API server latency, etcd write times, scheduler delays, as well as node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters for comprehensive observability.

Cloud Native · Exporters · Kubernetes

0 likes · 23 min read

JD Tech

Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Supply Chain · dashboard

0 likes · 10 min read

Alibaba Cloud Native

Mar 13, 2025 · Cloud Native

How to Extend SAE with Sidecar Containers for Custom Logging and Monitoring

This article explains how Alibaba Cloud's Serverless Application Engine (SAE) uses sidecar containers to let users add custom log collection, metric monitoring, and resource isolation without modifying their main application code, detailing configuration modes, operational tools, and a step‑by‑step implementation example.

SAE · Serverless · monitoring

0 likes · 12 min read

php Courses

Mar 13, 2025 · Backend Development

Effective Strategies for Optimizing PHP Application Performance

Optimizing PHP applications involves a combination of code-level improvements—such as caching, efficient algorithms, and query optimization—and server-side configurations like upgrading PHP, enabling opcode caches, tuning web servers, and leveraging CDNs, along with monitoring tools and asynchronous processing to achieve faster, more scalable performance.

Caching · PHP · Performance Optimization

0 likes · 5 min read

JD Tech Talk

Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Supply Chain · dashboard

0 likes · 11 min read

JD Cloud Developers

Mar 12, 2025 · Operations

How to Ensure Double‑11 Supply‑Chain Dashboard Stability: End‑to‑End Strategies

This article details the end‑to‑end technical and operational measures—including full‑chain flow mapping, risk point analysis, layered mitigation tactics, monitoring, and team coordination—used to guarantee the stability and accuracy of the supply‑chain dashboard during the Double‑11 promotion.

Big Data · Operations · Supply Chain

0 likes · 15 min read

Su San Talks Tech

Mar 12, 2025 · Backend Development

How to Diagnose and Resolve JVM Out‑Of‑Memory Errors: A Practical Checklist

This article walks through a step‑by‑step troubleshooting process for JVM Out‑Of‑Memory (OOM) incidents, covering symptom identification, monitoring tools like top, jstat, and jmap, root‑cause analysis, and actionable recommendations to prevent future memory leaks.

JVM · Java · OutOfMemory

0 likes · 5 min read

Lobster Programming

Mar 10, 2025 · Operations

How to Build a Complete SpringBoot Monitoring System with Prometheus and Grafana

This guide walks you through integrating SpringBoot with Prometheus and Grafana, covering dependency setup, YAML configuration, a test controller, Prometheus scrape jobs, and Grafana dashboard creation to achieve real‑time application monitoring and performance analysis.

Actuator · Grafana · Prometheus

0 likes · 7 min read

Efficient Ops

Mar 9, 2025 · Artificial Intelligence

Essential LLMOps Tools: Build, Deploy, Monitor, and Manage Large Language Models

LLMOps, the end-to-end methodology for managing large language models, encompasses a curated set of development, deployment, monitoring, and local management tools—such as LangChain, vLLM, LangSmith, and Ollama—enabling practitioners to efficiently build, scale, and maintain AI applications.

AI development · LLMOps · Large Language Models

0 likes · 6 min read

dbaplus Community

Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

Incident Response · JVM Optimization · SRE

0 likes · 20 min read

Ops Development & AI Practice

Mar 5, 2025 · Cloud Computing

Master Advanced Terraform Techniques: Best Practices for Reliable IaC

This guide presents advanced Terraform techniques and best practices—including code style, modular design, state management, version control, CI/CD integration, security, and monitoring—to help engineers write more professional, maintainable, and secure infrastructure-as-code configurations.

Security · best-practices · infrastructure-as-code

0 likes · 12 min read

Practical DevOps Architecture

Mar 5, 2025 · Operations

Zabbix Agent Active Mode Workflow and Configuration Guide

This article explains the Zabbix‑Agent active mode workflow, detailing how the agent initiates TCP connections to the Zabbix‑Server to request monitoring items, receives the item list, sends collected data back, and provides step‑by‑step configuration of the agent and server, including template cloning and essential parameters.

Active Mode · agent configuration · monitoring

0 likes · 6 min read

FunTester

Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Circuit Breaker · Rate Limiting · chaos engineering

0 likes · 11 min read

Cognitive Technology Team

Mar 1, 2025 · Databases

Understanding and Mitigating Redis Large‑Key Issues

The article explains what constitutes a Redis large key, outlines its performance and stability risks, describes common scenarios and root causes, and provides practical detection commands, mitigation techniques such as splitting, compression, proper data modeling, and monitoring strategies to prevent future issues.

Database · Memory Optimization · Redis

0 likes · 6 min read

Top Architect

Feb 28, 2025 · Databases

Database Monitoring and Logging: Tools, Commands, and MySQL Slow Query Log Configuration

This article explains how to monitor system resources and record execution logs for databases, introduces Linux commands such as top, iostat, vmstat, shows how to enable and view MySQL slow query logs, and offers best practices and automation tools, while also promoting related AI and community services.

Linux · Logging · MySQL

0 likes · 7 min read

macrozheng

Feb 21, 2025 · Backend Development

Boost SpringBoot Performance: Monitoring, Profiling, and Optimization Techniques

This guide walks through practical SpringBoot performance improvements, covering metric exposure with Prometheus, flame‑graph profiling via async‑profiler, distributed tracing with SkyWalking, HTTP and Tomcat tuning, and layer‑specific optimizations for controllers, services, and data access.

monitoring

0 likes · 17 min read

Spring Full-Stack Practical Cases

Feb 20, 2025 · Backend Development

Master Spring Boot 3 Monitoring with JavaMelody: Real‑World Cases & Config Guide

This article introduces a growing collection of over 90 Spring Boot 3 practical articles and a PDF ebook, then provides a step‑by‑step tutorial on integrating JavaMelody for monitoring, customizing metrics, and securing the monitoring UI within Spring Boot applications.

Javamelody · Security · Spring Boot

0 likes · 9 min read

Architecture Development Notes

Feb 19, 2025 · Operations

Avoid Prometheus Label Pitfalls: Best Practices for Scalable Monitoring

This article examines common label misuse in Prometheus, explains why adding global labels to every metric can cause data bloat, configuration rigidity, and dimensional pollution, and provides concrete best‑practice patterns, dynamic injection techniques, and governance rules to keep monitoring systems efficient and maintainable.

Best Practices · Cloud Native · Labels

0 likes · 7 min read

dbaplus Community

Feb 18, 2025 · Backend Development

Why 1.5 ms RPC Calls Sometimes Exceed 100 ms? Deep Dive & Elastic Timeout Fix

A detailed investigation reveals why an RPC interface with an average 1.5 ms execution time still experiences hundreds of 100 ms+ timeouts, analyzes framework versus business latency, identifies GC and I/O jitter as root causes, and proposes an elastic timeout strategy to meet five‑nine reliability targets.

Latency · RPC · elastic timeout

0 likes · 8 min read

DevOps Cloud Academy

Feb 17, 2025 · Operations

Top 10 AI Tools Transforming DevOps Engineering

This article reviews ten AI‑powered tools—including Jenkins, Ansible, Puppet, Dynatrace, Splunk, GitHub Copilot, New Relic, Azure DevOps, Prometheus, and Chef—that enhance DevOps workflows through predictive analytics, automated rollback, intelligent monitoring, and code assistance, helping teams achieve faster, more reliable software delivery.

AI · Automation · DevOps

0 likes · 14 min read

Liangxu Linux

Feb 16, 2025 · Operations

How to Quickly Visualize Shell Commands with Sampler – Install, Configure, and Use

Sampler is a lightweight tool that runs shell commands, visualizes their output, and triggers alerts, using simple YAML configuration; the guide explains why it’s useful, how to install it on macOS, Linux, and Windows, and provides detailed examples of components, triggers, interactive shells, and real‑world database monitoring scenarios.

YAML · alerts · monitoring

0 likes · 14 min read

Deepin Linux

Feb 12, 2025 · Operations

Comprehensive Guide to Linux Server Fault Diagnosis and Troubleshooting

This article provides a detailed overview of common Linux server failures, a step‑by‑step methodology for fault isolation, practical monitoring tools and commands, and a real‑world case study illustrating diagnosis and remediation techniques for production environments.

Linux · Troubleshooting · monitoring

0 likes · 26 min read

ITPUB

Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Incident Response · Observability

0 likes · 12 min read

Liangxu Linux

Feb 9, 2025 · Fundamentals

Mastering Linux Processes: From Basics to Advanced Monitoring and Management

This guide explains what a process is, how it differs from a program, its lifecycle, how to monitor and interpret process states with ps and top, manage processes using kill, killall, pkill, run jobs in the background with screen or nohup, adjust priorities with nice/renice, and understand load‑average metrics for performance troubleshooting.

Linux · Load Average · monitoring

0 likes · 32 min read

dbaplus Community

Feb 6, 2025 · Databases

How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System

This article recounts the development of ZzoOnlineDDL, a MySQL schema‑change platform, detailing its 1.0 limitations, the 2.0 architectural upgrades, feature set—including intelligent tool selection, timed execution, sharding support, monitoring, and retry mechanisms—and lessons learned from real‑world incidents such as MDL locks, disk pressure, and unique‑index pitfalls.

MySQL · Online DDL · Schema Change

0 likes · 34 min read

Efficient Ops

Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

Availability · Cloud Native · DevOps

0 likes · 4 min read

IT Architects Alliance

Feb 5, 2025 · Cloud Native

Performance Optimization Strategies for Cloud‑Native Applications

This article examines the rapid adoption of cloud‑native architectures and presents a comprehensive guide to identifying performance bottlenecks and applying architectural, resource‑management, caching, networking, and tooling techniques—such as Kubernetes, Prometheus, Grafana, and JMeter—to achieve high‑performance, scalable cloud‑native systems.

Caching · cloud-native · monitoring

0 likes · 22 min read

Rare Earth Juejin Tech Community

Feb 5, 2025 · Frontend Development

Front‑End Tracking (埋点) Overview, Monitoring Types, Performance Metrics, and Implementation Guide

This article explains front‑end tracking concepts, outlines data, performance, and error monitoring, details common performance metrics, compares code‑based, visual, and automatic tracking solutions, and provides practical JavaScript snippets for event collection, error handling, page‑view reporting, and data transmission methods such as XHR, image GIF, and sendBeacon.

Frontend · Web Analytics · monitoring

0 likes · 16 min read

JavaEdge

Feb 2, 2025 · Artificial Intelligence

Mastering LLMOps: From Model Deployment to Scalable AI Operations

This article explains LLMOps—its goals, core activities, benefits, best practices, and how using an LLMOps platform like Dify can dramatically cut development time, simplify prompt engineering, data preparation, monitoring, and deployment of large language models.

AI Operations · Data Management · LLMOps

0 likes · 13 min read

MaGe Linux Operations

Jan 27, 2025 · Operations

Redis Sentinel Deep Dive: High‑Availability Architecture & Automatic Failover

This article explains Redis Sentinel’s role as the official high‑availability solution, detailing its monitoring, notification, automatic failover mechanisms, discovery processes, connection types, down‑state classifications, failover steps, leader election, master selection rules, and data consistency guarantees.

Operations · Redis · failover

0 likes · 18 min read

Soul Technical Team

Jan 24, 2025 · Operations

Migration from Thanos to VictoriaMetrics: Architecture, Plan, Issues, and Benefits

This article details the end‑to‑end migration from Thanos to VictoriaMetrics, covering background analysis, architectural comparison, a phased migration plan, encountered configuration and performance issues, resolution strategies, and the resulting performance, cost, and scalability improvements for the monitoring system.

Thanos · Time-series · VictoriaMetrics

0 likes · 16 min read

Sohu Tech Products

Jan 22, 2025 · Cloud Native

How to Build a Full‑Featured Kubernetes Monitoring Stack with Prometheus & OpenTelemetry

This guide walks through building a complete Kubernetes monitoring stack, covering metric exposure, collection, visualization, alerting, Prometheus configuration for cAdvisor and custom Java apps, dynamic pod discovery, and integrating OpenTelemetry Collector for push‑based observability.

Cloud Native · Kubernetes · OpenTelemetry

0 likes · 8 min read

Top Architect

Jan 21, 2025 · Backend Development

DynamicTp: A SpringBoot‑Based Dynamic Thread‑Pool Framework for Java Applications

The article introduces DynamicTp, a SpringBoot-based dynamic thread‑pool framework that enables real‑time adjustment, monitoring, and alerting of ThreadPoolExecutor parameters via various configuration centers, outlines its architecture, modules, features, and integration with third‑party components, and provides usage guidance for Java backend developers.

DynamicTp · Java · SpringBoot

0 likes · 12 min read

Efficient Ops

Jan 19, 2025 · Operations

How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Ops Playbook

After a midnight CPU alarm, I walked through rapid diagnosis, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring a high‑load Java service back to stability, illustrating a comprehensive incident‑response workflow for modern operations teams.

CPU troubleshooting · Docker deployment · JVM profiling

0 likes · 7 min read

Linux Cloud Computing Practice

Jan 17, 2025 · Operations

10 Essential Linux Sysadmin Tools Every Engineer Should Master

This guide outlines the ten fundamental Linux operations tools and skills—ranging from basic system knowledge and networking services to shell scripting, text processing, databases, firewalls, monitoring, clustering, and backup—that every aspiring sysadmin should learn and practice thoroughly.

Database · Operations · monitoring

0 likes · 6 min read

Sohu Tech Products

Jan 15, 2025 · Backend Development

Deep Dive into Druid Connection Pool: Initialization, Retrieval, and Recycling Explained

This technical guide breaks down Alibaba's Druid JDBC connection pool, detailing its initialization process, how connections are fetched and returned, the internal threads and condition‑signal coordination, execution handling, recommended configurations, and monitoring integration, all illustrated with code snippets and diagrams.

Connection Pool · Database · Druid

0 likes · 23 min read

Efficient Ops

Jan 13, 2025 · Cloud Native

What’s New in Prometheus 3.0? Explore the Latest Cloud‑Native Monitoring Features

Prometheus 3.0 introduces a brand‑new UI, full UTF‑8 support, native OTLP metrics ingestion, native histograms, performance gains, and guidance on high‑cardinality, alert rule, storage, and high‑availability concerns for modern cloud‑native monitoring deployments.

Cloud Native · OTLP · UTF-8

0 likes · 5 min read
