Tagged articles
2195 articles
Liangxu Linux
Jul 5, 2025 · Operations

Step‑by‑Step Guide to Installing and Configuring Nagios on CentOS 7

This tutorial walks through preparing a CentOS 7 virtual machine, configuring networking, setting up required packages, compiling and installing Nagios Core, adding the Nagios user and Apache integration, configuring the firewall, and finally installing and enabling Nagios plugins for full monitoring capabilities.

Installation · Nagios · System Administration
0 likes · 8 min read
Java Architect Essentials
Jul 4, 2025 · Backend Development

Avoid Dependency Nightmares: Best Practices for Building Reusable Spring Boot Starters

The article shares real‑world experiences and step‑by‑step guidelines for creating robust, modular Spring Boot starters, especially for logging and monitoring, covering dependency conflict detection, strict dependency scopes, SPI design, configuration conventions, and documentation standards to dramatically improve reuse and reduce integration headaches.

Custom Starter · Logging · Spring Boot
0 likes · 11 min read
37 Interactive Technology Team
Jul 4, 2025 · Operations

How Dynamic Thresholds with Prophet Transform Monitoring from Static Alerts to Intelligent Insights

Traditional fixed‑threshold monitoring often triggers noisy alerts during routine business rhythms, but by modeling time‑series patterns with Facebook Prophet to predict dynamic confidence intervals, teams can automatically adjust thresholds, reduce false positives, and accurately detect true anomalies across diverse services.

Anomaly Detection · Prophet · Time-series
0 likes · 7 min read
Big Data Tech Team
Jul 3, 2025 · Big Data

Master Kafka: A Complete Learning Roadmap from Basics to Advanced Projects

This guide presents a step‑by‑step Kafka learning roadmap covering core concepts, architecture, configuration, monitoring tools, practical project ideas, advanced components like Streams and KSQL, plus code samples and resource recommendations to help beginners become proficient in real‑time data streaming.

Code Examples · Kafka · Streaming
0 likes · 14 min read
Linux Ops Smart Journey
Jul 3, 2025 · Cloud Native

How to Visualize Kubernetes Namespace Resource Usage with Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus to collect CPU, memory and other resource metrics per Kubernetes namespace, setting up ResourceQuota and LimitRange visualizations, and verifying data collection with Helm, Docker, and curl commands, enabling comprehensive cluster health monitoring.

Kubernetes · Prometheus · ResourceQuota
0 likes · 7 min read
Efficient Ops
Jul 2, 2025 · Operations

Master Grafana: Key Features, Installation on Linux & Docker

This guide introduces Grafana, outlines its multi‑source monitoring features, and provides step‑by‑step installation instructions for Linux using systemd and for Docker Compose, including required commands, configuration files, and how to create and save a basic dashboard.

Docker · Grafana · Installation
0 likes · 4 min read
Raymond Ops
Jul 2, 2025 · Operations

Master Linux Process Management: From Basics to Advanced Monitoring

This comprehensive guide explains what a process is, how it differs from a program, its lifecycle, and provides detailed instructions for monitoring process status with ps and top, using tools like vmstat, iostat, dstat, managing processes with kill, killall, pkill, background jobs, screen, adjusting priorities, and interpreting system load averages.

Linux · System Administration · monitoring
0 likes · 29 min read
DeWu Technology
Jun 30, 2025 · Operations

How to Build an Effective Asset‑Loss Prevention System for E‑Commerce Platforms

This article explains why asset‑loss (资损) prevention is critical for high‑value e‑commerce finance, outlines a step‑by‑step methodology covering pre‑, in‑ and post‑incident stages, rule discovery, measurement, implementation options, and operational best practices, and shares concrete results and visual diagrams.

asset loss · e-commerce · financial operations
0 likes · 18 min read
Lin is Dream
Jun 24, 2025 · Backend Development

Master RocketMQ Console: From Zero to Full Monitoring in Minutes

This article walks you through installing and using the RocketMQ Dashboard to monitor topics, brokers, producers, consumers, and message details, explains common pitfalls such as client‑ID conflicts in Docker, and demonstrates how to troubleshoot consumption issues, TPS metrics, and dead‑letter handling.

Java · Message Queue · RocketMQ
0 likes · 9 min read
dbaplus Community
Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, driving team‑wide alert silencing, measuring progress, and the tools that make the process sustainable.

Alert Management · Incident Response · backend operations
0 likes · 42 min read
Mingyi World Elasticsearch
Jun 18, 2025 · Operations

Comprehensively Manage Elasticsearch 9.X with INFINI Console

The article provides a detailed technical overview of INFINI Console, an open‑source, lightweight governance platform that enables multi‑cluster, cross‑version management, dynamic registration, monitoring, alerting, and developer tools for Elasticsearch 9.X, comparing it with Kibana and highlighting deployment simplicity across various OS and CPU architectures.

Cluster Management · Cross-Version Support · Deployment
0 likes · 11 min read
DevOps Operations Practice
Jun 16, 2025 · Cloud Native

Mastering Kubernetes: 6 Essential Tools for Cluster Management

This article introduces six indispensable tools—kubectl, Helm, Prometheus + Grafana, Istio, Velero, and K9s—that simplify Kubernetes cluster management by covering resource handling, monitoring, networking, security, backup, and interactive UI, helping readers efficiently operate production‑grade clusters.

Cloud Native · Cluster Management · DevOps
0 likes · 7 min read
Linux Ops Smart Journey
Jun 16, 2025 · Cloud Native

Mastering PrometheusRule: Streamline Kubernetes Alerting & Recording

This article explains how PrometheusRule, a Kubernetes custom resource, simplifies the management of alerting and recording rules by centralizing configurations, reducing restarts, avoiding conflicts, and enabling version‑controlled, modular monitoring for cloud‑native environments.

Cloud Native · Kubernetes · Prometheus
0 likes · 6 min read
Efficient Ops
Jun 11, 2025 · Operations

Master cURL: Essential Commands for DevOps, Monitoring, and Automation

This guide presents essential cURL commands for service health checks, API testing, file transfer, debugging, Kubernetes interactions, monitoring, load balancing, and webhook triggering, demonstrating how the versatile tool can streamline automation, CI/CD pipelines, and daily DevOps tasks.

API testing · Automation · DevOps
0 likes · 5 min read
Java Captain
Jun 10, 2025 · Backend Development

Why Spring Batch? Real‑World Scenarios, Core Architecture and Hands‑On Guide

This article explains the necessity of batch processing, presents typical use cases such as daily interest calculation, e‑commerce order archiving, log analysis and medical data migration, then dives deep into Spring Batch's core components, provides step‑by‑step code examples, performance‑tuning tips, production‑grade fault‑tolerance, monitoring solutions and a comprehensive FAQ.

Batch Processing · Java · Spring Batch
0 likes · 20 min read
FunTester
Jun 5, 2025 · Cloud Native

Automating Thread Dump Generation and Retrieval in Kubernetes for Efficient Fault Diagnosis

The article explains how automating thread dump creation and download in Kubernetes using tools like Fabric8, Prometheus, and CI/CD pipelines dramatically improves fault‑diagnosis speed, data centralization, real‑time capture, and integration with testing frameworks, transforming manual, error‑prone processes into streamlined, intelligent operations.

Automation · Kubernetes · Thread Dump
0 likes · 6 min read
Raymond Ops
Jun 4, 2025 · Operations

Mastering SFTP: Complete Planning, Configuration, and High‑Availability Guide

This guide walks you through SFTP server planning, user naming conventions, directory structures, SSH configuration, account creation, permission setup, client usage, log auditing, rotation, connection limits, monitoring, and high‑availability deployment across multiple servers, providing ready‑to‑run commands and scripts.

Linux · SFTP · SSH
0 likes · 14 min read
Alibaba Cloud Observability
Jun 3, 2025 · Cloud Native

How PromQL Copilot Turns Natural Language into Precise Monitoring Queries

PromQL Copilot leverages Alibaba Cloud's observability platform and AI techniques to convert ambiguous natural‑language monitoring requests into accurate PromQL statements, addressing challenges of ambiguity, domain knowledge, and metric coverage while providing generation, explanation, diagnosis, and recommendation features for cloud‑native environments.

AI · Cloud Native · Metrics
0 likes · 12 min read
Liangxu Linux
Jun 2, 2025 · Operations

10 Must‑Know Ops Tools to Transform Reactive Firefighting into Proactive Management

This guide presents ten essential operations tools—including Zabbix, Prometheus, MySQL, Redis, Ansible, Jenkins, Docker, Kubernetes, LVS, and Kafka—covering monitoring, databases, automation, containerization, and load balancing, to help engineers shift from reactive firefighting to proactive, efficient system management.

Automation · Containers · Messaging
0 likes · 4 min read
Alibaba Cloud Developer
May 27, 2025 · Artificial Intelligence

How to Build AI-Powered Java Apps with Spring AI and DeepSeek

This guide walks Java developers through integrating Spring AI with large‑model services such as DeepSeek, covering setup, API key configuration, code examples for synchronous and streaming calls, reactive implementation, monitoring with Actuator, and compatibility with OpenAI‑style APIs.

AI integration · DeepSeek · Java
0 likes · 9 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Infrastructure · Operations
0 likes · 17 min read
Java Architecture Diary
May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Micrometer · Observability · Spring AI
0 likes · 12 min read
MaGe Linux Operations
May 25, 2025 · Cloud Native

Master Docker Volume Management: From Basics to Advanced Ops

This comprehensive guide walks you through Docker volume creation, inspection, mounting, backup, restoration, cross‑host migration, labeling, driver configuration, security permissions, encryption, monitoring, troubleshooting, capacity planning, and automation scripts, providing practical commands and best‑practice recommendations for reliable container storage management.

Automation · Container · monitoring
0 likes · 8 min read
Su San Talks Tech
May 24, 2025 · Backend Development

12 Proven SpringBoot Performance Hacks to Boost Your API Speed

Discover twelve practical SpringBoot performance optimization techniques—from connection pool tuning and JVM memory settings to caching, async processing, and full‑stack monitoring—each illustrated with code snippets and actionable guidance to prevent full‑table scans, OOM errors, and latency spikes in high‑traffic applications.

JVM · Java · Performance Optimization
0 likes · 13 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
0 likes · 12 min read
Cloud Native Technology Community
May 22, 2025 · Information Security

How to Prevent Common Kubernetes Security Mistakes and Harden Your Cluster

This article analyzes typical Kubernetes security pitfalls—from weak authentication and overly permissive network policies to missing real‑time monitoring, exposed services, outdated versions, and default component settings—and provides concrete, layered mitigation steps and tool recommendations.

Best Practices · Cloud Native · Kubernetes
0 likes · 13 min read
Big Data Technology & Architecture
May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data · Flink · Interview
0 likes · 7 min read
Architect's Tech Stack
May 20, 2025 · Operations

Visualizing Nginx Access Logs with Loki and Grafana

This guide explains how to collect Nginx access logs, convert them to JSON, store them in Loki using Promtail, and visualize the data with Grafana dashboards, including installation of required modules, Docker deployment, and world‑map panel configuration.

Grafana · JSON · Logging
0 likes · 8 min read
Java Tech Enthusiast
May 18, 2025 · Operations

Ten Rules for Writing High‑Quality Logs in Production Systems

This article presents ten practical rules for producing high‑quality, searchable logs—including unified formatting, stack‑trace inclusion, proper log levels, complete parameters, data masking, asynchronous writing, trace‑ID linking, dynamic level control, structured storage, and intelligent monitoring—to help developers quickly diagnose issues in high‑traffic applications.

Best Practices · Logging · logback
0 likes · 11 min read
Liangxu Linux
May 15, 2025 · Operations

10 Critical Server Ops Mistakes to Avoid and Real-World Lessons

This article outlines ten common server operation pitfalls—such as forced power‑offs, reckless experiments in production, neglecting firewall rules, running unknown scripts as root, unbacked‑up database changes, weak SSH settings, poor log management, exposed ports, unmonitored changes, and delayed patching—each illustrated with real‑world cases and practical remediation advice.

Security · System Administration · backup
0 likes · 7 min read
Raymond Ops
May 11, 2025 · Cloud Native

How to Expose Ingress Metrics for Prometheus Monitoring in Kubernetes

This guide details how to expose the nginx‑ingress metrics port, configure static and ServiceMonitor‑based scraping in Prometheus Operator, create necessary secrets, and integrate the metrics into Grafana dashboards, providing a complete Kubernetes‑native solution for monitoring ingress traffic.

Cloud Native · Prometheus · ServiceMonitor
0 likes · 6 min read
MaGe Linux Operations
May 11, 2025 · Cloud Native

How to Build a High‑Performance, Highly‑Available Cloud‑Native Ingress Gateway

When an Ingress gateway faces traffic exceeding 100,000 QPS, this guide outlines systematic performance optimizations, configuration tweaks, distributed architecture designs, traffic management, monitoring, and disaster‑recovery strategies—including hardware scaling, kernel tuning, DPDK, rate limiting, horizontal scaling, service mesh integration, and CDN offloading—to achieve high concurrency and high availability.

Scalability · cloud-native · high-availability
0 likes · 8 min read
Raymond Ops
May 9, 2025 · Operations

Build a Complete Prometheus Monitoring Stack with Docker

This tutorial explains Prometheus' core components, shows how to deploy Prometheus Server, Node Exporter, cAdvisor, and Grafana as Docker containers on two hosts, configures scraping and alerting, and demonstrates visualizing metrics with ready‑made Grafana dashboards.

Alertmanager · Docker · Exporter
0 likes · 8 min read
Java Captain
Apr 22, 2025 · Operations

Improving Cron Job Stability and Monitoring with Best Practices and Healthchecks

The article analyzes common cron job failures such as accidental deletions, OOM crashes, and lack of monitoring, then proposes standardized Jenkins deployment, automatic server selection, lock mechanisms, queue-based processing, status awareness, and the use of the open‑source Healthchecks system to achieve proactive detection and alerting.

Automation · Operations · Task Scheduling
0 likes · 8 min read
DeWu Technology
Apr 21, 2025 · Backend Development

Design and Evolution of a Unified Exchange Mall Middleware Platform

The unified exchange mall middleware platform consolidates disparate points‑redemption and lottery flows into a four‑layer architecture—business, gameplay templates, domain models, and downstream services—offering standardized APIs, dynamic RPC routing, Redis‑based inventory control, anti‑fraud safeguards, and built‑in monitoring, thereby cutting development costs, enhancing maintainability, and ensuring system stability.

Golang · RPC · anti-fraud
0 likes · 18 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · DevOps · Infrastructure
0 likes · 9 min read
Architecture and Beyond
Apr 12, 2025 · Backend Development

How to Keep Your AIGC Service Stable: Queueing and Rate‑Limiting Strategies

This article explains why AIGC services need queueing systems and rate‑limiting, describes the user‑facing behaviors of both mechanisms, outlines design goals, compares queue and limiter implementations, and provides practical guidance on selecting middleware, monitoring, and integrating them into a production workflow.

AIGC · Message Queue · Rate Limiting
0 likes · 28 min read
FunTester
Apr 12, 2025 · Operations

How to Design Effective Fault‑Testing Cases for Resilient Distributed Systems

This article explains why fault testing is essential for modern distributed and cloud environments, outlines core goals, design principles, common fault categories, practical implementation strategies such as chaos engineering and gray releases, and shows how to analyze results to continuously improve system reliability.

Distributed Systems · chaos engineering · fault testing
0 likes · 18 min read
Raymond Ops
Apr 7, 2025 · Operations

How to Deploy Prometheus on Kubernetes and Resolve Alertmanager Port Issues

This guide explains what Prometheus monitoring is, walks through downloading the correct version for a Kubernetes cluster, customizing alert rules, deploying and cleaning up Prometheus, and troubleshooting common Alertmanager connection problems by checking DNS and network configurations.

Alertmanager · Prometheus · Troubleshooting
0 likes · 9 min read
Deepin Linux
Apr 2, 2025 · Operations

Comprehensive Guide to bpftrace: Features, Architecture, Installation, and Practical Use Cases

This article introduces bpftrace, an eBPF‑based dynamic tracing tool for Linux, explains its core concepts, technical architecture, installation methods, basic syntax, and demonstrates real‑world performance analysis, fault diagnosis, and security monitoring scenarios while comparing it with DTrace, SystemTap, and BCC.

Debugging · Linux performance · System Tracing
0 likes · 24 min read
The Dominant Programmer
Mar 22, 2025 · Databases

Common Redis Performance Issues and How to Make Your Cache Fly

This article examines the most frequent Redis performance bottlenecks—including high memory usage, network latency, misconfiguration, poor data‑structure choices, and suboptimal persistence—explains why they occur, and provides concrete optimization techniques, monitoring commands, real‑world case studies, and emerging trends to keep your cache fast and stable.

Data Structures · Network Latency · Performance Optimization
0 likes · 8 min read
Tencent Cloud Developer
Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential to detect resource contention, component failures, and network issues; it involves tracking core component metrics such as API server latency, etcd write times, scheduler delays, as well as node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters for comprehensive observability.

Cloud Native · Exporters · Kubernetes
0 likes · 23 min read
JD Tech
Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Supply Chain · dashboard
0 likes · 10 min read
Alibaba Cloud Native
Mar 13, 2025 · Cloud Native

How to Extend SAE with Sidecar Containers for Custom Logging and Monitoring

This article explains how Alibaba Cloud's Serverless Application Engine (SAE) uses sidecar containers to let users add custom log collection, metric monitoring, and resource isolation without modifying their main application code, detailing configuration modes, operational tools, and a step‑by‑step implementation example.

SAE · Serverless · monitoring
0 likes · 12 min read
php Courses
Mar 13, 2025 · Backend Development

Effective Strategies for Optimizing PHP Application Performance

Optimizing PHP applications involves a combination of code-level improvements—such as caching, efficient algorithms, and query optimization—and server-side configurations like upgrading PHP, enabling opcode caches, tuning web servers, and leveraging CDNs, along with monitoring tools and asynchronous processing to achieve faster, more scalable performance.

Caching · PHP · Performance Optimization
0 likes · 5 min read
JD Tech Talk
Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Supply Chain · dashboard
0 likes · 11 min read
Efficient Ops
Mar 9, 2025 · Artificial Intelligence

Essential LLMOps Tools: Build, Deploy, Monitor, and Manage Large Language Models

LLMOps, the end-to-end methodology for managing large language models, encompasses a curated set of development, deployment, monitoring, and local management tools—such as LangChain, vLLM, LangSmith, and Ollama—enabling practitioners to efficiently build, scale, and maintain AI applications.

AI development · LLMOps · Large Language Models
0 likes · 6 min read
dbaplus Community
Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

Incident Response · JVM Optimization · SRE
0 likes · 20 min read
Ops Development & AI Practice
Mar 5, 2025 · Cloud Computing

Master Advanced Terraform Techniques: Best Practices for Reliable IaC

This guide presents advanced Terraform techniques and best practices—including code style, modular design, state management, version control, CI/CD integration, security, and monitoring—to help engineers write more professional, maintainable, and secure infrastructure-as-code configurations.

Security · best-practices · infrastructure-as-code
0 likes · 12 min read
Practical DevOps Architecture
Mar 5, 2025 · Operations

Zabbix Agent Active Mode Workflow and Configuration Guide

This article explains the Zabbix‑Agent active mode workflow, detailing how the agent initiates TCP connections to the Zabbix‑Server to request monitoring items, receives the item list, sends collected data back, and provides step‑by‑step configuration of the agent and server, including template cloning and essential parameters.

Active Mode · agent configuration · monitoring
0 likes · 6 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Circuit Breaker · Rate Limiting · chaos engineering
0 likes · 11 min read
Cognitive Technology Team
Mar 1, 2025 · Databases

Understanding and Mitigating Redis Large‑Key Issues

The article explains what constitutes a Redis large key, outlines its performance and stability risks, describes common scenarios and root causes, and provides practical detection commands, mitigation techniques such as splitting, compression, proper data modeling, and monitoring strategies to prevent future issues.

Database · Memory Optimization · Redis
0 likes · 6 min read
macrozheng
Feb 21, 2025 · Backend Development

Boost SpringBoot Performance: Monitoring, Profiling, and Optimization Techniques

This guide walks through practical SpringBoot performance improvements, covering metric exposure with Prometheus, flame‑graph profiling via async‑profiler, distributed tracing with SkyWalking, HTTP and Tomcat tuning, and layer‑specific optimizations for controllers, services, and data access.

monitoring
0 likes · 17 min read
Architecture Development Notes
Feb 19, 2025 · Operations

Avoid Prometheus Label Pitfalls: Best Practices for Scalable Monitoring

This article examines common label misuse in Prometheus, explains why adding global labels to every metric can cause data bloat, configuration rigidity, and dimensional pollution, and provides concrete best‑practice patterns, dynamic injection techniques, and governance rules to keep monitoring systems efficient and maintainable.

Best Practices · Cloud Native · Labels
0 likes · 7 min read
DevOps Cloud Academy
Feb 17, 2025 · Operations

Top 10 AI Tools Transforming DevOps Engineering

This article reviews ten AI‑powered tools—including Jenkins, Ansible, Puppet, Dynatrace, Splunk, GitHub Copilot, New Relic, Azure DevOps, Prometheus, and Chef—that enhance DevOps workflows through predictive analytics, automated rollback, intelligent monitoring, and code assistance, helping teams achieve faster, more reliable software delivery.

AI · Automation · DevOps
0 likes · 14 min read
Liangxu Linux
Feb 16, 2025 · Operations

How to Quickly Visualize Shell Commands with Sampler – Install, Configure, and Use

Sampler is a lightweight tool that runs shell commands, visualizes their output, and triggers alerts, using simple YAML configuration; the guide explains why it’s useful, how to install it on macOS, Linux, and Windows, and provides detailed examples of components, triggers, interactive shells, and real‑world database monitoring scenarios.

YAML · alerts · monitoring
0 likes · 14 min read
Deepin Linux
Feb 12, 2025 · Operations

Comprehensive Guide to Linux Server Fault Diagnosis and Troubleshooting

This article provides a detailed overview of common Linux server failures, a step‑by‑step methodology for fault isolation, practical monitoring tools and commands, and a real‑world case study illustrating diagnosis and remediation techniques for production environments.

Linux · Troubleshooting · monitoring
0 likes · 26 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Incident Response · Observability
0 likes · 12 min read
Liangxu Linux
Feb 9, 2025 · Fundamentals

Mastering Linux Processes: From Basics to Advanced Monitoring and Management

This guide explains what a process is, how it differs from a program, and how its lifecycle unfolds. It then covers monitoring and interpreting process states with ps and top; managing processes with kill, killall, and pkill; running jobs in the background with screen or nohup; adjusting priorities with nice and renice; and reading load‑average metrics for performance troubleshooting.

Linux · Load Average · monitoring
0 likes · 32 min read
dbaplus Community
Feb 6, 2025 · Databases

How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System

This article recounts the development of ZzoOnlineDDL, a MySQL schema‑change platform, detailing its 1.0 limitations, the 2.0 architectural upgrades, feature set—including intelligent tool selection, timed execution, sharding support, monitoring, and retry mechanisms—and lessons learned from real‑world incidents such as MDL locks, disk pressure, and unique‑index pitfalls.

MySQL · Online DDL · Schema Change
0 likes · 34 min read
Efficient Ops
Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

Availability · Cloud Native · DevOps
0 likes · 4 min read
IT Architects Alliance
Feb 5, 2025 · Cloud Native

Performance Optimization Strategies for Cloud‑Native Applications

This article examines the rapid adoption of cloud‑native architectures and presents a comprehensive guide to identifying performance bottlenecks and applying architectural, resource‑management, caching, networking, and tooling techniques—such as Kubernetes, Prometheus, Grafana, and JMeter—to achieve high‑performance, scalable cloud‑native systems.

Caching · cloud-native · monitoring
0 likes · 22 min read
Rare Earth Juejin Tech Community
Feb 5, 2025 · Frontend Development

Front‑End Tracking (埋点) Overview, Monitoring Types, Performance Metrics, and Implementation Guide

This article explains front‑end tracking concepts, outlines data, performance, and error monitoring, details common performance metrics, compares code‑based, visual, and automatic tracking solutions, and provides practical JavaScript snippets for event collection, error handling, page‑view reporting, and data transmission methods such as XHR, image GIF, and sendBeacon.

Frontend · Web Analytics · monitoring
0 likes · 16 min read
JavaEdge
Feb 2, 2025 · Artificial Intelligence

Mastering LLMOps: From Model Deployment to Scalable AI Operations

This article explains LLMOps: its goals, core activities, benefits, and best practices. It also shows how an LLMOps platform such as Dify can cut development time and simplify prompt engineering, data preparation, monitoring, and deployment of large language models.

AI Operations · Data Management · LLMOps
0 likes · 13 min read
Soul Technical Team
Jan 24, 2025 · Operations

Migration from Thanos to VictoriaMetrics: Architecture, Plan, Issues, and Benefits

This article details the end‑to‑end migration from Thanos to VictoriaMetrics, covering background analysis, architectural comparison, a phased migration plan, encountered configuration and performance issues, resolution strategies, and the resulting performance, cost, and scalability improvements for the monitoring system.

Thanos · Time-series · VictoriaMetrics
0 likes · 16 min read
Top Architect
Jan 21, 2025 · Backend Development

DynamicTp: A SpringBoot‑Based Dynamic Thread‑Pool Framework for Java Applications

The article introduces DynamicTp, a SpringBoot-based dynamic thread‑pool framework that enables real‑time adjustment, monitoring, and alerting of ThreadPoolExecutor parameters via various configuration centers, outlines its architecture, modules, features, and integration with third‑party components, and provides usage guidance for Java backend developers.

DynamicTp · Java · SpringBoot
0 likes · 12 min read
Efficient Ops
Jan 19, 2025 · Operations

How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Ops Playbook

After a midnight CPU alarm, I walked through rapid diagnosis, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring a high‑load Java service back to stability, illustrating a comprehensive incident‑response workflow for modern operations teams.

CPU troubleshooting · Docker deployment · JVM profiling
0 likes · 7 min read
Linux Cloud Computing Practice
Jan 17, 2025 · Operations

10 Essential Linux Sysadmin Tools Every Engineer Should Master

This guide outlines the ten fundamental Linux operations tools and skills—ranging from basic system knowledge and networking services to shell scripting, text processing, databases, firewalls, monitoring, clustering, and backup—that every aspiring sysadmin should learn and practice thoroughly.

Database · Operations · monitoring
0 likes · 6 min read
Sohu Tech Products
Jan 15, 2025 · Backend Development

Deep Dive into Druid Connection Pool: Initialization, Retrieval, and Recycling Explained

This technical guide breaks down Alibaba's Druid JDBC connection pool, detailing its initialization process, how connections are fetched and returned, the internal threads and condition‑signal coordination, execution handling, recommended configurations, and monitoring integration, all illustrated with code snippets and diagrams.

Connection Pool · Database · Druid
0 likes · 23 min read