Tagged articles
2195 articles
Alibaba Cloud Developer
Aug 19, 2024 · Artificial Intelligence

Ensuring Stable AI Agents: Engineering Practices, RAG, and Monitoring

This article shares engineering insights from Hema’s AI smart customer service deployment, detailing key stability factors for AI agents—including hallucination mitigation, memory integration, RAG enhancement, exception handling, and comprehensive monitoring—to improve reliability and performance in real‑world e‑commerce chatbot scenarios.

AI Agent · LLM · RAG
0 likes · 13 min read
Efficient Ops
Aug 14, 2024 · Operations

Building a Real-Time Log Monitoring System with ELK, Kafka, and Python

This article details how to construct a log‑monitoring platform using the ELK stack, Kafka buffering, and a Python scheduler to collect, process, and alert on error logs, offering practical configuration tips and performance optimizations for production environments.

ELK · Elasticsearch · Kafka
0 likes · 10 min read
Architect
Aug 14, 2024 · Backend Development

How to Build a Scalable Distributed Task Scheduling Platform

This article outlines the essential components and design considerations for creating a distributed task scheduling platform, covering triggers, scheduling strategies, executors, task chains, circuit breakers, exception handling, blocking control, service discovery, monitoring, and a management console.

Backend Architecture · Circuit Breaker · Distributed Scheduling
0 likes · 9 min read
JD Tech Talk
Aug 13, 2024 · Frontend Development

Monitoring and Inspection Practices for Enterprise Front‑End Applications

This article describes how a large enterprise front‑end team implements real‑time monitoring, scheduled inspections, alert strategies, performance metrics, error handling, custom reporting, and mobile/native monitoring to ensure system stability, improve user experience, and continuously optimize application performance.

Frontend · alerting · error-handling
0 likes · 23 min read
Tencent Cloud Developer
Aug 13, 2024 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Networking, Fault Handling, Monitoring, Service Governance, Testing, and Deployment

This comprehensive guide to backend development explains essential system and architecture design principles, networking strategies, fault and exception handling, monitoring and alerting, service governance, testing methodologies, and deployment practices, offering best‑practice advice and highlighting common pitfalls for building reliable, scalable internet services.

Backend Development · Deployment · architecture
0 likes · 28 min read
ITPUB
Aug 11, 2024 · Operations

Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

This article details how Bilibili redesigned its monitoring system to overcome explosive metric growth by separating collection and storage, adopting VictoriaMetrics, implementing zone‑based scheduling, automating PromQL query replacement, and using Flink for efficient pre‑aggregation, resulting in dramatically lower latency and higher stability.

Flink · Observability · PromQL
0 likes · 31 min read
Bilibili Tech
Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute and storage, adopts VictoriaMetrics for compact time‑series storage and a zone‑based scheduler, introduces push‑based ingestion, uses Flink for real‑time pre‑aggregation and automatic PromQL rewrite, delivering ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Metrics · Observability
0 likes · 29 min read
ITPUB
Aug 8, 2024 · Operations

Why Solid Monitoring Must Come Before Observability Projects (And How to Build It)

Before launching costly observability initiatives, ensure your monitoring is comprehensive and efficient, covering business, application, component, resource, network, and endpoint metrics, and that you have the data collection, storage, alerting, and event‑distribution capabilities to turn raw signals into actionable insights.

Observability · alerting · monitoring
0 likes · 9 min read
JD Retail Technology
Aug 8, 2024 · Frontend Development

Ensuring Frontend System Stability through Monitoring and Automated Inspection

This article explains how modern front‑end teams ensure system stability and high‑quality operation by implementing comprehensive monitoring and automated inspection, covering background, significance, architecture, real‑time and scheduled checks, performance metrics, alert strategies, error handling, custom reporting, and future improvement plans.

DevOps · Web · alerting
0 likes · 24 min read
Zhuanzhuan Tech
Aug 7, 2024 · Operations

Building a Dynamic Grafana Dashboard for Push System TraceId Visualization

This article describes how to use Grafana's Flowcharting plugin and Prometheus metrics to create a dynamic, interactive dashboard that visualizes each logical node of a push notification pipeline, enabling rapid trace‑ID based troubleshooting and reducing manual investigation effort.

Grafana · Operations · dynamic-view
0 likes · 11 min read
Open Source Linux
Aug 5, 2024 · Operations

How to Manage Over 10,000 Network Devices with Systematic, Automated Operations

This guide outlines a comprehensive, automated strategy for operating more than ten thousand network devices, covering asset documentation, topology planning, unified monitoring, automation scripts, emergency response, security management, regular maintenance, staff training, and visual management tools.

Scalability · automation · device management
0 likes · 6 min read
IT Services Circle
Aug 2, 2024 · Operations

Shell Script for Collecting Linux CPU, Memory, and Disk I/O Metrics

This article presents a Bash script that gathers comprehensive Linux system metrics—including CPU core count, utilization percentages, context switches, interrupts, load averages, memory and swap usage, and disk I/O statistics—explaining each command and its purpose for effective server monitoring.

Bash · Linux · SystemMetrics
0 likes · 13 min read
Open Source Linux
Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Configuration Management · Containerization · Infrastructure
0 likes · 8 min read
FunTester
Jul 30, 2024 · Operations

Mastering True Observability: Models, Practices, and AI‑Driven Automation

This article explains why true observability is essential for modern software, outlines its five core pillars, details a four‑stage maturity model with benefits and drawbacks, and provides practical steps—including data collection, team organization, and AI automation—to advance from basic monitoring to predictive, self‑healing systems.

AI · Logging · Maturity Model
0 likes · 13 min read
JD Cloud Developers
Jul 29, 2024 · Backend Development

Designing Robust Backend Services: Path Standards, Security, Monitoring, and Degradation Strategies

This article outlines comprehensive best‑practice guidelines for designing robust, secure, and maintainable backend services—covering API path conventions, request handling, parameter design, error codes, dependency management, monitoring, degradation strategies, legacy service handling, encryption, access control, and tamper‑proof mechanisms, with practical code examples.

Security · Service Architecture · api-design
0 likes · 18 min read
DaTaobao Tech
Jul 29, 2024 · Operations

Testing Environment Reliability, Routing Isolation, Monitoring, and Efficient Deployment Practices

Alibaba Taotian’s testing platform now lets business owners self‑service reliable environments by binding accounts to isolated routes, monitoring lightweight health metrics with automated self‑healing, accelerating deployments via code caching and JVM tricks, and enabling rapid “time‑travel” scenario testing, while planning tighter observability and production alignment.

Observability · Testing Environment · deployment efficiency
0 likes · 11 min read
Liangxu Linux
Jul 28, 2024 · Cloud Native

Avoid These 10 Common Kubernetes Mistakes to Boost Reliability

This article outlines the most frequent Kubernetes pitfalls—such as missing resource requests, omitted health checks, using the :latest tag, over‑privileged containers, insufficient monitoring, default namespace misuse, weak security settings, absent PodDisruptionBudgets, lack of pod anti‑affinity, and improper load‑balancing—and provides concrete commands, YAML examples, and best‑practice recommendations to prevent them.

Autoscaling · Best Practices · Kubernetes
0 likes · 13 min read
Efficient Ops
Jul 28, 2024 · Operations

Building a Resilient, High‑Performance Website: Domains, CDN, Security & Ops

This guide outlines a comprehensive, step‑by‑step strategy for creating a highly available, secure, and scalable website—from buying and protecting multiple domains, configuring DNS and CDN, setting up image and database servers, to implementing monitoring, redundancy, high‑concurrency testing, and disaster‑recovery plans.

CDN · high availability · monitoring
0 likes · 13 min read
Architecture and Beyond
Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

Frontend · Observability · high-availability
0 likes · 44 min read
Code Ape Tech Column
Jul 26, 2024 · Operations

Bash Scripts for File Consistency Checks, Log Monitoring, and System Automation

This article presents a comprehensive collection of Bash scripts that perform tasks such as verifying file consistency across servers, scheduled log cleaning, network traffic monitoring, numeric analysis in files, automated FTP downloads, interactive number games, Nginx 502 detection, variable assignments, bulk file renaming, IP address validation, and various system administration operations.

Bash · Shell scripting · System Administration
0 likes · 24 min read
21CTO
Jul 23, 2024 · Information Security

What the Microsoft Blue‑Screen Crisis Teaches About IT Risk Management

The massive Microsoft blue‑screen outage caused by a faulty CrowdStrike update highlights the dangers of single‑system reliance, poor code quality, insufficient QA, and the need for staged rollouts, robust backup, real‑time monitoring, and proactive incident‑response strategies for modern IT organizations.

IT Operations · Incident Response · disaster recovery
0 likes · 10 min read
Soul Technical Team
Jul 23, 2024 · Big Data

Kafka Stability Challenges and Governance Framework at Soul

This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.

Kafka · Operations · Streaming
0 likes · 30 min read
ITPUB
Jul 22, 2024 · Operations

How We Upgraded Watcher to Second‑Level Monitoring for Real‑Time Order Alerts

This article details the end‑to‑end redesign of Qunar Travel's Watcher monitoring platform from minute‑level to second‑level precision, covering architectural changes, storage engine migration, client‑side metric collection, server‑side scheduling, dashboard and alarm adaptations, and the resulting operational improvements.

DevOps · Observability · Time-series
0 likes · 20 min read
JD Tech
Jul 17, 2024 · Information Security

Service Design Tips and Security Practices for Robust API Development

This article presents comprehensive guidelines for designing flexible, secure, and maintainable API services, covering standardized paths, request handling, parameter design, business logic, exception management, dependency classification, monitoring, degradation strategies, handling legacy services, and encryption measures to ensure robust service architecture.

Error Handling · Service Architecture · api-design
0 likes · 19 min read
MaGe Linux Operations
Jul 16, 2024 · Cloud Native

How Prometheus Sends Alerts: Rules, Templates, and Frequency Explained

This article explains how Prometheus generates and sends alerts, covering the definition of alert rules with PromQL, grouping, templating, configuring evaluation intervals, deploying a custom alert receiver in Kubernetes, and analyzing alert payloads and delivery frequency, while also detailing alert silencing and resolution behavior.

Alertmanager · Go · Kubernetes
0 likes · 26 min read
Alibaba Cloud Observability
Jul 16, 2024 · Cloud Native

How to Seamlessly Migrate Your Self‑Hosted Prometheus + Thanos to Alibaba Cloud Managed Prometheus

This guide explains why many users still run self‑built Prometheus + Thanos, outlines the common deployment scenarios and pain points, and provides detailed step‑by‑step migration procedures—including metric collection, visualization, and alerting—for moving to Alibaba Cloud's fully managed Prometheus service across Kubernetes, ECS, and IDC environments.

Alibaba Cloud · Cloud Native · Prometheus
0 likes · 14 min read
JD Tech
Jul 12, 2024 · Backend Development

Dynamic Thread Pool: Monitoring, Alerting, and Runtime Parameter Adjustment

The article explains the concept of a dynamic thread pool, identifies common pain points such as invisible runtime status, hard‑to‑trace rejections, and slow parameter tuning, and presents a comprehensive solution that includes monitoring, alerting, automatic stack dumping, and live parameter refresh for Java backend services.

Dynamic Configuration · Java · monitoring
0 likes · 20 min read
JD Retail Technology
Jul 12, 2024 · Backend Development

Service Design Tips and Best Practices for Robust API Development

This article explores essential service design considerations beyond standard guidelines, covering API path structuring, request handling, parameter design, security measures, monitoring, degradation strategies, and code examples to help build flexible, secure, and maintainable backend services.

Backend Development · Security · Service Architecture
0 likes · 19 min read
Architect
Jul 11, 2024 · Backend Development

Architecture Refactoring of a Consumer Installment System: Background, Goals, Design, Deployment, and Monitoring

This article presents a comprehensive case study of refactoring a consumer installment platform, covering business restructuring, technical debt resolution, design of domain and module layers, code redesign with design patterns, phased deployment, monitoring setup, and the overall benefits achieved.

Design Patterns · architecture · microservices
0 likes · 11 min read
Software Development Quality
Jul 11, 2024 · Information Security

How to Implement Secure and Compliant Log Management Standards

This guide outlines the purpose, scope, principles, and detailed specifications for log management—including file naming, retention periods, content rules, security handling, and monitoring—to ensure reliable issue tracing, data safety, and regulatory compliance across all system development projects.

Compliance · Log Management · Operations
0 likes · 12 min read
Alibaba Cloud Native
Jul 10, 2024 · Cloud Native

Migrate Self‑Hosted Prometheus + Thanos to Alibaba Cloud Managed Service

This guide explains how to move from a self‑built open‑source Prometheus + Thanos monitoring stack to Alibaba Cloud's fully managed Prometheus service, covering typical deployment scenarios, migration requirements, step‑by‑step procedures for metric collection, visualization, and alerting, and key considerations for each environment.

Alibaba Cloud · Prometheus · Thanos
0 likes · 15 min read
Cloud Native Technology Community
Jul 9, 2024 · Cloud Native

Answering the Top 9 Questions About Monitoring in Kubernetes

This article discusses essential Kubernetes monitoring topics, including cost tracking, tool selection, observability frameworks, responsibility allocation, baseline establishment, namespace best practices, the importance of monitoring, backup solutions, and a comparison of Datadog versus Splunk for metrics.

Datadog · Kubernetes · Observability
0 likes · 6 min read
Liangxu Linux
Jul 8, 2024 · Operations

7 Practical Linux Performance Optimization Tips Every Engineer Should Know

This article compiles seven hands‑on Linux performance‑optimization practices, covering key factors such as CPU, memory, disk I/O, network, swap usage, and TCP tuning, and provides concrete commands and step‑by‑step troubleshooting methods for system administrators and DevOps engineers.

Linux · Optimization · Swap
0 likes · 19 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

Incident Response · monitoring · system stability
0 likes · 26 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

business continuity · disaster recovery · fault management
0 likes · 7 min read
JD Retail Technology
Jul 5, 2024 · Backend Development

Dynamic Thread Pool: Monitoring, Alerting, and Runtime Parameter Adjustment

The article explains the concept of dynamic thread pools, analyzes common pain points such as invisible runtime status, hard‑to‑locate rejections, and slow parameter tuning, and presents a comprehensive solution that includes monitoring, alerting, automatic stack tracing, and on‑the‑fly parameter refresh using Java code.

Dynamic Configuration · Java · monitoring
0 likes · 19 min read
Zhuanzhuan QA
Jul 5, 2024 · Backend Development

Design and Implementation of a Configuration Checking Tool for an After‑Sales System

The article describes how a configuration‑checking tool was designed and built to automatically compare baseline business configuration data with the after‑sales system's settings, detect mismatches before use, and alert responsible testers, thereby reducing manual verification effort and preventing workflow disruptions.

Configuration Management · System Design · backend
0 likes · 5 min read
DevOps Operations Practice
Jul 4, 2024 · Operations

Building an Enterprise‑Level Monitoring System: Requirements, Technology Selection, Architecture, Implementation Steps, and Maintenance

This article provides a comprehensive guide to designing and deploying an enterprise‑grade monitoring system, covering requirement analysis, tool selection such as Prometheus and Zabbix, system architecture, step‑by‑step implementation, alerting, visualization, and ongoing maintenance to ensure reliable IT operations.

Grafana · Operations · Prometheus
0 likes · 7 min read
macrozheng
Jul 3, 2024 · Operations

How to Visualize SpringBoot Metrics with Grafana and Prometheus Using Docker

This guide walks through installing Grafana and Prometheus with Docker, configuring node_exporter to collect system metrics, adding SpringBoot Actuator and Micrometer for application metrics, setting up Prometheus scrape jobs, and importing ready‑made Grafana dashboards to achieve real‑time monitoring and alerting.

Docker · Grafana · Prometheus
0 likes · 10 min read
360 Smart Cloud
Jul 3, 2024 · Operations

Practical Practices for Enhancing Kafka Cluster Stability at 360

This article details 360's comprehensive approach to improving Apache Kafka cluster stability through proactive operations, capacity assessment, parameter tuning, monitoring, version upgrades, and traffic control, offering concrete guidelines and best‑practice recommendations for large‑scale message‑queue deployments.

Cluster · Kafka · Upgrade
0 likes · 33 min read
Alibaba Cloud Native
Jul 2, 2024 · Cloud Native

How Go Agent Enables Zero‑Intrusion Monitoring for Golang Microservices on Kubernetes

This guide explains how the Go Agent injects observability code at compile time to provide automatic tracing and metrics for Golang microservices running on Kubernetes, covering its architecture, supported SDKs, compatibility, and step‑by‑step deployment instructions including component installation, binary compilation, and YAML configuration.

ARMS · Go · Instrumentation
0 likes · 17 min read
High Availability Architecture
Jun 28, 2024 · Backend Development

Deep Dive into pfinder: Architecture, Bytecode Enhancement, and Tracing Mechanisms

This article provides a comprehensive technical overview of pfinder, JD's next‑generation APM system, covering its core concepts, feature set, comparison with other tracing tools, bytecode modification techniques using ASM, Javassist, ByteBuddy and ByteKit, Java agent injection via JVMTI and Instrumentation, plugin loading, trace‑ID propagation across threads, and a prototype hot‑deployment capability.

APM · BytecodeInstrumentation · Java
0 likes · 23 min read
Open Source Linux
Jun 27, 2024 · Operations

Comprehensive Guide to Building a Resilient, High‑Performance Web Infrastructure

This guide outlines essential steps for creating a robust, high‑availability website architecture, covering domain acquisition, DNS management, CDN deployment, image caching, data center selection, monitoring, DDoS mitigation, redundancy, server configuration, database replication, testing environments, security practices, and operational tooling.

Cloud Services · DDoS protection · Operations
0 likes · 12 min read
Efficient Ops
Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

SRE · monitoring · system reliability
0 likes · 20 min read
dbaplus Community
Jun 24, 2024 · Operations

How Qunar Achieved Sub‑Second Monitoring to Slash Fault Detection Time to Under 1 Minute

Qunar’s Watcher monitoring platform was upgraded from minute‑level to second‑level precision, redesigning storage, data collection, and alerting pipelines, adopting VictoriaMetrics, enhancing client SDKs, and adding fine‑grained alarm rules, which reduced fault detection from four minutes to under one minute while improving reliability and scalability.

DevOps · Observability · Time Series Database
0 likes · 20 min read
Efficient Ops
Jun 23, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, while also providing a practical Shell script and an Ansible playbook to illustrate automation in daily workflows.

Infrastructure · devops tools · monitoring
0 likes · 8 min read
ITPUB
Jun 22, 2024 · Cloud Native

How to Detect and Prevent OOM and CPU Throttling in Kubernetes

This article explains why memory OOM and CPU throttling are critical issues in Kubernetes, shows how limits and requests work, demonstrates monitoring techniques with Prometheus and cAdvisor, and provides practical best‑practice recommendations to avoid pod eviction and performance degradation.

CPU throttling · Kubernetes · monitoring
0 likes · 9 min read
Qunar Tech Salon
Jun 14, 2024 · Operations

Design and Implementation of a Second-Level Monitoring System for Qunar Travel

This article details the background, overall architecture, challenges, and step‑by‑step redesign of Qunar Travel's Watcher monitoring platform to achieve second‑level (per‑second) data collection, storage, and alerting, including storage engine selection, client and server optimizations, deployment strategies, and operational outcomes.

DevOps · monitoring · time-series database
0 likes · 17 min read
Practical DevOps Architecture
Jun 13, 2024 · Operations

Comprehensive Data Center Operations Training Course Overview

This extensive training program covers everything a data center operations engineer needs—from foundational infrastructure management and server hardware maintenance to advanced network configuration, security hardening, monitoring, fault handling, and practical hands‑on skills for real‑world challenges.

Data Center · Infrastructure · Operations
0 likes · 6 min read
Qunar Tech Salon
Jun 12, 2024 · Artificial Intelligence

Design and Implementation of Qunar Flight Ticket Intelligent Alert (Radar) System

This article presents a comprehensive analysis and engineering of Qunar's flight‑ticket intelligent pre‑warning (Radar) system, covering the business need, value analysis, architectural redesign, feature extraction, indicator classification, accuracy quantification, multi‑algorithm anomaly detection, automatic parameter tuning, observed effects, and future plans to incorporate large‑model techniques.

Anomaly Detection · Machine Learning · Operations
0 likes · 17 min read
Open Source Tech Hub
Jun 10, 2024 · Operations

How to Set Up Zipkin Distributed Tracing in PHP Webman Projects

This guide explains Zipkin's architecture, data collection methods, and step‑by‑step installation and configuration for PHP applications, including creating tracers, recording spans, and integrating a middleware for full‑stack monitoring in Webman microservice environments.

PHP · Webman · distributed tracing
0 likes · 8 min read
DevOps Cloud Academy
Jun 4, 2024 · Operations

Comprehensive DevOps Guide: Collaboration, Automation, CI/CD, IaC, Monitoring, and Logging with Practical Code Examples

This comprehensive DevOps guide explains core concepts such as collaboration, automation, CI/CD pipelines, infrastructure as code, and monitoring/logging, and includes practical code examples for Git, shell scripts, Jenkins, GitHub Actions, AWS CodePipeline, Ansible, Docker Compose, Prometheus, Grafana, Fluentd, and Elasticsearch.

DevOps · Logging · automation
0 likes · 17 min read
Efficient Ops
Jun 2, 2024 · Operations

Why Observability Is the Key to Reliable Distributed Systems

Observability, defined as measuring system state through logs, metrics, and tracing, enhances stability of distributed architectures by enabling rapid fault detection, deeper insight, and proactive issue resolution, distinguishing it from traditional monitoring and supporting DevOps, SRE, and business objectives.

Distributed Systems · monitoring
0 likes · 17 min read
DevOps Cloud Academy
May 31, 2024 · Cloud Native

Optimizing RabbitMQ Performance on Kubernetes

This guide explains how to deploy RabbitMQ on Kubernetes and improve its performance through Helm installation, resource tuning, monitoring, scaling, security hardening, and advanced configuration techniques, providing practical code examples for each step.

Kubernetes · Performance Optimization · RabbitMQ
0 likes · 9 min read
Efficient Ops
May 29, 2024 · Operations

Essential Operations Metrics Every IT Team Should Track

In today’s competitive business landscape, tracking key operations metrics—such as availability, failure rate, MTTR, MTBF, response time, throughput, error rate, and various utilization and data integrity measures—helps organizations monitor performance, reduce costs, ensure reliability, and maintain regulatory compliance.
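
As a small illustration of how two of the listed metrics relate, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The sketch below assumes that formula; the function name and the sample figures are illustrative and are not taken from the article.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Hypothetical service: one failure roughly every 999 hours of operation,
# and one hour on average to restore it.
ratio = availability(999, 1)
print(f"availability: {ratio:.3%}")
```

Lowering MTTR improves the ratio in the same way as raising MTBF, which is one reason recovery time is usually tracked alongside failure rate.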

Availability · IT performance · monitoring
0 likes · 7 min read
dbaplus Community
May 28, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This guide reviews ten indispensable tools for operations engineers, detailing each tool's functions, ideal scenarios, advantages, and real‑world examples, and includes practical code snippets for automation, monitoring, container management, and log analysis.

DevOps · Infrastructure · automation
0 likes · 8 min read
Efficient Ops
May 28, 2024 · Operations

How to Build a Resilient High‑Traffic Website: Domains, CDN, Monitoring, and Security

This guide outlines practical steps for creating a highly available, secure, and scalable website—including domain strategy, CDN deployment, image caching, data‑center selection, monitoring, attack mitigation, redundancy, server configuration, database replication, testing environments, disaster‑recovery planning, and high‑concurrency testing.

high availability · monitoring · website infrastructure
0 likes · 12 min read
DataFunTalk
May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data · Data Governance · Data Warehouse
0 likes · 18 min read
iQIYI Technical Product Team
May 24, 2024 · Operations

High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)

iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.

Backend Architecture · Cloud Native · Video Streaming
0 likes · 18 min read
Practical DevOps Architecture
May 22, 2024 · Operations

SRE & Linux Operations Course Outline

This article presents a detailed curriculum covering fundamental infrastructure, cluster architecture, automation, log collection, Linux system administration, containerization, monitoring, security, and related DevOps tools across multiple phases and daily modules for comprehensive SRE training.

SRE · automation · cloud
0 likes · 8 min read
Tencent Cloud Developer
May 21, 2024 · Operations

Why Prometheus Metrics Aren’t 100% Accurate – The Hidden Trade‑offs Explained

The article analyzes why Prometheus sometimes returns inaccurate metric values, revealing the design trade‑offs that favor efficiency over precision, and walks through common pitfalls in rate/increase calculations, histogram P99 estimation, and practical recommendations for choosing scrape intervals and query windows.
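
To make the histogram point concrete, here is a simplified sketch of bucket-based quantile estimation using linear interpolation, the approach that the Prometheus documentation describes for `histogram_quantile`. It is not the Prometheus implementation; the function name and the sample buckets are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with a +Inf bucket, similar to `le` buckets.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper edge to interpolate towards, so fall back to
                # the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: 90 requests took <= 0.1 s,
# 98 took <= 0.5 s, and all 100 took <= 1.0 s.
sample = [(0.1, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.99, sample))
```

The two slowest requests could sit anywhere between 0.5 s and 1.0 s, so the returned value is only as precise as the bucket boundaries allow.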

Histogram · Metrics · Observability
0 likes · 20 min read
Qunar Tech Salon
May 20, 2024 · Big Data

Optimizing Kafka Production at Qunar Travel: Reducing CPU Usage by 2000 Cores

This article presents a comprehensive case study of how Qunar Travel identified and resolved Kafka production bottlenecks—through metric monitoring, thread and flush parameter tuning, and Filebeat batch adjustments—resulting in a 2000‑core CPU reduction, higher network idle rates, and lower resource consumption across three clusters.

Kafka · monitoring
0 likes · 12 min read
Qunar Tech Salon
May 13, 2024 · Operations

Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks

This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service, revealing a weekly pattern linked to RAID controller consistency checks that cause IO spikes, logback queue blockage, and ultimately Dubbo client‑side timeouts, and proposes mitigation steps and general performance‑troubleshooting guidelines.

Operations · RAID · logback
0 likes · 22 min read
Liangxu Linux
May 12, 2024 · Operations

7 Practical Linux Performance Optimization Tips Every Engineer Should Know

This guide explains the key factors that affect Linux system performance, provides step‑by‑step troubleshooting methods for CPU, memory, disk I/O and network issues, shows how to identify top resource‑hungry processes, clarifies memory reporting differences, discusses swap usage scenarios, and offers concrete TCP tuning parameters for production environments.

Linux · Optimization · TCP
0 likes · 20 min read
Java Tech Enthusiast
May 5, 2024 · Information Security

Preventing Malicious API Abuse: Security Measures and Best Practices

To prevent malicious API abuse, implement layered defenses such as firewalls to block unwanted traffic, robust captchas and SMS verification, mandatory authentication with permission controls, IP whitelisting for critical endpoints, HTTPS encryption, strict rate‑limiting via Redis, continuous monitoring with alerts, and an API gateway that centralizes filtering, authentication and throttling.
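
The rate-limiting idea can be sketched as a fixed-window counter. The summary mentions Redis; this example swaps in an in-memory dictionary so that it runs on its own, with the counter-plus-window-start pair standing in for what a Redis counter with an expiry would provide. The class name, parameters, and injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` calls per key within each time window."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # key -> (window start time, requests counted in that window)
        self._counters: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._counters[key] = (start, count)
            return False
        self._counters[key] = (start, count + 1)
        return True


# Hypothetical policy: 2 requests per client IP every 10 seconds.
limiter = FixedWindowLimiter(limit=2, window_seconds=10)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

A fixed window is the simplest variant; it can admit short bursts around window boundaries, which sliding-window or token-bucket schemes are designed to smooth out.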

API Security · Captcha · IP whitelist
0 likes · 9 min read
DevOps Operations Practice
May 2, 2024 · Operations

Quick Deployment of a Zabbix Monitoring Platform Using Docker

This article explains how to set up a Zabbix monitoring system by installing Docker, pulling necessary images, creating storage volumes, and running containers for MySQL, Zabbix server, Java gateway, web interface, and agents, providing a fast, container‑based deployment solution.

Container Deployment · Linux · monitoring
0 likes · 8 min read
Architect
Apr 27, 2024 · Information Security

How to Stop Malicious API Calls: 8 Practical Defense Strategies

This article walks through eight concrete techniques—firewall rules, captchas, authentication checks, IP whitelists, HTTPS encryption, rate limiting, monitoring, and an API gateway—to prevent abusive requests from draining resources or compromising critical services.

API Security · Authentication · Captcha
0 likes · 11 min read
ITPUB
Apr 22, 2024 · Backend Development

How Meta Achieves Near‑Perfect Cache Consistency: Lessons from Polaris

This article explains Meta's approach to cache invalidation and consistency, detailing why ultra‑high consistency matters, how their Polaris monitoring system detects and resolves inconsistencies, and provides a simplified Python example that illustrates the underlying mechanisms and challenges.

Consistency · Distributed Systems · Meta
0 likes · 12 min read
21CTO
Apr 22, 2024 · Operations

Discover Guider: A Python‑Powered Linux Observability Suite with 150+ Commands

Guider, a Python‑based Linux observability suite created by Hyundai engineer Peace Lee, offers over 150 command‑line tools for real‑time performance monitoring, resource tracing, automated reporting, and visualizations, enabling developers to diagnose slow startups, crashes, GPU stalls, and system resets with microsecond precision.

CLI · Linux · Observability
0 likes · 7 min read
Liangxu Linux
Apr 15, 2024 · Operations

12 Essential Linux Commands to Monitor Memory Usage

This guide presents twelve practical Linux techniques—from basic commands like free and top to advanced tools such as Grafana with Prometheus—enabling administrators to comprehensively track memory consumption, identify bottlenecks, and maintain system stability and performance.

Memory · System Administration · commands
0 likes · 8 min read
dbaplus Community
Apr 14, 2024 · Backend Development

How Meta Reached 99.99999999% Cache Consistency and What You Can Learn

This article explains Meta's approach to cache invalidation and consistency, why ultra‑high consistency matters for user experience, the monitoring infrastructure they built, the Polaris system that detects and repairs inconsistencies, and provides a concrete Python‑style code example illustrating the problem and solution.

Cache · Consistency · Meta
0 likes · 13 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability
0 likes · 10 min read
JavaEdge
Apr 13, 2024 · Backend Development

Mastering System Performance: Metrics, Strategies, and Real‑World Implementation

This article explains why performance optimization is essential for growing systems, introduces key metrics such as response time and concurrency, outlines systematic thinking and concrete techniques—including caching, parallelism, and async processing—and demonstrates a live‑streaming case study with actionable solutions.

Caching · Concurrency · Optimization
0 likes · 15 min read
Ops Development Stories
Apr 12, 2024 · Cloud Native

Mastering etcd: Architecture, Monitoring & Performance Tuning

This article provides a comprehensive overview of etcd—including its origins, role in Kubernetes, version evolution, layered architecture, key terminology, operational commands, monitoring metrics, benchmarking procedures, disk‑performance testing, and tuning recommendations—for building reliable cloud‑native clusters.

benchmark · cloud-native · distributed storage
0 likes · 17 min read
Architecture & Thinking
Apr 10, 2024 · Operations

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides automatic monitoring, fault detection, and failover for Redis master‑slave clusters, enabling high availability by electing a new master when the original fails, using sdown/odown states, quorum voting, and pub/sub communication to keep services running with minimal downtime.
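
The sdown/odown distinction can be modelled in a few lines: each Sentinel forms its own subjective view (sdown), and the master is treated as objectively down (odown) once at least `quorum` Sentinels agree. This is a toy model of that rule only, not Sentinel's code; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MasterView:
    """Tracks which Sentinels currently consider one master down."""

    quorum: int
    # sentinel id -> True if that Sentinel reports the master as sdown
    sdown_reports: Dict[str, bool] = field(default_factory=dict)

    def report(self, sentinel_id: str, is_down: bool) -> None:
        """Record the latest subjective view from one Sentinel."""
        self.sdown_reports[sentinel_id] = is_down

    def is_odown(self) -> bool:
        """Objectively down once enough Sentinels agree it is down."""
        return sum(self.sdown_reports.values()) >= self.quorum


# Hypothetical deployment: three Sentinels with a quorum of two.
view = MasterView(quorum=2)
view.report("sentinel-1", True)
print(view.is_odown())  # one report is below the quorum
view.report("sentinel-2", True)
print(view.is_odown())  # two reports meet the quorum
```

Reaching odown only marks the master as failed; the election of a Sentinel to carry out the failover is a separate step that this sketch does not model.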

failover · high availability · monitoring
0 likes · 11 min read
Architect
Apr 8, 2024 · Backend Development

Mastering Batch Processing: Boost API Performance and Cut Overhead

This guide explains why batch processing is essential for API tuning and provides step‑by‑step techniques—including bulk database operations, request merging, pagination, parallel execution, caching, and monitoring—backed by concrete Java code samples and SQL queries to help engineers dramatically improve throughput and latency.

API optimization · Batch Processing · Caching
0 likes · 33 min read
Alibaba Cloud Native
Apr 8, 2024 · Cloud Native

How to Build a Global View for Multiple Prometheus Instances – Community and Alibaba Cloud Solutions

This article explains why a global view is needed when Prometheus metrics are scattered across many instances, compares community approaches such as Federation, Thanos, and Remote Write, and details Alibaba Cloud's Global Aggregation Instance and Remote Write solutions with configuration examples and a real‑world case study.

Federation · Global View · Prometheus
0 likes · 25 min read
DevOps Operations Practice
Apr 6, 2024 · Operations

Overview of Common DevOps Tools Used in Large Internet Companies

This article introduces the key DevOps tools—including CI/CD platforms, configuration‑management solutions, containerization technologies, monitoring and logging stacks, and infrastructure‑as‑code utilities—explaining their roles, features, and how they help streamline software delivery in modern enterprises.

Configuration Management · Containerization · DevOps
0 likes · 9 min read