Tagged articles
2195 articles
DevOps
Oct 26, 2023 · Operations

Design and Implementation of SLA for Object Storage Services

This article explains how to design SLA metrics for object storage services, describes the S3 protocol, proposes availability calculations, outlines monitoring and alerting rules, and provides practical implementation examples using s3cmd, Python boto, and Java SDK to ensure reliable cloud storage operations.

Object Storage · SLA · monitoring
0 likes · 16 min read
HomeTech
Oct 25, 2023 · Operations

How Metrics‑Driven Development Supercharges a Used‑Car Platform

This article examines how a metrics‑driven development approach, combined with observability tools like Prometheus, helped a large online used‑car marketplace improve system insight, accelerate business processes, and deliver measurable performance and efficiency gains across both customer‑facing and dealer‑facing operations.

Data-Driven Engineering · Metrics-Driven Development · Observability
0 likes · 16 min read
Efficient Ops
Oct 24, 2023 · Operations

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains how to use Prometheus to monitor business‑level metrics in a Kubernetes environment, covering observability fundamentals, metric definitions, metric types, exposing metrics via a /metrics endpoint, and practical Go code examples for defining, recording, and scraping custom metrics.

Go · Kubernetes · Metrics
0 likes · 11 min read
Java High-Performance Architecture
Oct 22, 2023 · Backend Development

How DynamicTp Turns Java ThreadPoolExecutor into a Real‑Time, Configurable Powerhouse

This article introduces DynamicTp, a Java framework that extends ThreadPoolExecutor with dynamic configuration, real‑time monitoring, and alerting, enabling developers to adjust thread‑pool parameters on the fly, integrate with popular configuration centers, and achieve high‑availability and scalability in microservice environments.

Configuration Center · Dynamic Thread Pool · DynamicTp
0 likes · 12 min read
DevOps Cloud Academy
Oct 18, 2023 · Operations

Comprehensive Overview of DevOps Tools for 2024

This article provides a detailed overview of the most widely used DevOps tools across categories such as version control, CI/CD, container orchestration, configuration management, infrastructure as code, monitoring, collaboration, artifact repositories, testing, security, deployment automation, serverless, and database management, helping practitioners choose the right solutions for their pipelines.

Collaboration · DevOps · automation
0 likes · 7 min read
Efficient Ops
Oct 15, 2023 · Databases

How to Diagnose and Fix Slow Redis Responses: A Step-by-Step Guide

This article walks through practical methods for troubleshooting slow service alerts, diagnosing Redis performance bottlenecks, and reproducing issues with local demos and load simulations, offering concrete metrics, command‑line checks, and mitigation strategies such as scaling, rate‑limiting, and pipeline optimization.

Operations · Performance · Redis
0 likes · 22 min read
JD Tech
Oct 13, 2023 · Operations

Implementing a Real-Time Pre-Alert Monitoring System to Improve Fund Trading System Stability

This article presents a practical pre‑alert monitoring solution for a high‑volume fund trading system, detailing how simple time‑based key‑point checks and targeted alerts reduce instant and end‑of‑day alarms, improve issue detection within 15 minutes, and enhance overall system stability and reconciliation efficiency.

fund‑trading · monitoring · pre‑alert
0 likes · 11 min read
JD Tech
Oct 11, 2023 · Fundamentals

Key Considerations for Building System Engineering Architecture: Design, Technology Selection, and Consensus

This article comprehensively discusses the essential aspects of constructing a system engineering architecture, emphasizing value‑first decision making, layered and DDD architectural patterns, technology selection criteria, exception handling, logging, monitoring, and the importance of establishing shared consensus among teams.

DDD · Exception Handling · Layered Architecture
0 likes · 26 min read
Liangxu Linux
Oct 10, 2023 · Operations

Master Kibana: Install, Configure, and Visualize Elasticsearch Data Step‑by‑Step

This guide walks you through installing Kibana, configuring its connection to Elasticsearch, creating index patterns, using Discover for searches, mastering Lucene‑based query syntax, building visualizations, assembling dashboards, and monitoring logs, all illustrated with clear screenshots and code examples.

Data visualization · Elasticsearch · Kibana
0 likes · 14 min read
Alibaba Cloud Native
Oct 10, 2023 · Operations

Mastering Memcached: Features, Use Cases, and Prometheus Monitoring

This article explains Memcached’s architecture, key characteristics, suitable and unsuitable scenarios, memory management and LRU mechanisms, version details, and provides a comprehensive guide to monitoring its performance and health using Prometheus and Alibaba Cloud ARMS dashboards.

Caching · Memcached · Operations
0 likes · 26 min read
JD Tech
Oct 10, 2023 · Operations

Technical Case Study of JDV Visual Dashboard Platform for the 618 Promotion

This article details how JDV, JD.com’s internal visual dashboard platform, tackled the massive data‑intensive 618 promotion by implementing real‑time updates, cross‑midnight count stops, request‑state control, heartbeat monitoring, proxy data sources, and a suite of developer tools to ensure stability, performance, and rapid feature delivery.

Data Platform · large scale · monitoring
0 likes · 18 min read
DataFunTalk
Oct 8, 2023 · Big Data

Full-Process DataOps Practices for Large-Scale Business Data Reporting at Baidu

This article reveals how Baidu implements end‑to‑end DataOps for its commercial data products, covering challenges of massive report generation, the design of a layered data architecture, platform‑wide automation, serverless deployment, risk control, monitoring, and optimization to achieve scalable, reliable data pipelines.

Big Data · DataOps · Optimization
0 likes · 13 min read
Efficient Ops
Sep 26, 2023 · Operations

Mastering Zabbix: From Installation to Advanced Monitoring and Automation

This comprehensive guide walks you through Zabbix monitoring concepts, reliability calculations, installation methods, web UI configuration, host and template management, custom monitoring, alert integration with OneAlert, Grafana visualization, distributed monitoring, SNMP support, and practical scripts for large‑scale server environments.

Grafana · SNMP · alerting
0 likes · 28 min read
Selected Java Interview Questions
Sep 24, 2023 · Operations

Comparison of Six Open-Source Log Management Tools

This article reviews six open‑source log management solutions—OpenObserve, Grafana Loki, SigNoz, Graylog, Syslog‑ng, and Highlight.io—detailing their features, advantages, and drawbacks to help engineers select the most suitable tool for observability, monitoring, and cost‑effective log handling.

Log Management · monitoring · open-source
0 likes · 15 min read
Alibaba Cloud Native
Sep 24, 2023 · Cloud Computing

Designing Highly Available Cloud‑Native Applications on Alibaba Cloud ACK

This article explains how to build robust, highly available cloud‑native applications on Alibaba Cloud Container Service for Kubernetes (ACK) by covering architecture principles, multi‑zone cluster design, Kubernetes HA features such as topology spread constraints and pod anti‑affinity, storage strategies, load‑balancing, virtual nodes, health probes, monitoring, and multi‑cluster deployment patterns.

ACK · Kubernetes · Pod AntiAffinity
0 likes · 35 min read
DevOps Coach
Sep 21, 2023 · Operations

What Is Observability (o11y) and Why It Matters for Modern Cloud‑Native Operations

The article explains the origins, common misconceptions, and a rigorous definition of observability (o11y), highlights its importance in cloud‑native environments, and describes how high‑cardinality, high‑dimensional telemetry enables effective debugging, troubleshooting, and performance analysis of modern distributed systems.

Debugging · cloud-native · monitoring
0 likes · 11 min read
Architect
Sep 19, 2023 · Big Data

How Tianyan Beats ELK: Inside a High‑Performance Distributed Log Service

This article analyzes the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan platform, and details Tianyan's architecture, data collection, high‑throughput transmission, storage, retrieval, resource isolation, dynamic cleanup, and best‑practice recommendations, complete with code examples and performance insights.

Big Data · Distributed Systems · ELK
0 likes · 30 min read
Zhuanzhuan Tech
Sep 19, 2023 · Operations

Design and Implementation of an Integrated Monitoring System at Zhuanzhuan Using Prometheus, Grafana, and M3DB

This article describes how Zhuanzhuan unified dozens of legacy monitoring tools into a single, all‑in‑one observability platform by adopting Prometheus + Grafana, extending the Prometheus client to push metrics to M3DB, automating Grafana dashboard creation, and building a custom alerting service to reduce operational complexity and improve visibility across business, middleware, and infrastructure services.

Grafana · M3DB · Observability
0 likes · 21 min read
DaTaobao Tech
Sep 18, 2023 · Databases

Comprehensive Approach to Slow SQL Detection and Governance

The Taobao platform’s slow‑SQL governance team implemented a comprehensive detection and governance pipeline—combining internal slow‑log tools, database slow‑query logs, and JVM‑Sandbox instrumentation to capture full SQL details, scoring high‑risk queries by execution time, scans, and standards violations, then prioritizing remediation through health scores, branch‑diff checks, and issue tracking—significantly cutting DB‑related incidents and boosting system stability.

Database · Governance · Performance
0 likes · 12 min read
Huolala Tech
Sep 14, 2023 · Operations

Designing an Effective UI for Monitoring Alerts: Insights from Huolala

This article shares Huolala's experience designing a unified monitoring platform UI, covering the evolution from open‑source dashboards to a fully self‑developed solution, simplification of PromQL, computed metrics, log and trace integration, and the challenges of alert configuration and visualization.

Observability · Operations · Prometheus
0 likes · 16 min read
IT Services Circle
Sep 14, 2023 · Backend Development

Key Techniques for Designing High‑Concurrency Systems

This article outlines essential architectural and operational strategies—including page static‑generation, CDN acceleration, caching layers, asynchronous processing, thread‑pool and MQ integration, sharding, connection pooling, read/write splitting, indexing, batch processing, clustering, load balancing, rate limiting, service degradation, failover, multi‑active deployment, stress testing, and monitoring—to build robust, high‑concurrency backend systems.

Backend Architecture · Caching · high concurrency
0 likes · 23 min read
MaGe Linux Operations
Sep 13, 2023 · Cloud Native

Mastering Prometheus Metrics: Counters, Gauges, Histograms & Summaries Explained

This article introduces the fundamentals of metrics in IT monitoring, explains the structure of metric data points, explores dimensional metrics, and provides an in‑depth guide to Prometheus metric types—Counters, Gauges, Histograms, and Summaries—along with practical code examples and usage considerations in cloud‑native environments.

Metrics · Prometheus · monitoring
0 likes · 19 min read
JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability
0 likes · 13 min read
Efficient Ops
Sep 12, 2023 · Operations

Understanding Prometheus Metric Types: Counters, Gauges, Histograms & Summaries

This article explains how metrics are used to monitor software performance, introduces basic metric components and dimensional metrics, compares Prometheus, OpenMetrics and OpenTelemetry standards, and provides detailed guidance on Prometheus metric types—Counter, Gauge, Histogram, and Summary—with code examples and query patterns.

Metrics · Observability · Prometheus
0 likes · 18 min read
Didi Tech
Sep 12, 2023 · Operations

Observability: Concepts, Challenges, and Didi’s Implementation

The article explains observability as the ability to infer any system state from external data, contrasts it with traditional monitoring, outlines challenges of high‑dimensional, high‑cardinality data and storage costs, and describes Didi’s hybrid MTL architecture that separates low‑ and high‑cardinality logs and metrics while linking them via TraceIDs to provide detailed, cost‑effective insight and streamlined debugging.

Didi · Logging · Tracing
0 likes · 9 min read
Architect
Sep 7, 2023 · Cloud Native

How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics

This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.

Kubernetes · Observability · Prometheus
0 likes · 18 min read
MaGe Linux Operations
Sep 2, 2023 · Operations

Top 5 Linux Monitoring Tools Every Ops Engineer Should Use

This article introduces five essential Linux monitoring tools—iotop, htop, IPTraf, Monit, and related resources—explaining how each helps operations engineers diagnose I/O, CPU, memory, and network issues in real time without a GUI, and offers guidance on installation and practical use cases.

IPTraf · Linux · Monit
0 likes · 6 min read
dbaplus Community
Aug 30, 2023 · Operations

How Weibo Scales to Hundreds of Millions: Building a Resilient Hybrid‑Cloud Architecture

This article outlines Weibo's massive user‑scale challenges and presents a comprehensive high‑availability solution that combines capacity planning, distributed caching, micro‑service isolation, cross‑language RPC, service‑mesh governance, multi‑datacenter disaster recovery, containerization, and hybrid‑cloud scaling to ensure reliable service delivery.

Service Mesh · hybrid cloud · microservices
0 likes · 15 min read
High Availability Architecture
Aug 30, 2023 · Backend Development

Diagnosing and Optimizing JVM Memory Issues in a Core Service

This article details the identification, analysis, and resolution of JVM memory problems in a core music metadata service, covering GC tuning, large‑object handling, fault‑tolerance strategies, custom Dubbo codec monitoring, and non‑intrusive memory object tracking to improve performance and stability.

Dubbo · JVM · Memory Optimization
0 likes · 14 min read
DeWu Technology
Aug 28, 2023 · Operations

Real-time Data Warehouse Business-Side Chaos Engineering Practice

The article describes how a real‑time data warehouse supporting ad‑delivery metrics adopts both technical and business‑side chaos‑engineering, using red‑blue team drills to inject faults, monitor indicator anomalies, and refine response procedures, thereby enhancing early risk detection, system resilience, and overall data stability for the advertising platform.

Backend Development · Data Quality · Data Warehousing
0 likes · 16 min read
JD Retail Technology
Aug 24, 2023 · Operations

High‑Availability Strategies for E‑commerce Large‑Scale Promotion Systems

This article outlines a comprehensive framework for preparing e‑commerce platforms for major sales events, covering the history of promotions, business models, system chain segmentation, stability goals, strategic planning, tactical measures, growth promotion, and reference resources to ensure high availability and reliable user experience.

e‑commerce · high availability · large‑scale promotion
0 likes · 19 min read
Sohu Tech Products
Aug 23, 2023 · Operations

Implementing Global Pulsar Client Monitoring with a SkyWalking Plugin

To give the business team a global, application‑level view of Pulsar performance, the team built a SkyWalking Java‑Agent plugin that automatically collects producer and consumer metrics from the Pulsar client, exposing latency, backlog and failure counts via Prometheus without modifying the client code.

Metrics · Prometheus · Pulsar
0 likes · 7 min read
Efficient Ops
Aug 23, 2023 · Operations

How to Diagnose High Load with Low CPU on Linux: Tools & Tips

This guide explains how to analyze Linux load situations—whether CPU and load are both high or CPU is low while load remains high—by using commands like top, vmstat, iostat, sar, and jstack, and provides practical troubleshooting steps for common I/O‑related issues.

CPU · Load · Operations
0 likes · 11 min read
dbaplus Community
Aug 22, 2023 · Operations

Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions

This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.

GPT · Observability · Operations
0 likes · 19 min read
Efficient Ops
Aug 22, 2023 · Operations

Persisting Prometheus Alertmanager Alerts with Alertsnitch, MySQL, and Grafana

This article explains how Prometheus stores alerts only as time‑series data, why that limits historical queries, and provides a complete open‑source solution using Alertmanager, Alertsnitch, MySQL, and Grafana to persist, query, and visualize alerts in production environments.

Alert Persistence · Alertmanager · Grafana
0 likes · 10 min read
Huolala Tech
Aug 18, 2023 · Operations

Beyond System Metrics: Building Effective Business Monitoring for Pricing Services

Facing unpredictable software behavior, the article explains why traditional system‑level monitoring often misses critical business issues, especially in complex pricing services, and presents a comprehensive approach that combines result (black‑box) and process (white‑box) monitoring, practical metrics, and actionable recommendations to improve observability and reduce operational risk.

Observability · Operations · business metrics
0 likes · 14 min read
dbaplus Community
Aug 14, 2023 · Operations

Designing Business‑Focused Monitoring for Banking Systems: Metrics, Alerts, and Implementation Challenges

The article outlines a practical framework for business‑level monitoring in banking systems, describing three evolution stages, key metrics such as transaction success rates and volume spikes, concrete alert rules, and the technical challenges of data collection, standardization, and massive parameter management.

Metrics · Operations · alerting
0 likes · 14 min read
DeWu Technology
Aug 14, 2023 · Operations

Capital Loss Prevention Practices and Technical System

Dewu’s capital‑loss prevention framework embeds risk assessment and technical safeguards—such as idempotency, distributed consistency, and active‑active multi‑region design—into architecture, organizes three defensive lines (development, QA, SRE), and employs real‑time, near‑real‑time, and offline verification plus regular drills, while advancing automated analysis and intelligent scaling.

Data Consistency · SRE · financial loss prevention
0 likes · 10 min read
Full-Stack DevOps & Kubernetes
Aug 10, 2023 · Operations

How Kubernetes Powers Modern DevOps Automation and Operations

By integrating Kubernetes with DevOps practices, teams can automate deployment pipelines, achieve dynamic resource allocation, centralize monitoring with tools like Prometheus and Grafana, and treat infrastructure as code, resulting in faster, higher-quality software delivery and improved collaboration between development and operations.

DevOps · Kubernetes · Operations
0 likes · 7 min read
Ctrip Technology
Aug 3, 2023 · Operations

Intelligent Anomaly Detection for Ctrip Operations: LSTM Forecasting, Trend Analysis, Adaptive Thresholds, and Periodic Anomaly Filtering

The article describes Ctrip's AIOps approach to improving alert quality by combining statistical methods and machine‑learning models such as LSTM, trend analysis, adaptive threshold calculation, and dynamic‑time‑warping based periodic anomaly detection, achieving significant gains in precision and fault‑recall rates.

Anomaly Detection · LSTM · Time-series
0 likes · 12 min read
HelloTech
Aug 1, 2023 · Cloud Native

Elastic Scaling Practices in Cloud‑Native Kubernetes Environments

To overcome native HPA limits and business‑specific constraints in a fully containerized, cloud‑native Kubernetes environment, we implemented a dual‑threshold water‑level and scheduled scaling engine, hybrid‑cloud ClusterAutoScale, mixed‑deployment resource prioritization, and comprehensive Prometheus‑based observability, achieving higher utilization, lower costs, and a roadmap toward deeper optimization and AIOps.

Auto Scaling · Kubernetes · cloud-native
0 likes · 10 min read
Spring Full-Stack Practical Cases
Jul 27, 2023 · Backend Development

How to Set Up and Secure Spring Boot Admin Server & Client with Dynamic Logging

This guide walks through setting up a Spring Boot Admin server and client, adding security, configuring logging, displaying client IPs, and dynamically adjusting log levels via the SBA UI, providing complete Maven dependencies, Java configuration classes, and YAML settings for a secure, observable Spring Boot ecosystem.

Logging · Security · Spring Boot
0 likes · 9 min read
Open Source Linux
Jul 27, 2023 · Operations

17 Essential Linux Ops Tricks to Boost Your Productivity

This article compiles seventeen practical Linux administration techniques—from batch file handling and directory checks to log analysis, disk monitoring, firewall rules, and network capture—each illustrated with ready‑to‑run shell commands and concise explanations for sysadmins.

automation · firewall · monitoring
0 likes · 8 min read
Tech Architecture Stories
Jul 23, 2023 · Backend Development

Beyond Scale: Rethinking Architecture Boundaries for Massive Services

This article reflects on years of designing large‑scale backend systems at Tencent, discussing how to define clear architecture boundaries, ensure high availability, integrate diverse technologies, and use observability and monitoring to continuously evolve and improve massive service architectures.

Distributed Systems · Observability · System Design
0 likes · 25 min read
Liangxu Linux
Jul 22, 2023 · Operations

17 Essential Linux Sysadmin Commands to Boost Productivity

This article compiles 17 practical Linux operation tricks—from file searching and batch extraction to disk monitoring, log analysis, and firewall scripting—providing sysadmins with ready-to-use command snippets that can streamline daily tasks and potentially earn a raise.

Bash · Scripting · automation
0 likes · 8 min read
Test Development Learning Exchange
Jul 18, 2023 · Operations

Common System Performance Issues and Their Solutions

This article enumerates typical system performance problems such as slow response time, insufficient throughput, resource bottlenecks, database slowness, memory leaks, platform differences, network latency, security overhead, scalability limits, and provides practical optimization and mitigation strategies for each.

Scalability · Systems · Troubleshooting
0 likes · 7 min read
Test Development Learning Exchange
Jul 17, 2023 · Operations

Comprehensive Guide to Performance Testing Parameters, Metrics, and Tool Selection

This article explains key performance testing parameters such as concurrent users, TPS, response time, virtual users, and data volume, outlines essential monitoring metrics, details preparation steps and simple API testing procedures, and compares popular load‑testing tools like JMeter, Locust, and LoadRunner.

Response Time · monitoring · performance testing
0 likes · 12 min read
21CTO
Jul 17, 2023 · Big Data

How WeChat Cut Query Latency from Seconds to 100 ms with Druid Optimizations

This case study explains how the WeChat multi‑dimensional monitoring platform identified performance bottlenecks in its Druid‑based data layer, analyzed user query patterns, and applied sub‑query splitting, Redis caching, and segment size reductions to achieve over 85% cache‑hit rates and bring average query latency down to around 100 ms.

Big Data · Caching · Druid
0 likes · 13 min read
Liangxu Linux
Jul 16, 2023 · Operations

Essential Ops Checklist: Prevent Data Loss, Secure Servers, and Optimize Performance

This article compiles practical operations guidelines covering safe testing, rigorous confirmation before commands, limiting multi‑person access, mandatory backups, careful use of destructive commands, SSH hardening, firewall rules, fine‑grained permissions, continuous monitoring, performance tuning steps, and a disciplined mindset to avoid costly incidents.

Performance tuning · Security · backup
0 likes · 10 min read
Qunar Tech Salon
Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, detailing its background, data-driven fault categorization, architecture—including trace, runtime, middleware, and event analysis modules—and demonstrates its high accuracy and practical impact on reducing incident resolution times across microservice services.

DevOps · Observability · Operations
0 likes · 20 min read
Didi Tech
Jul 11, 2023 · Operations

DevOps Practices and Challenges at Didi Ride‑Hailing: From Development to Operations

Didi’s ride‑hailing R&D team addresses efficiency and stability challenges of a large micro‑service ecosystem by unifying a Go stack, common framework, and data models, using eBPF traffic recording for automated regression testing, and applying AIOps alert filtering, knowledge‑graph root‑cause analysis, and a localization robot for rapid fault recovery, while targeting full CI/CD automation with static analysis, service‑mesh observability, and chaos engineering.

CloudNative · aiops · microservices
0 likes · 22 min read
JD Retail Technology
Jul 11, 2023 · Operations

Technical Strategies for Ensuring System Stability During the 618 Promotion

The article analyzes the importance of the 618 sales event, identifies factors that threaten system stability such as traffic spikes, massive data, complex scenarios, long delivery chains and low tolerance, and proposes comprehensive application, storage, and operational measures—including unitization, monitoring, logging, fast‑fail, rate‑limiting, degradation, database and cache designs, and emergency processes—to guarantee reliable service during the promotion.

Scalability · high availability · large‑scale promotion
0 likes · 14 min read
Beijing SF i-TECH City Technology Team
Jul 10, 2023 · Mobile Development

Mobile Application Quality System – Standard Operating Procedure (SOP)

This document outlines a comprehensive Standard Operating Procedure for building and maintaining a mobile application quality system, covering background, pre‑emptive planning, coding standards, branch management, code review, AI‑assisted tools, monitoring, issue handling, and continuous improvement to ensure stable, high‑quality mobile products.

AI tools · Mobile · SOP
0 likes · 27 min read
DataFunTalk
Jul 9, 2023 · Operations

Building High‑Performance Observability Data Pipelines with Vector and Honghu

This article explains the concepts and importance of observability, introduces the Vector data‑pipeline tool and its architecture, demonstrates how to configure sources, transforms and sinks, and shows how to integrate Vector with the Honghu platform to build a complete, real‑time monitoring solution for modern distributed systems.

Big Data · Honghu · Observability
0 likes · 33 min read
Liangxu Linux
Jul 9, 2023 · Backend Development

From Monolith to Microservices: A Practical Evolution Blueprint

This article walks through the step‑by‑step transformation of a simple online supermarket from a single‑node monolith to a fully fledged microservice architecture, highlighting the motivations, common pitfalls, component choices, monitoring, tracing, logging, resilience patterns, testing strategies, and the trade‑offs of frameworks versus service mesh.

Backend Architecture · Service Mesh · distributed tracing
0 likes · 24 min read
DevOps Cloud Academy
Jul 9, 2023 · Cloud Native

Designing Scalable Kubernetes Applications: Best Practices

This article outlines comprehensive best‑practice guidelines for building Kubernetes applications, covering scalability design, containerization, pod scope, configuration management, health probes, deployments, service discovery, storage, monitoring, security, and CI/CD integration to achieve robust, highly available workloads.

ConfigMap · Kubernetes · Security
0 likes · 9 min read
Open Source Linux
Jul 4, 2023 · Operations

Master Redis Monitoring, Migration, and Cluster Management with Prometheus and CacheCloud

This guide walks through essential Redis operations, covering real‑time monitoring with the INFO command and Prometheus‑compatible exporters, data migration using Redis‑shake, consistency verification via Redis‑full‑check, and comprehensive cluster management with CacheCloud, providing practical tools for reliable Redis administration.

Data Migration · Operations · Prometheus
0 likes · 11 min read
ITPUB
Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.

Reliability · automation · disaster-recovery
0 likes · 30 min read
Open Source Linux
Jun 30, 2023 · Cloud Native

Essential Kubernetes Tools to Boost Your DevOps Workflow

This article reviews a curated set of open‑source Kubernetes tools—including Helm, Flagger, Kubewatch, Gitkube, kube‑state‑metrics, Kamus, Untrak, Scope, Dashboard, Kops, cAdvisor, Kubespray, K9s, Kubetail, PowerfulSeal, and Popeye—that enhance management, security, monitoring, and deployment within DevOps pipelines.

Security · cloud-native · monitoring
0 likes · 11 min read
dbaplus Community
Jun 28, 2023 · Operations

Identify and Fix System Performance Bottlenecks: Key Metrics and Optimization

The article outlines common system performance bottlenecks such as CPU, memory, disk I/O, network, exceptions, and databases, explains how to measure response time, TPS, and resource utilization, and provides a step‑by‑step bottom‑up and top‑down approach for testing, diagnosing, and optimizing Java‑based services.

Optimization · bottleneck · monitoring
0 likes · 11 min read
Top Architect
Jun 27, 2023 · Databases

Redis Performance Degradation: Root Causes and Optimization Techniques

This article explains how to benchmark Redis latency, identify common reasons for slowdowns such as high‑complexity commands, big keys, concentrated expirations, memory limits, fork overhead, swap usage, and CPU binding, and provides detailed configuration and operational steps to monitor and resolve each issue.

AOF · Latency · Memory
0 likes · 34 min read
Programmer DD
Jun 26, 2023 · Operations

What’s New in Grafana 10? Explore Correlations, Scenes, and Powerful New Panels

Grafana 10 introduces a suite of enhancements—including Correlations for cross‑data‑source linking, the Scenes front‑end library for building stunning dashboards, new Canvas, Trends, and Datagrid panels, CSV drag‑and‑drop support, sub‑folder organization, and improved data‑source selection—aimed at boosting analysis, collaboration, and efficiency for monitoring teams.

Grafana · New Features · Operations
0 likes · 7 min read
Architect
Jun 23, 2023 · Big Data

Optimizing Query Performance in WeChat's Multi‑Dimensional Monitoring Platform

This article details how the WeChat multi‑dimensional monitoring platform reduced average query latency from over 1000 ms to around 100 ms by analyzing user query patterns, redesigning the Druid data layer, splitting sub‑queries, introducing Redis caching, and employing sub‑dimension tables, achieving cache hit rates above 85%.

Druid · Performance · WeChat
0 likes · 13 min read
MaGe Linux Operations
Jun 22, 2023 · Cloud Native

Essential Open‑Source Kubernetes Tools to Supercharge Your DevOps

This article surveys a curated collection of open‑source Kubernetes utilities—including Helm, Flagger, Kubewatch, Gitkube, kube‑state‑metrics, Kamus, Untrak, Scope, Dashboard, Kops, cAdvisor, Kubespray, K9s, Kubetail, PowerfulSeal and Popeye—detailing their roles in deployment, monitoring, security, and cluster management for modern DevOps workflows.

Kubernetes · Security · monitoring
0 likes · 15 min read
Open Source Linux
Jun 21, 2023 · Cloud Native

From Monolith to Microservices: A Real‑World Journey and Lessons Learned

An online supermarket startup evolves its simple monolithic website into a fully distributed microservice architecture, detailing each transformation stage, the challenges encountered—such as code duplication, database bottlenecks, deployment complexity—and the solutions like service decomposition, monitoring, tracing, circuit breaking, and service mesh.

Circuit Breaker · Service Mesh · distributed-systems
0 likes · 23 min read
Baidu Geek Talk
Jun 19, 2023 · Operations

How Baidu’s Tianyan Log Service Overcomes ELK’s Scaling and Performance Limits

This article examines the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan solution, details Tianyan's architecture (including Ingest, Store, Consumer, Elastic Agent, Fleet, APM, Beats, and Disruptor‑based high‑throughput pipelines), and covers resource isolation, dynamic cleanup, and best‑practice recommendations for building a scalable, low‑latency log platform.

Distributed Systems · Elastic Stack · Log Management
0 likes · 26 min read
vivo Internet Technology
Jun 14, 2023 · Backend Development

Stability Practices for Vivo Account System: Service Governance, Data Architecture, and Monitoring

Vivo’s account platform, serving 270 million users and over 100 billion daily requests, achieves high‑performance stability through disciplined service splitting, hierarchical dependency control, layered caching and sharding strategies, and comprehensive multi‑layer monitoring that together ensure scalability, availability, and rapid fault diagnosis.

Backend · Caching · Database
0 likes · 24 min read
JD Cloud Developers
Jun 14, 2023 · Operations

How to Ensure System Stability During Mega Sales Events like 618

This article examines the technical and operational challenges of the 618 shopping festival, presenting data‑driven insights and detailed strategies—including modular deployment, monitoring, logging, fast‑failure, rate limiting, database and cache optimizations, and emergency response plans—to help teams maintain system stability under massive traffic spikes.

Operations · Scalability · large‑scale promotion
0 likes · 13 min read
NetEase Cloud Music Tech Team
Jun 12, 2023 · Frontend Development

Design and Architecture of Corona: NetEase Cloud Music Multi‑Platform Front‑End Monitoring System

Corona is NetEase Cloud Music’s unified, cross‑platform front‑end monitoring system that ingests logs from Web, React Native, Node.js, Android, iOS, Flutter and Windows CEF, enriches them, routes them through real‑time anomaly and performance pipelines, stores them in HBase, and offers customizable alerts, de‑obfuscation, AI‑assisted analysis, and extensible reporting to ensure rapid fault detection and remediation across the organization.

Frontend · Logging · Performance
0 likes · 17 min read
DevOps Operations Practice
Jun 11, 2023 · Operations

Practical Linux Administration Tools for System Monitoring and Management

This article presents a curated list of useful Linux command‑line tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap and Httperf—along with installation commands and brief usage notes to help system administrators monitor performance, security and resources effectively.

Linux · monitoring · tools
0 likes · 12 min read
Tencent Cloud Developer
Jun 8, 2023 · Operations

Stability Governance in Tencent Search: Architecture, Incident Management, and Automation

The article outlines Tencent Search’s stability governance, detailing a multi‑layered availability architecture, disaster‑recovery mechanisms, precise monitoring, rapid emergency workflows, pre‑release interception, extensive automation, and a collaborative governance model that together enhance system resilience, incident detection, and swift remediation.

Incident Response · availability architecture · monitoring
0 likes · 28 min read
Efficient Ops
Jun 7, 2023 · Artificial Intelligence

How Guangdong Mobile Scaled AIOps: From Manual Ops to Intelligent Automation

This article details Guangdong Mobile's evolution of IT systems and operations, explains the four‑domain architecture, chronicles the AIOps adoption timeline, showcases intelligent anomaly detection, change assessment, fault diagnosis, and operation robots, and shares practical promotion methods and future outlook for AI‑driven IT operations.

Artificial Intelligence · Fault Diagnosis · IT Operations
0 likes · 19 min read
JD Tech
Jun 7, 2023 · Operations

Practical Guide to Achieving High Availability in Software Delivery

This article explains the concept of high availability, outlines the challenges of collaborative delivery, architectural design, coding practices, secure release, and deployment operations, and provides concrete steps, process standards, emergency plans, and self‑check tools to ensure reliable, fault‑tolerant software systems.

Collaboration · architecture · deployment
0 likes · 13 min read