Tagged articles
2195 articles
Wukong Talks Architecture
Apr 4, 2024 · Operations

Cloud Stability Governance: Frontend and Backend Strategies, Deployment, and Monitoring Practices

This article presents a comprehensive view of cloud stability governance from both front‑end and back‑end perspectives, detailing system architecture, micro‑frontend integration, CI/CD deployment pipelines, SLB forwarding and health‑check configurations, monitoring dashboards, UI automation testing, and the resulting operational improvements.

SLB · Stability · ci/cd
0 likes · 13 min read
NetEase Cloud Music Tech Team
Apr 1, 2024 · Industry Insights

Why Shifting Testing Left Boosts Quality: Lessons from Cloud Music

The article analyzes the concept of test left‑shift, outlining its theoretical benefits and drawbacks, sharing practical pain points from NetEase Cloud Music, and presenting a comprehensive pre‑, during‑, and post‑shift automation and monitoring strategy to improve software quality and delivery speed.

DevOps · automation · monitoring
0 likes · 13 min read
Alibaba Cloud Developer
Apr 1, 2024 · Operations

How We Achieved End-to-End Cloud Stability with Micro Frontends and Automated Deployments

This article details a comprehensive, front‑and‑back‑end approach to cloud stability, covering system architecture across private and public clouds, micro‑frontend integration, CI/CD pipelines, SLB routing, health‑check configurations, monitoring dashboards, data reconciliation, UI automation testing, and the resulting improvements in observability, gray‑release, rollback, and incident reduction.

Micro Frontends · SLB · automation
0 likes · 14 min read
Efficient Ops
Mar 31, 2024 · Operations

Why Most Alerts Fail and How to Design Actionable Monitoring

Most system alerts are poorly designed, flooding engineers with noise; this article explains the essence of alerts, distinguishes business rule vs reliability monitoring, outlines effective metrics and strategies, and presents simple anomaly-detection algorithms to create actionable, high-quality alerts.

Anomaly Detection · alert design · monitoring
0 likes · 21 min read
Architecture Digest
Mar 28, 2024 · Operations

A Comprehensive Overview of Monitoring Systems: Fundamentals, Popular Open‑Source Solutions, and Selection Guidance

This article systematically introduces monitoring fundamentals, core concepts, and architecture, then reviews three widely used open‑source monitoring tools (Zabbix, Open‑Falcon, and Prometheus), detailing their components, advantages, and disadvantages, and offers practical advice for selecting the most suitable solution.

Open-Falcon · Operations · monitoring
0 likes · 17 min read
Efficient Ops
Mar 27, 2024 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to design a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines essential system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, and related open‑source tools.

Prometheus · USE method · monitoring
0 likes · 14 min read
ITPUB
Mar 27, 2024 · Backend Development

How Instagram Scaled to 14 Million Users with Just Three Engineers

This article details how Instagram grew from zero to 14 million users in just over a year with only three engineers, by applying three core principles and relying on an AWS‑based tech stack that covers frontend, load balancing, backend, PostgreSQL sharding, S3 storage, Redis caching, asynchronous task queues, and comprehensive monitoring.

Backend · PostgreSQL · Redis
0 likes · 9 min read
DevOps Operations Practice
Mar 25, 2024 · Operations

How to Monitor MySQL with Prometheus and Grafana

This tutorial explains how to install the MySQL Exporter, configure Prometheus to scrape MySQL metrics, set up Grafana dashboards for visualization, and define alerting rules for common MySQL performance indicators, providing a complete end‑to‑end monitoring solution.

Exporter · Grafana · Metrics
0 likes · 5 min read
Selected Java Interview Questions
Mar 25, 2024 · Databases

Redis Best Practices: Memory Management, Performance Tuning, Reliability, Operations, and Security

This comprehensive guide outlines practical Redis best practices covering memory optimization, key design, data type selection, performance enhancements, high‑availability deployment, operational safeguards, security hardening, and monitoring to help engineers build stable, efficient caching solutions.

Best Practices · Caching · Reliability
0 likes · 15 min read
DataFunSummit
Mar 22, 2024 · Artificial Intelligence

Risk Control Model Construction for Online Small Loans: Pre‑loan, In‑loan, Post‑loan and Monitoring

This article presents a comprehensive overview of risk control model building for online small‑loan scenarios, covering pre‑loan, in‑loan and post‑loan stages, the associated data pipelines, model deployment strategies, optimization attempts, and monitoring frameworks to ensure accuracy, stability and effectiveness.

Credit Scoring · data pipeline · loan management
0 likes · 16 min read
dbaplus Community
Mar 18, 2024 · Operations

How to Build a Resilient, High‑Traffic Web Infrastructure: A Step‑by‑Step Ops Guide

This guide outlines a complete, practical workflow for acquiring multiple domains, configuring DNS, deploying CDN and image caches, selecting data‑center locations, setting up redundant servers, implementing monitoring, handling DDoS attacks, planning capacity, securing systems, and organizing an operations team to ensure high availability for large‑scale web services.

CDN · Server Configuration · Web infrastructure
0 likes · 12 min read
Efficient Ops
Mar 18, 2024 · Operations

How to Implement Fault Self‑Healing for Scalable Operations

This article explains why low‑disk alerts demand automation, outlines the concept of fault self‑healing versus manual response, and provides practical guidelines—including standards, monitoring dimensions, CMDB integration, script execution tools, and notification channels—to build a reliable self‑healing system for large‑scale environments.

CMDB · fault self-healing · monitoring
0 likes · 10 min read
Efficient Ops
Mar 17, 2024 · Operations

How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This article explains how to design and implement a comprehensive Prometheus‑based monitoring and alerting solution for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, exporter deployment, alert rule design, and practical examples with code snippets.

alerting · monitoring
0 likes · 18 min read
Architect
Mar 16, 2024 · Operations

How Unified Alert Convergence Can Slash Monitoring Noise and Boost MTTA/MTTR

This article analyzes the shortcomings of fragmented monitoring systems, defines key metrics such as MTTA and MTTR, proposes a unified alert convergence architecture using Redis delayed queues, and details design, implementation, and future AI‑enhanced improvements to reduce alert fatigue and accelerate incident response.

MTTA · MTTR · Operations
0 likes · 22 min read
Bitu Technology
Mar 15, 2024 · Artificial Intelligence

Monitoring Quality Issues in Tubi’s Recommendation System

This article explains how Tubi monitors the quality of its recommendation system by identifying potential failure points, tracking key data streams such as model input, final recommendation output, and training data, and designing a scalable, real‑time monitoring solution with clear protocols and extensible metrics.

Data Quality · Scalability · machine learning
0 likes · 11 min read
Practical DevOps Architecture
Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Observability · Operations
0 likes · 4 min read
DevOps Operations Practice
Mar 14, 2024 · Operations

Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

Federation · Performance · Prometheus
0 likes · 6 min read
Linux Cloud Computing Practice
Mar 13, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, helping professionals streamline automation, monitoring, configuration, and deployment tasks and improve overall system reliability.

Operations · infrastructure · monitoring
0 likes · 6 min read
Linux Code Review Hub
Mar 5, 2024 · Operations

Why Did Opening a Log with Vim Kill the Java Process?

A port alarm revealed a missing Java process, which was later traced to an OOM kill triggered by vim loading a 37 GB nginx log into an 8 GB container, illustrating how editor behavior and Linux's OOM killer can unexpectedly terminate critical services.

Container · Linux · OOM killer
0 likes · 7 min read
Architecture & Thinking
Mar 5, 2024 · Databases

How Database Middleware Solves High‑Traffic Challenges: Connection Pools, Sharding, and More

This article examines how database middleware tackles the demanding needs of large‑scale internet services by providing centralized connection‑pool management, transparent read‑write splitting, diverse load‑balancing algorithms, sharding support, automatic failover, security controls, comprehensive monitoring, and flexible backup‑recovery mechanisms.

Connection Pool · MySQL · Sharding
0 likes · 9 min read
JD Tech
Feb 28, 2024 · Databases

Detecting and Monitoring Database Deadlocks with EasyBI: A Practical Case Study

This article recounts how a production database deadlock was uncovered during testing, explains the use of the EasyBI monitoring tool to collect and visualize error and claim statistics, and shares the step‑by‑step configuration, analysis, and lessons learned for preventing similar issues in future systems.

Database · EasyBI · Error Handling
0 likes · 8 min read
Huolala Tech
Feb 28, 2024 · Operations

How Huolala Created an Intelligent Automated Testing System to Raise Coverage & Cut Regression Costs

Facing rapid business expansion, Huolala’s quality assurance team tackled redundant code, high regression costs, and lack of coverage metrics by designing an intelligent automated testing framework that analyzes effective code, provides smart test case recommendations, visualizes progress, and integrates monitoring, resulting in significant coverage improvements and efficiency gains across services.

JaCoCo · automated testing · code coverage
0 likes · 25 min read
DevOps Operations Practice
Feb 16, 2024 · Operations

Linux, Networking, Container, and Monitoring Interview Questions

This article compiles a comprehensive set of interview-style questions covering Linux file handling, CPU metrics, link types, TCP handshakes, process vs thread, TCP/UDP differences, DDoS mitigation, Keepalived operation, TIME_WAIT optimization, container networking, Kubernetes components, deployment strategies, monitoring concepts, Prometheus architecture, and common web‑site operational issues.

Containers · Interview · Linux
0 likes · 4 min read
MaGe Linux Operations
Feb 14, 2024 · Operations

Master Linux Performance: Key Factors and Essential Optimization Tools

This article examines the various hardware and OS resources that affect Linux performance—including CPU, memory, disk I/O, and network bandwidth—then details practical optimization techniques and essential monitoring tools such as vmstat, iostat, free, sar, and netstat to diagnose and improve system efficiency.

Linux · Optimization · monitoring
0 likes · 16 min read
MaGe Linux Operations
Feb 7, 2024 · Databases

How to Build a Real‑Time Data Guard System for Dameng Database

This guide walks through setting up a Dameng data‑guard service using a primary, standby, and monitor server, covering data preparation, configuration of dm.ini, dmmal.ini, dmarch.ini, dmwatcher.ini, starting services, OGUID setup, mode switching, and monitoring to achieve high‑availability replication.

Dameng · Data Guard · Database Configuration
0 likes · 12 min read
DevOps Cloud Academy
Feb 2, 2024 · Operations

DevOps Tools for 2024: A Comprehensive Overview

An extensive overview of essential DevOps tools for 2024, covering categories such as version control, CI/CD, container orchestration, configuration management, infrastructure as code, monitoring, collaboration, artifact repositories, testing, security, deployment automation, serverless platforms, and database management to guide effective tool selection.

DevOps · automation · ci/cd
0 likes · 7 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

Incident Response · emergency planning · fault handling
0 likes · 14 min read
IT Services Circle
Jan 25, 2024 · Operations

How to Resolve Online Message Queue Backlog Issues

This article explains why message queues can become backlogged, identifies producer and consumer causes, and provides practical strategies—including adding consumers, increasing queue capacity, optimizing consumption logic, implementing failure handling, and rapid remediation steps—to quickly resolve backlog in production environments.

Backlog · Message queue · Operations
0 likes · 7 min read
DevOps
Jan 23, 2024 · Operations

Collection of Bash Scripts for Server Monitoring, Automation, and Deployment

This article provides a curated set of Bash scripts covering MySQL replication monitoring, directory change detection, bulk user creation, website health checks, remote command execution, LNMP stack deployment, server resource reporting, high‑resource process identification, and automated deployment of Java and PHP projects, offering practical automation tools for system administrators.

Bash · Operations · automation
0 likes · 12 min read
Efficient Ops
Jan 22, 2024 · Operations

Mastering Monitoring: Black‑Box vs White‑Box, Metrics, and Prometheus in Practice

This guide explains monitoring fundamentals, clears common misconceptions, compares black‑box and white‑box approaches, outlines key metrics such as latency, traffic, errors and saturation, and provides a deep dive into Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk monitoring.

Prometheus · cloud-native · monitoring
0 likes · 15 min read
Efficient Ops
Jan 22, 2024 · Operations

How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency

At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.

Cost Reduction · DevOps · OpenTelemetry
0 likes · 8 min read
Efficient Ops
Jan 21, 2024 · Operations

Essential Bash Scripts for Efficient Server Operations and Automation

This article compiles a set of practical Bash scripts that cover MySQL replication monitoring, directory change detection with real‑time sync, bulk user creation, website health checks, remote command execution, one‑click LNMP deployment, resource usage reporting, high‑CPU process identification, and automated Java/Tomcat and PHP project deployments.

Bash · deployment · monitoring
0 likes · 12 min read
Rare Earth Juejin Tech Community
Jan 18, 2024 · Frontend Development

Comprehensive Guide to Front-End Performance Optimization

This article systematically outlines common front‑end performance optimization techniques, explains key web performance metrics such as Speed Index, FCP, CLS, LCP and TBT, and provides practical strategies for resource compression, network and code optimization, as well as monitoring and measurement best practices.

Optimization · Resource Compression · monitoring
0 likes · 20 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

Incident Management · SRE · fault handling
0 likes · 14 min read
dbaplus Community
Jan 8, 2024 · Backend Development

How We Built an Automated Payment Channel Management System with Redis and Prometheus

To handle growing payment traffic and unreliable third‑party gateways, the team at Zhuanzhuan designed an automated payment‑channel management platform that uses a custom Redis‑based time‑series store, Prometheus monitoring, and a sliding‑window failure‑rate algorithm to detect, alert, and eventually auto‑switch faulty channels.

Prometheus · automation · fault-tolerance
0 likes · 10 min read
Zhuanzhuan Tech
Jan 5, 2024 · Operations

Building an Integrated Monitoring Platform: Architecture, Implementation, and Lessons from Zhuanzhuan

This article presents a detailed case study of how Zhuanzhuan designed and implemented a unified monitoring platform that combines business services, middleware, and operations resources. The team selected Prometheus and M3DB, automated Grafana dashboards, created a low‑noise alerting system, and achieved large‑scale observability with significant cost and efficiency gains.

M3DB · Operations · Prometheus
0 likes · 21 min read
Zhuanzhuan QA
Jan 4, 2024 · Operations

Automated Error Log Cleanup and Monitoring Mechanism for QA

This article describes how a QA team collaborated with developers to create an automated error‑log cleanup and monitoring system, detailing the background, offline follow‑up process, identified pain points, the design of a scheduled statistics solution, platform capabilities, observed benefits, and future improvement plans.

Error Logging · QA · monitoring
0 likes · 8 min read
Zhuanzhuan Tech
Jan 4, 2024 · Backend Development

Three‑Step Strategy for Identifying and Removing Zombie Services, Methods, and Component Dependencies

This article presents a detailed three‑step plan used by Zhuanzhuan to detect and eliminate zombie services, unused code methods, and obsolete component dependencies through monitoring, static analysis with Spoon, and Java‑agent based runtime tracing, achieving significant resource savings and improved code health.

backend optimization · java · monitoring
0 likes · 13 min read
Tencent Cloud Developer
Jan 3, 2024 · Backend Development

Exception Handling: Requirements, Modeling, and Best Practices in Backend Development

The article outlines backend exception‑handling best practices, detailing business requirements such as memory‑safe multithreaded throws, clear separation of concerns, framework fallback strategies, simple macro‑based APIs, unified error‑code monitoring, rich debugging information, extensible type‑erased models, and appropriate handling of critical, recoverable, and checked exceptions across development and production environments.

C++ · Debugging · Exception Handling
0 likes · 28 min read
Liangxu Linux
Jan 2, 2024 · Information Security

How to Monitor Linux User Activity with Built‑In Commands and Auditd

This guide explains how to track Linux user activity and system events using native commands such as who, w, last, ps, ss, journalctl, and the auditd framework, providing step‑by‑step examples and advanced auditing techniques for security and compliance.

Auditd · commands · monitoring
0 likes · 7 min read
Goodme Frontend Team
Jan 1, 2024 · Frontend Development

How Guming’s Front‑End Data Center Enables Real‑Time Monitoring for Web, Mini‑Programs, Flutter & Node.js

Guming’s Front‑End Data Center integrates monitoring, performance, logging, and analytics for web, mini‑programs, Flutter clients, and Node.js services, offering real‑time alerts, high availability, sampling, multi‑channel data pipelines, custom charting, and detailed CPU/GC profiling to streamline issue diagnosis and business insights.

Data Platform · Frontend · monitoring
0 likes · 10 min read
Architecture & Thinking
Dec 25, 2023 · Databases

How to Detect, Analyze, and Prevent Redis Hot Keys to Avoid Outages

This article explains what Redis hot keys are, the scenarios that generate them, their risks, and provides practical monitoring methods and mitigation strategies—including cache pre‑warming, distributed caching, rate limiting, and secondary caches—to keep production systems stable.

Hot Key · Performance · fault tolerance
0 likes · 11 min read
Weimob Technology Center
Dec 22, 2023 · Big Data

Unlocking Elasticsearch at Scale: Real‑World Practices from Weimob

The Weimob Technology Salon session on "Elasticsearch in Weimob's Practice" shares practical usage recommendations, monitoring setups with Prometheus and Grafana, field‑type guidance, and solutions to common operational challenges, offering developers actionable insights for high‑performance search deployments.

Big Data · Elasticsearch · Performance Optimization
0 likes · 5 min read
Zuoyebang Tech Team
Dec 22, 2023 · Databases

Unlocking Intelligent Database Operations: Inside Zyb’s Multi‑Cloud Platform

This article details how Zyb’s multi‑cloud database platform integrates diverse database types, a unified proxy layer, intelligent lifecycle management, automated task orchestration, monitoring, resource allocation, backup, and fault‑handling to achieve efficient, reliable, and secure database operations across cloud environments.

Databases · Intelligent Operations · backup
0 likes · 19 min read
dbaplus Community
Dec 20, 2023 · Operations

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

Cluster Management · Kafka · Multi‑tenant
0 likes · 21 min read
Architect
Dec 15, 2023 · Industry Insights

How Bilibili Engineered a Scalable Live‑Commerce Platform from Zero to One

This article details Bilibili's step‑by‑step transformation of a fragmented, high‑coupling live‑commerce system into a modular, platform‑centric architecture, covering product middle‑platform construction, unified standards, storage migration, monitoring with Prometheus/Grafana, and performance gains such as a three‑fold query speedup and a reduction of development cycles from 46 to 5 person‑days.

Bilibili · Scalability · live commerce
0 likes · 24 min read
DevOps Cloud Academy
Dec 14, 2023 · Operations

CI/CD Observability via OpenTelemetry at Grafana Labs

The article explains the importance of CI/CD observability, outlines common pipeline problems, introduces Grafana's GraCIe plugin built on OpenTelemetry, and discusses how enhanced visibility can improve reliability, decision‑making, and future standardization across CI/CD platforms.

DevOps · Grafana · Observability
0 likes · 13 min read
dbaplus Community
Dec 13, 2023 · Fundamentals

How to Design Scalable, Maintainable Software Architecture: From Principles to Practice

This article explores how to build a robust engineering architecture by prioritizing product value, defining clear layered and DDD structures, selecting appropriate technologies, and establishing standards for exception, logging, monitoring, and team collaboration to achieve scalability, maintainability, reliability, security, and high performance.

Domain‑Driven Design · Exception Handling · Layered Architecture
0 likes · 27 min read
Bilibili Tech
Dec 12, 2023 · Backend Development

Platformization of Bilibili's Live‑Streaming E‑Commerce Business: Architecture, Implementation and Governance

Bilibili transformed its fast‑growing live‑streaming e‑commerce operation by constructing a modular platform that separates product, user, and application layers, introduces a unified product middle‑platform, standardized capabilities, real‑time attribute handling, and robust monitoring and governance, thereby reducing technical debt, improving stability, and preparing for hundred‑billion‑level GMV scaling.

Bilibili · Platform Engineering · e-commerce platform
0 likes · 24 min read
Code Ape Tech Column
Dec 12, 2023 · Operations

Centralized Log Collection with Filebeat and Graylog

This article explains how to use Filebeat together with Graylog to collect, ship, store, and analyze logs from multiple environments, covering tool introductions, configuration files, Docker deployment, Spring Boot integration, and practical search syntax for effective log monitoring.

Elasticsearch · Filebeat · Graylog
0 likes · 20 min read
DataFunSummit
Dec 11, 2023 · Big Data

Design and Implementation of a Big Data Metadata Warehouse at Bilibili

This article presents Bilibili's big‑data metadata warehouse, covering its background, technology selection between data‑lake and data‑warehouse solutions, the architecture built on Prometheus, StarRocks, Flink and Routine Load, performance comparisons, diagnostic system design, and future development plans.

Flink · Metadata Warehouse · StarRocks
0 likes · 20 min read
Efficient Ops
Dec 10, 2023 · Cloud Native

How to Build a Complete Kubernetes Monitoring Stack with Prometheus & Grafana

This guide walks through a full Kubernetes monitoring solution using cAdvisor, node_exporter, Prometheus, and Grafana, covering architecture, data collection, service discovery, deployment steps with DaemonSets, and detailed YAML configurations for a production‑ready observability stack.

Grafana · Kubernetes · Prometheus
0 likes · 6 min read
DevOps Coach
Dec 8, 2023 · Frontend Development

How to Add Elastic RUM Monitoring to a Hugo Site

This guide explains what Elastic Real User Monitoring (RUM) is, outlines its key benefits, and provides step‑by‑step instructions with code snippets for integrating the Elastic RUM JavaScript agent into a Hugo static site, including configuration parameters and how to view the collected data in Kibana.

APM · Frontend · Hugo
0 likes · 14 min read
Yunxuetang Frontend Team
Dec 8, 2023 · Frontend Development

Key Front-End Trends and Techniques to Watch in 2023

2023 saw rapid evolution in the front‑end ecosystem, highlighted by major events, a controversial Gemini AI demo, SkyWalking‑based performance and error monitoring, innovative text‑overflow handling, CSS techniques that boost long‑list rendering by up to seven times, and an automatic, non‑intrusive skeleton‑screen generation solution.

2023 · Frontend · Performance
0 likes · 4 min read
Open Source Linux
Dec 8, 2023 · Operations

Top 5 Log Management Tools Every DevOps Engineer Should Know

This article reviews five leading log management solutions—Graylog, LogDNA, ELK Stack, Grafana Loki, and Splunk—detailing their core components, key features, and why they are valuable for monitoring, troubleshooting, and securing modern IT environments.

DevOps · ELK Stack · Grafana Loki
0 likes · 7 min read
HomeTech
Dec 8, 2023 · Mobile Development

Automotive Home Push Platform Architecture and Future Development

This article introduces the architecture and core functions of the Automotive Home Push Platform, covering its development history, technical implementation, monitoring system, and future plans for intelligent message distribution.

architecture · cloud-native · machine learning
0 likes · 9 min read
Architect
Dec 5, 2023 · Backend Development

How to Build an Efficient, Low‑Complexity Microservices Architecture

This article outlines nine practical best‑practice steps for designing a low‑complexity, high‑efficiency microservices ecosystem, covering principles such as the Single Responsibility Principle, cross‑functional team organization, appropriate tooling, asynchronous communication, DevSecOps security, independent data stores, isolated deployment, orchestration, and effective monitoring, each illustrated with concrete examples.

Backend Architecture · DevOps · DevSecOps
0 likes · 14 min read
Efficient Ops
Dec 3, 2023 · Artificial Intelligence

How to Build a Zabbix Expert Advisor with GPT‑4 in Minutes

This guide walks you through why GPT‑4 outperforms GPT‑3.5, shows step‑by‑step how to create a Zabbix expert consultant using the new GPTs feature, and explains advanced configuration, knowledge‑base feeding, testing, and future possibilities for AI‑enhanced monitoring.

AI Assistant · GPT-4 · Knowledge Base
0 likes · 7 min read
Open Source Linux
Dec 1, 2023 · Operations

10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help automate, monitor, and manage infrastructure efficiently.

Operations · automation · devops tools
0 likes · 8 min read
Architect
Nov 30, 2023 · Cloud Native

From Monolith to Resilient Microservices: A Step‑by‑Step Architecture Evolution

The article walks through a real‑world online supermarket project, showing how a simple monolithic system evolves into a fully‑featured microservice architecture, detailing each refactoring stage, the problems encountered, and the concrete solutions such as service extraction, database sharding, monitoring, tracing, gateways, service discovery, reliability patterns, testing, and service‑mesh adoption.

Service Mesh · Tracing · architecture
0 likes · 25 min read
DevOps
Nov 29, 2023 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the journey of transforming a simple online supermarket from a monolithic application to a fully fledged microservice architecture, highlighting the motivations, design decisions, component breakdown, operational challenges, monitoring, tracing, resilience patterns, testing strategies, and the role of service meshes.

DevOps · Service Mesh · architecture
0 likes · 21 min read
Architecture and Beyond
Nov 25, 2023 · Operations

Effective Log Management Strategy: Standards, SDK Integration, and Lifecycle Practices

The article outlines common logging problems and presents a comprehensive six‑step strategy—including clear logging standards, systematic standard management, a unified SDK, centralized log management systems, regular standard reviews, and lifecycle deprecation—to transform chaotic logs into a reliable tool that boosts development efficiency.

Log Management · Logging · Operations
0 likes · 7 min read
Architect
Nov 24, 2023 · Industry Insights

How We Evolved the Voice Chat Room Architecture to Scale with Real‑Time Interaction

This article chronicles the year‑long evolution of the voice‑chat room system, detailing how product‑driven requirements forced successive redesigns of both the live‑streaming and RTC subsystems, the introduction of session‑and‑channel abstractions, migration of mic‑seat management to the backend, and the implementation of monitoring, testing, and deployment practices that keep the architecture stable and extensible.

Domain‑Driven Design · RBAC · Scalability
0 likes · 28 min read
dbaplus Community
Nov 23, 2023 · Operations

How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples

This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and accuracy, details rule‑based, blacklist/whitelist, ratio‑based, and intelligent noise‑reduction techniques, shares Java code examples, and shows measurable results after applying the governance process.

Alert Noise Reduction · Incident Management · Operations
0 likes · 13 min read
Sanyou's Java Diary
Nov 23, 2023 · Backend Development

From Monolith to Microservices: A Complete Journey with Real‑World Examples

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully decoupled microservice architecture, covering initial requirements, common pitfalls, service decomposition, database splitting, monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing, frameworks, and service mesh, while illustrating each step with diagrams and practical advice.

Circuit Breaker · microservices · monitoring
0 likes · 22 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes (such as the Gaokao and the Tokyo and Beijing Olympics) by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99% uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic
0 likes · 9 min read
Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform—introducing second‑level metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool—to dramatically reduce order‑related fault discovery time from minutes to under one minute.

Operations · TSDB · cloud-native
0 likes · 22 min read
Alibaba Cloud Native
Nov 18, 2023 · Cloud Native

How eBPF Powers Next‑Gen Observability and Root‑Cause Analysis in Kubernetes

This talk explains the three major observability challenges in Kubernetes, demonstrates how eBPF enables comprehensive, low‑overhead data collection across all stack layers, and outlines a practical workflow that combines architecture awareness, application‑level metrics, and fault‑tree analysis to achieve automated root‑cause diagnosis.

Fault Diagnosis · Kubernetes · eBPF
0 likes · 21 min read
Aikesheng Open Source Community
Nov 15, 2023 · Databases

Understanding Redis Hotkeys: Issues, Detection Methods, and Mitigation Strategies

This article explains what Redis hotkeys are, the performance and replication problems they cause, various techniques for detecting them (client statistics, MONITOR, the redis-cli --hotkeys option, and TCP packet capture), and practical mitigation approaches such as sharding, multi‑level caching, and monitoring optimization.

HotKey · Performance · Redis
0 likes · 9 min read
JD Retail Technology
Nov 8, 2023 · Operations

Technical Strategies for Ensuring System Stability During Large‑Scale Promotional Events

The article analyzes the importance of system stability during major sales promotions, presents data‑driven insights on traffic and revenue, identifies key challenges such as massive traffic, data volume, and complex workflows, and offers comprehensive operational, application, storage, and monitoring measures to guarantee reliable performance under extreme load.

Database · deployment · large‑scale promotion
0 likes · 13 min read
DataFunSummit
Nov 6, 2023 · Big Data

Building and Managing Huolala's User Event Tracking System: Architecture, Governance, and Monitoring

This article details Huolala's user event tracking system, covering its background, challenges, the construction of a four‑module management platform, backend SDK design, monitoring and quality assurance mechanisms, and future plans for service integration, data lineage, and governance optimization.

backend SDK · data governance · data pipeline
0 likes · 16 min read
Architect's Guide
Nov 6, 2023 · Operations

Comparison of Prometheus and Zabbix Monitoring Tools

This article compares the open‑source monitoring solutions Prometheus and Zabbix, outlining their histories, architectures, data collection methods, scalability, storage models, configuration complexity, community activity, and suitability for different environments such as traditional servers versus cloud‑native container platforms.

Operations · Prometheus · cloud-native
0 likes · 8 min read
NetEase LeiHuo Testing Center
Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Operations · Rate Limiting · monitoring
0 likes · 15 min read
Data Thinking Notes
Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform
0 likes · 27 min read
Efficient Ops
Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Observability
0 likes · 6 min read
MaGe Linux Operations
Oct 30, 2023 · Operations

Boost DevOps with Docker: Automation, Monitoring, and Log Management

This article explains how Docker integrates with DevOps practices to enhance automation, streamline continuous integration and deployment, enable comprehensive container, application, and infrastructure monitoring, and centralize log collection and analysis, providing practical code examples for building, testing, deploying, and managing services efficiently.

DevOps · Log Management · automation
0 likes · 8 min read
MaGe Linux Operations
Oct 27, 2023 · Cloud Native

Deploy Grafana and Prometheus on Kubernetes in Minutes

This guide walks you through preparing a Kubernetes cluster, creating deployment manifests, configuring Grafana and Prometheus, and verifying the monitoring setup, including code snippets and step‑by‑step commands for a seamless installation on a lightweight cloud server.

DevOps · Grafana · Kubernetes
0 likes · 7 min read
NetEase Cloud Music Tech Team
Oct 27, 2023 · Databases

Corona Technical Series: Time-Series Databases in Corona

The article explains how Corona uses three time‑series databases: InfluxDB for storing pre‑aggregated user metrics and platform health data, ClickHouse for real‑time multidimensional log analysis with aggregations, and Elasticsearch for full‑text searchable log monitoring. It details their schema designs and gives query examples.

ClickHouse · Corona · Database Architecture
0 likes · 19 min read
Su San Talks Tech
Oct 27, 2023 · Operations

What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review

This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.

Cloud Services · Incident Response · disaster recovery
0 likes · 12 min read