Tagged articles
Mingyi World Elasticsearch
Jan 13, 2025 · Operations

Top Logstash Interview Questions 11‑20: Answers and Practical Configurations

This article provides concise answers and example configurations for ten common Logstash interview questions (numbers 11 through 20), covering HTTP input/poller plugins, the Split filter, pipeline debugging, performance monitoring with Metricbeat, Grok failure handling, secure communication, multi‑source collection, multiple outputs, differences from Elasticsearch ingest pipelines, and Kibana pipeline management.

Elasticsearch, Grok, Logstash
Open Source Linux
Jan 13, 2025 · Operations

Key Lessons from 2024 Major Service Outages and How to Prevent Future Downtime

The article reviews major 2024 service outages—from Alibaba Cloud to OpenAI—highlights their root causes, and offers practical operations strategies such as disaster recovery, regular backups, load balancing, monitoring, performance tuning, and capacity planning to reduce future downtime.

Operations, capacity planning, disaster recovery
Java Backend Technology
Jan 9, 2025 · Backend Development

How DynamicTp Enables Real‑Time ThreadPool Tuning and Monitoring in Java

DynamicTp is a SpringBoot‑compatible framework that extends ThreadPoolExecutor to allow live adjustment of pool parameters, real‑time monitoring, multi‑platform alerts, and seamless integration with popular configuration centers, helping Java services achieve higher performance and reliability.

Dynamic Configuration, SpringBoot, ThreadPool
Su San Talks Tech
Jan 8, 2025 · Backend Development

How DynamicTp Enables Real‑Time Thread Pool Monitoring and Auto‑Tuning in Java

DynamicTp extends Java's ThreadPoolExecutor with zero‑intrusion configuration, real‑time parameter adjustment, comprehensive monitoring, and multi‑channel alerting, allowing developers to dynamically tune thread pools across microservices using popular configuration centers and integrate with tools like Micrometer and Grafana.

DynamicTp, Java, Micrometer
macrozheng
Jan 7, 2025 · Backend Development

DynamicTp: Real‑time Monitoring and Dynamic Scaling for SpringBoot Thread Pools

This article introduces DynamicTp, a zero‑intrusion SpringBoot starter that provides real‑time monitoring, dynamic adjustment, and alerting of ThreadPoolExecutor parameters via popular configuration centers, supporting multiple middleware thread pools, various metrics exporters, and extensible SPI interfaces for enterprise‑grade thread‑pool management.

ConfigurationCenter, DynamicThreadPool, ThreadPoolExecutor
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Reliability, fault tolerance, monitoring
IT Architects Alliance
Dec 29, 2024 · Operations

Design Principles and Key Technologies for High‑Availability Systems

The article explains why 24/7 high‑availability systems are essential for modern enterprises and details core design principles, layered architecture, and critical technologies such as redundancy, load balancing, caching, elastic scaling, monitoring, and fault‑tolerance to ensure continuous, reliable service.

Cloud Computing, System Design, high availability
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data, Cluster Management, fault self-healing
Linux Ops Smart Journey
Dec 27, 2024 · Cloud Native

How to Enable Ceph Enterprise Monitoring with Prometheus & Grafana

Learn step‑by‑step how to activate Ceph’s monitoring modules, configure Prometheus to collect Ceph metrics, verify data collection, and integrate Grafana dashboards, including tips on required dependencies and troubleshooting, to ensure reliable, secure storage management in enterprise cloud‑native environments.

Ceph, Grafana, Prometheus
Yang Money Pot Technology Team
Dec 26, 2024 · Frontend Development

Design and Implementation of a Multi‑CDN Disaster Recovery Mechanism for Frontend Resource Loading

This article presents a comprehensive multi‑CDN disaster‑recovery solution for frontend static resources, detailing the background, current issues, goals, SDK‑based architecture, monitoring and retry strategies, data‑reporting mechanisms, evaluation results, and future dynamic scheduling improvements.

CDN, Frontend, Retry
Linux Ops Smart Journey
Dec 20, 2024 · Cloud Native

How to Set Up MinIO Enterprise Monitoring with Prometheus & Grafana

This guide walks you through configuring MinIO's enterprise monitoring panel, generating Prometheus metrics for clusters, nodes, buckets, and resources, integrating them into Grafana dashboards, and verifying successful data collection to enhance data management and operational efficiency.

Grafana, Prometheus, monitoring
Full-Stack DevOps & Kubernetes
Dec 20, 2024 · Operations

20 Must‑Know Production Ops Issues and Quick Fixes

This guide presents twenty common production‑environment problems—from log analysis and database recovery to Kubernetes scheduling—detailing real‑world scenarios, step‑by‑step command solutions, and preventive measures that help engineers quickly diagnose, resolve, and avoid outages.

DevOps, Operations, monitoring
DevOps Operations Practice
Dec 17, 2024 · Backend Development

From CPU Alert to Resolution: A Step‑by‑Step Backend Performance Debugging Guide

This article recounts a midnight CPU alert incident and walks through systematic backend troubleshooting—from initial system checks and JVM profiling to algorithm refactoring, database indexing, Docker‑based isolation, and proactive monitoring—demonstrating how to restore service performance and prevent future outages.

Database, Docker, JVM
转转QA
Dec 13, 2024 · Operations

Data Point Governance and Quality Management in ZhiZhu QA Process

This article describes how ZhiZhu's quality inspection team introduced a two‑stage data‑point governance framework—initial manual enforcement followed by automated system monitoring, real‑time validation, user‑behavior trees, and dashboards—to dramatically improve data quality, testing efficiency, and issue resolution.

QA, dashboard, monitoring
Efficient Ops
Dec 11, 2024 · Operations

Thanos vs VictoriaMetrics: Which Prometheus Storage Solution Wins for Scale and Cost?

This article compares Thanos and VictoriaMetrics as long‑term storage solutions for Prometheus, evaluating their architecture, write and read paths, reliability, consistency, performance, scalability, high‑availability, and hosting costs to help you choose the most suitable option for your monitoring stack.

Long‑term Storage, Thanos, VictoriaMetrics
JD Cloud Developers
Dec 10, 2024 · Operations

How We Boosted Inventory Platform Stability 24× with Smart Traffic Splitting and Redis Caching

This article examines the stability challenges of an e‑commerce inventory platform—including workflow complexity, database hotspots, and high‑frequency calculations—and details comprehensive solutions such as traffic splitting, gray releases, Redis caching, data consistency mechanisms, rate limiting, and monitoring enhancements that together improved throughput by 24× and reduced latency dramatically.

Operations, Redis, inventory
Top Architect
Dec 9, 2024 · Databases

Database Monitoring and Slow Query Log Management Guide

This article is a practical guide to monitoring database system resources with Linux commands, configuring MySQL slow query logging, and analyzing performance issues. It also outlines best practices and promotes a ChatGPT community and related services.

DevOps, Logging, MySQL
Efficient Ops
Dec 8, 2024 · Operations

Diagnosing High Load with Low CPU on Linux: Commands and Tips

This guide explains how to analyze and troubleshoot situations where a Linux system shows high load averages despite low CPU usage, covering common load analysis methods, key commands like top, vmstat, iostat, and practical solutions for I/O bottlenecks and stuck processes.

CPU, Linux, Load
Top Architect
Dec 5, 2024 · Databases

Database Monitoring and Slow Query Log Management Guide

This article explains how database administrators can monitor system resource usage with commands like top, iostat, and vmstat, and configure MySQL slow query logging, including enabling the log, setting thresholds, viewing logs, and applying best‑practice recommendations for analysis and issue resolution.

Database Administration, Linux commands, MySQL
Linux Ops Smart Journey
Dec 3, 2024 · Cloud Native

How to Set Up Harbor Monitoring with Prometheus and Grafana

Learn step‑by‑step how to deploy the harbor‑exporter, configure Prometheus to scrape Harbor metrics, verify data collection, and add official Grafana dashboards, enabling real‑time monitoring of your Harbor registry for improved stability, security, and performance in cloud‑native environments.

Grafana, Harbor, Kubernetes
58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.
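
The summary above mentions burn‑rate alerts without showing the calculation. As a minimal sketch, and not 58 Group's actual implementation, burn rate is commonly defined as the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the two‑window check, and the 14.4 default threshold are illustrative assumptions taken from common SRE practice rather than from the article.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are being spent.

    error_ratio: failed requests / total requests in the observation window.
    slo_target:  availability objective as a fraction, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return error_ratio / error_budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Hypothetical two-window check: page only if both windows burn too fast.

    A threshold of 14.4 corresponds to spending 2% of a 30-day budget in one
    hour (0.02 * 720 hours / 1 hour); treat it as an assumption to tune.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


if __name__ == "__main__":
    # 2% errors against a 99.9% SLO burns the budget roughly 20x too fast.
    print(round(burn_rate(0.02, 0.999), 2))
    print(should_page(0.02, 0.02, 0.999))
```

Requiring both a long and a short window to exceed the threshold is one way to cut false alarms: the long window shows the burn is sustained, and the short window shows it is still happening.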

Error Budget, Observability, SLO
JD Cloud Developers
Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

SLA, SLI, SLO
Liangxu Linux
Nov 23, 2024 · Cloud Native

How a Solo Engineer Runs a Full‑Stack SaaS on Kubernetes

This article details how a single‑person startup leverages Kubernetes on AWS EKS to handle load balancing, automatic DNS, TLS, autoscaling, monitoring, alerting, secret management, and CI/CD for a Django‑based SaaS, illustrating practical configurations, code snippets, and infrastructure‑as‑code patterns.

AWS EKS, Django, GitOps
ITPUB
Nov 23, 2024 · Operations

Zabbix vs Prometheus: Which Monitoring Tool Wins for Modern Cloud Environments?

This article compares Zabbix and Prometheus across performance, data collection, visualization, and alerting, highlighting their architectural differences, ecosystem strengths, and suitability for traditional data‑center monitoring versus dynamic cloud‑native workloads.

Observability, Prometheus, alerting
Alibaba Cloud Observability
Nov 22, 2024 · Cloud Native

Mastering Alibaba Cloud Observability: Tagging Strategies for Efficient Resource Management

This article explains how Alibaba Cloud’s observability suite uses tag metadata to organize, monitor, and secure resources across business, endpoints, applications, middleware, and containers, offering best‑practice design principles and real‑world case studies for building scalable, tag‑driven monitoring dashboards.

Alibaba Cloud, Cloud Native, Tag Management
Cognitive Technology Team
Nov 19, 2024 · Operations

Compile-Time Automatic Instrumentation for Go Applications: Principles, Modular Extensions, and Practical Usage

This article introduces a zero‑intrusive compile‑time automatic instrumentation framework for Go, explains its preprocessing and code‑injection mechanisms, and provides modular extension principles with concrete examples such as HTTP header logging, sort algorithm replacement, SQL injection protection, and gRPC traffic control.

Automatic Instrumentation, Go, Modular Extension
Ops Development Stories
Nov 19, 2024 · Operations

How to Install and Explore Nightingale v7.7: New Features, Upgrade Guide, and Hands‑On Demo

This article introduces Nightingale monitoring's final v7.7 release, outlines its new features and major v7 changes, provides step‑by‑step upgrade instructions, and walks through a Docker‑based installation, data‑source integration, dashboard import, and alert‑rule configuration with DingTalk notifications.

Alert Rules, Docker, Operations
Spring Full-Stack Practical Cases
Nov 18, 2024 · Backend Development

Master Spring Boot 3 Actuator: Custom Endpoints, Health Checks, and Monitoring

Explore comprehensive Spring Boot 3 Actuator capabilities—including enabling CORS, creating custom endpoints, configuring health indicators, HTTP tracing, security auditing, and process monitoring—through detailed explanations, YAML configurations, and full Java code examples, empowering developers to effectively monitor and manage production-ready applications.

Actuator, Custom Endpoint, Spring Boot
JD Retail Technology
Nov 13, 2024 · R&D Management

Guidelines for New Project Managers: Initiation, Planning, Execution, and Monitoring

This article shares practical advice for novice project managers, covering the four process groups—initiation, planning, execution, and monitoring—through real‑world examples, stakeholder identification, risk handling, change control, and communication techniques to help them deliver value and grow their teams.

Planning, Project Management, execution
Linux Ops Smart Journey
Nov 12, 2024 · Databases

Master PostgreSQL Monitoring with Grafana: Step-by-Step Guide

Learn how to deploy postgres_exporter, configure PostgreSQL extensions, set up Prometheus scraping, and create Grafana dashboards for comprehensive PostgreSQL performance monitoring, complete with command-line instructions and tips for verifying data collection and visualizing metrics.

Database, Grafana, PostgreSQL
Java Tech Enthusiast
Nov 10, 2024 · Databases

Database Monitoring and Logging Practices

Effective database administration relies on continuous monitoring of system resources—CPU, memory, disk I/O, and network—using tools like top, iostat, and vmstat, alongside logging slow MySQL queries, analyzing performance bottlenecks, and following best practices such as automated monitoring, centralized log management, regular audits, and log backups.

Database, Linux, Logging
Linux Cloud Computing Practice
Nov 5, 2024 · Operations

10 Essential Linux Ops Tools Every Engineer Should Master

This article introduces ten indispensable Linux operations tools—Shell scripting, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical use cases, advantages, and practical examples to help engineers automate and monitor infrastructure efficiently.

Configuration Management, DevOps, Operations
Linux Ops Smart Journey
Nov 3, 2024 · Cloud Native

Build a Robust Kubernetes Monitoring System with Prometheus and HAProxy

This guide walks you through setting up a comprehensive Kubernetes monitoring solution—covering component metrics collection, configuring HAProxy for network access, exposing metrics from kube-proxy, Calico, and kube-state-metrics, and integrating everything into Prometheus for reliable cluster health visibility.

Calico, HAProxy, Kubernetes
BirdNest Tech Talk
Nov 3, 2024 · Databases

Master ClickHouse Write Performance: Proven Optimization Strategies

This comprehensive guide walks through ClickHouse write‑performance optimization, covering hardware choices, system and application‑level tuning, async insert settings, Buffer engine configuration, storage compression, real‑world case studies, monitoring queries, and actionable best‑practice recommendations.

Async Insert, Buffer Engine, ClickHouse
Java Tech Enthusiast
Nov 1, 2024 · Databases

Quick MySQL Configuration and Monitoring Queries

This guide presents essential MySQL configuration and monitoring queries—covering connection limits, Binlog/GTID status, InnoDB settings—plus a one‑click script that consolidates these checks, enabling quick health assessments and more efficient routine inspections of MySQL servers.

Database, MySQL, SQL
Java Architect Essentials
Oct 27, 2024 · Operations

Integrating Prometheus with Spring Boot for Real‑time Monitoring and Grafana Visualization

This article explains how to use Prometheus together with Spring Boot Actuator and Micrometer to collect, expose, and visualize application metrics, including step‑by‑step dependency configuration, YAML settings, Docker deployment of Prometheus and Grafana, and adding custom metrics for comprehensive monitoring.

Actuator, Grafana, Micrometer
dbaplus Community
Oct 23, 2024 · Backend Development

Mastering Java Thread Pools: Common Pitfalls and Best Practices

This article outlines how to correctly create, monitor, and configure Java ThreadPoolExecutor instances, explains why using the Executors factory can cause OOM, recommends separate named pools per business, provides formulas for sizing CPU‑bound and I/O‑bound workloads, and highlights real‑world pitfalls and dynamic‑configuration solutions.
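
The summary refers to sizing formulas without reproducing them. The sketch below uses two widely cited heuristics, which may differ from the ones in the article: roughly cores + 1 threads for CPU‑bound work, and cores × target utilization × (1 + wait time / compute time) for I/O‑bound work. Function and parameter names are hypothetical, and the results are starting points to validate with load tests, not fixed answers.

```python
import math


def cpu_bound_pool_size(cores: int) -> int:
    """Heuristic for CPU-bound tasks: one thread per core plus one spare."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores + 1


def io_bound_pool_size(cores: int,
                       wait_ms: float,
                       compute_ms: float,
                       target_utilization: float = 1.0) -> int:
    """Heuristic for I/O-bound tasks.

    threads = cores * target_utilization * (1 + wait_time / compute_time)

    wait_ms:    average time a task spends blocked (network, disk, locks).
    compute_ms: average time a task spends actually using the CPU.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if compute_ms <= 0:
        raise ValueError("compute_ms must be positive")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    threads = cores * target_utilization * (1 + wait_ms / compute_ms)
    return max(1, math.ceil(threads))


if __name__ == "__main__":
    # An 8-core host running tasks that wait 90 ms for every 10 ms of CPU.
    print(cpu_bound_pool_size(8))
    print(io_bound_pool_size(8, wait_ms=90, compute_ms=10))
```

Because real wait and compute times shift with traffic, a computed size like this pairs naturally with the dynamic‑configuration approach the summary mentions, so the value can be adjusted without redeploying.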

Concurrency, monitoring, spring
DeWu Technology
Oct 23, 2024 · Backend Development

Automated Traffic Rule Inspection with Flow Replay Platform

The Flow Replay Platform automates traffic‑rule inspection by recording traffic from all environments, letting engineers define jsonPath‑based interface rules that continuously validate pre‑release and production traffic, instantly alerting anomalies, reducing false positives, accelerating release verification, and cutting manual testing effort, as demonstrated by discovered coupon‑related bugs.

automated testing, backend, monitoring
Efficient Ops
Oct 21, 2024 · Operations

Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Cloud Native, Observability, Operations
JD Tech Talk
Oct 21, 2024 · Operations

Observability and Quality Assurance: Strategies for Test Teams

This article examines how test teams can enhance application observability and quality assurance by distinguishing observability from traditional monitoring, defining goals, outlining a monitoring foundation, and proposing module‑level and system‑level strategies for proactive fault detection, data analysis, and alerting.

Observability, monitoring, quality assurance
JD Cloud Developers
Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Observability, Operations, monitoring
Test Development Learning Exchange
Oct 11, 2024 · Fundamentals

Fundamentals of Performance Testing: Concepts, Metrics, Tools, and Best Practices

This article provides a comprehensive overview of performance testing fundamentals, covering core concepts, key metrics, common testing tools, test design, load generation, result analysis, bottleneck identification, optimization techniques, cloud and micro‑service testing, monitoring, reporting, challenges, and cost‑benefit considerations.

Benchmarking, Optimization, load-testing
Java Architecture Stack
Oct 11, 2024 · Operations

25 Proven Linux Performance Tuning Tricks to Boost System Speed

Learn 25 practical Linux performance tuning techniques—from adjusting kernel parameters like swappiness and ulimit to optimizing I/O schedulers, network buffers, and enabling HugePages—each with clear commands and step‑by‑step instructions to help you maximize system responsiveness and throughput.

I/O scheduler, Kernel Parameters, Linux
DevOps Operations Practice
Oct 10, 2024 · Operations

Seven Key Truths About Operations: Downtime, Automation, Prevention, Technology as a Tool, DevOps, Communication, and Security

Effective operations management acknowledges inevitable downtime, emphasizes automation, prioritizes proactive prevention, treats technology as a means rather than an end, integrates closely with development through DevOps, relies on strong communication, and continuously addresses pervasive security challenges to minimize business impact.

Automation, Operations, Security
Efficient Ops
Oct 9, 2024 · Cloud Computing

How One Engineer Runs a Full SaaS on Kubernetes with Minimal Effort

This article details how a solo engineer built and operated a SaaS platform on AWS using Kubernetes, covering infrastructure overview, automatic DNS, TLS, load balancing, CI/CD rollouts, autoscaling, caching, secret management, monitoring, logging, error tracking, and cost‑effective operations.

Autoscaling, Kubernetes, aws
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation, Infrastructure, Operations
Efficient Ops
Sep 29, 2024 · Operations

Essential Linux Ops Tools Every Sysadmin Must Master

This guide outlines the ten core tool categories—from Linux basics and networking services to scripting, firewalls, monitoring, clustering, and backup—that a Linux operations engineer should master to become an effective sysadmin.

Database, Linux, Operations
ITPUB
Sep 29, 2024 · Databases

Quick Oracle SQL Monitoring Script – Copy‑Paste Ready

This article shares a ready‑to‑run Oracle SQL*Plus script that lists active sessions with details such as instance ID, username, execution time, SQL text snippet, current event, and wait seconds, plus an example output for immediate performance troubleshooting.

Oracle, SQL, monitoring
IT Architects Alliance
Sep 28, 2024 · Operations

How DevOps Transforms IT: Core Principles, Practices, and Real-World Success

This article explores the DevOps mindset, its core principles such as collaboration, automation, continuous improvement, and customer focus, outlines essential practices like CI/CD, IaC, monitoring, microservices, and provides a step‑by‑step adoption roadmap illustrated with a detailed case study and future trends.

Automation, Cloud Native, DevOps
Python Programming Learning Circle
Sep 28, 2024 · Operations

Essential Skills for Becoming a Successful DevOps Engineer

The article outlines the key competencies a DevOps engineer must master—including programming, Linux system knowledge, configuration management, infrastructure-as-code, CI/CD tools, networking and security, monitoring, and cloud services—to guide readers on building a comprehensive skill set for effective DevOps practice.

DevOps, Linux, Operations
Alibaba Cloud Native
Sep 27, 2024 · Cloud Native

How SAE’s Cloud‑Native Event Center Tackles Data Explosion and Real‑Time Alerts

The article explains the design and implementation of the Serverless Application Engine (SAE) Event Center, highlighting its cloud‑native architecture, the distinction from traditional monitoring, challenges like data explosion and full GC, and the distributed‑cache solution that enables efficient real‑time event aggregation, notification, and future AI‑driven diagnostics.

Data Explosion, Distributed Cache, SAE
php Courses
Sep 27, 2024 · Backend Development

Developing Real-Time Monitoring Applications with PHP and WebSocket

This article explains how to build real-time monitoring applications using PHP and the WebSocket protocol, covering the fundamentals of WebSocket, setting up a Ratchet server, creating client-side JavaScript connections, and providing complete code examples such as a stock price monitor.

backend, monitoring, real-time
DevOps Engineer
Sep 25, 2024 · Operations

Understanding What DevOps Truly Is: Principles Over Tools

The article clarifies that DevOps is not defined by specific tools like Kubernetes or Jenkins, but by the ability to design robust systems that ensure smooth deployments, effortless scaling, reliable operation, early issue detection, and effective team collaboration, emphasizing enduring principles over changing technologies.

Automation, Collaboration, ContinuousDelivery
Efficient Ops
Sep 24, 2024 · Operations

Master Linux Performance in 60 Seconds: 10 Essential Commands

When a Linux server shows performance issues, the first minute is critical. This guide walks you through ten standard commands drawn from nine command‑line tools (uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, and top), explaining what each metric means and how to interpret the output for quick troubleshooting.

Linux, Operations, monitoring
dbaplus Community
Sep 23, 2024 · Operations

How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture

Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.

Flink, Observability, Prometheus
0 likes · 31 min read
Ctrip Technology
Sep 23, 2024 · Frontend Development

Intelligent Alert Attribution System for Ctrip Hotel Frontend: Design, Implementation, and Outcomes

This article details the design and deployment of an intelligent alert attribution system for Ctrip Hotel's front‑end, describing the background challenges, the unified data pool, weighted alert rules, three attribution algorithms, achieved improvements in accuracy and troubleshooting speed, and future enhancement plans.

Alert, Frontend, Machine Learning
0 likes · 18 min read
Open Source Linux
Sep 19, 2024 · Operations

Mastering Linux Performance: From CPU/Memory Profiling to Flame Graphs

This guide explains how to systematically diagnose Linux performance issues using tools such as top, vmstat, perf, and flame graphs, covering CPU, memory, disk I/O, network, and load analysis, and demonstrates a real-world nginx case study with step‑by‑step commands and visualizations.

Profiling, flame graphs, monitoring
0 likes · 21 min read
dbaplus Community
Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

Automation, Incident Management, MTTR
0 likes · 23 min read
Test Development Learning Exchange
Sep 13, 2024 · Fundamentals

Python Standard Library for Linux: File Operations, Process Management, Networking, System Info, Time, Logging, Monitoring, Compression, and Environment Variables

This article provides a comprehensive guide to Python's standard libraries for Linux, covering file and directory manipulation, process control, socket networking, system information retrieval, date and time handling, logging, file monitoring, compression, and environment variable management with clear code examples.

Logging, file-operations, monitoring
0 likes · 10 min read
Open Source Linux
Sep 13, 2024 · Operations

Essential Bash Scripts for Server Monitoring, Automation, and Security

This article presents a collection of practical Bash scripts that cover file consistency checks, scheduled log management, network traffic monitoring, numeric analysis, FTP downloads, user input handling, Nginx 502 detection, variable assignments, bulk file renaming, text processing, port scanning, word filtering, command menus, SSH automation with Expect, user creation, Apache monitoring, password rotation, iptables rate‑limiting, and IP validation, providing sysadmins with ready‑to‑use solutions for everyday Linux operations.

Bash, Server Automation, Shell scripting
0 likes · 25 min read
Architect
Sep 12, 2024 · Operations

How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Cloud Native, Flink, Metrics
0 likes · 30 min read
FunTester
Sep 11, 2024 · Operations

Pinterest Performance Plan: Real‑User Monitoring, Regression Detection, and Alerting

Pinterest’s performance program details how the team defines custom Pinner Wait Time metrics, uses real‑user monitoring and fine‑grained alerts to detect regressions quickly, and follows structured root‑cause analysis and ownership processes to prevent performance degradation across web surfaces.

Operations, monitoring, real‑user
0 likes · 18 min read
JD Tech
Sep 9, 2024 · Backend Development

JADE Dynamic Thread Pool Integration and Visualization Platform Practice

This article explains how to integrate JD's JADE dynamic thread‑pool component with the Wanxiang visualization platform, covering Maven dependencies, configuration files, Spring bean setup, thread‑pool creation, runtime monitoring, underlying source‑code principles, and common pitfalls for stable backend services.

Dynamic Thread Pool, JADE, Java
0 likes · 20 min read
dbaplus Community
Sep 8, 2024 · Operations

10 Essential Ops Practices to Prevent System Failures

This article compiles ten practical operations‑engineer guidelines—ranging from change rollbacks and safe command aliases to backup verification, monitoring, and cautious automated failover—to help maintain high availability and avoid costly production incidents.

Automation, Linux, MySQL
0 likes · 18 min read
Software Development Quality
Sep 6, 2024 · R&D Management

How to Boost Release Quality: Proven Practices for R&D Teams

This guide outlines essential strategies to improve release quality in R&D, covering strict testing processes, automated CI/CD pipelines, containerization, real‑time monitoring, alert mechanisms, and feedback loops, while also defining key evaluation metrics and practical steps for effective management of these indicators.

ci/cd, monitoring, release quality
0 likes · 10 min read
Soul Technical Team
Sep 2, 2024 · Databases

Comparative Analysis of VictoriaMetrics and Thanos for Large‑Scale Metric Storage

This article examines the migration from Thanos to VictoriaMetrics for large‑scale metric storage, detailing background challenges, VictoriaMetrics architecture and storage engine, data write and read processes, and a comparative analysis of performance, scalability, and operational costs between the two systems.

Observability, Thanos, Time Series Database
0 likes · 15 min read
OPPO Kernel Craftsman
Aug 30, 2024 · Cloud Native

Middleware Containerization and Cloud‑Native Transformation at OPPO

OPPO transformed its sprawling, manually‑provisioned middleware clusters into a cloud‑native, containerized platform by building custom Kubernetes controllers, IP‑preserving StatefulSets, resource‑isolated containers, automated monitoring and self‑healing workflows, enabling rapid provisioning, efficient utilization, fault‑tolerant scaling and future serverless and service‑mesh integration.

Containerization, Kubernetes, Operator
0 likes · 20 min read
Top Architect
Aug 29, 2024 · Operations

Setting Up Nginx Log Monitoring with Loki, Promtail, and Grafana

This article walks through a complete, step‑by‑step solution for collecting Nginx access logs, converting them to JSON, shipping them with Promtail to Loki, and visualizing the data in Grafana, including Docker deployment, dashboard import, and world‑map plugin installation.

Grafana, Logging, Loki
0 likes · 10 min read
FunTester
Aug 28, 2024 · Operations

Shadow Testing: Reducing Risk and Ensuring Seamless System Changes

Shadow testing is a parallel deployment strategy that minimizes the risk of system changes, safeguards user experience, validates performance and data integrity, and provides a controlled environment for comprehensive testing, supported by a suite of modern tools and real‑world case studies.

Containerization, Deployment, Shadow Testing
0 likes · 17 min read
Open Source Linux
Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Incident Response, Linux, MySQL
0 likes · 17 min read