Tagged articles
2195 articles
DataFunTalk
Sep 13, 2020 · Big Data

Online Sample Generation with Flink: Architecture and Implementation

This article explains why Flink is chosen for online sample generation, describes the end‑to‑end implementation steps—including stream union, state‑timer processing, and output formatting—covers state backend choices, monitoring, validation, fault handling, and platformization for scalable real‑time machine‑learning pipelines.

Flink, Kafka, Online Sample Generation
0 likes · 11 min read
Java Backend Technology
Sep 12, 2020 · Databases

Why Redis Gets Slow: Common Latency Causes and How to Diagnose Them

This article explains the typical reasons Redis latency spikes—such as high‑complexity commands, large keys, concentrated expirations, memory limits, fork overhead, CPU binding, AOF settings, swap usage, and network saturation—and provides practical steps to monitor, identify, and mitigate each issue.

Memory, Slowlog, monitoring
0 likes · 18 min read
ITPUB
Sep 11, 2020 · Blockchain

How Red Pulse Secured Its Blockchain Platform: Real‑World Attack Lessons

This article details Red Pulse's journey of integrating the NEO blockchain, the security vulnerabilities it faced—from token theft and credential‑stuffing attacks to sophisticated social‑engineering exploits—and the comprehensive technical measures, monitoring tools, and mitigation strategies it implemented to protect its platform and users.

Attack Mitigation, Blockchain, NEO
0 likes · 21 min read
HaoDF Tech Team
Sep 7, 2020 · Operations

Analyzing Latency and Slow Interface Detection in a Full‑Chain Monitoring System

This article explains how latency is used as a key indicator for application risk identification, defines slow interfaces, describes why percentile‑based thresholds are preferred over averages, and outlines the architecture, task workflow, and practical optimization strategies for a full‑chain monitoring system in a microservice environment.

Latency, Risk Assessment, SRE
0 likes · 14 min read
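The summary's point that percentile thresholds are preferred over averages can be illustrated with a small sketch. The latency values and the nearest-rank `percentile` helper below are hypothetical and are not taken from the article; they only show how a mean can sit far from both the typical request and the slow tail.

```python
import math


def percentile(values, p):
    """Return the p-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest rank: the smallest rank whose share of samples is >= p percent.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests (20 ms) and 5 slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

Here the mean (119 ms) describes neither the typical request (20 ms) nor the slow tail (2000 ms), which is one reason a slow-interface threshold might be set on a high percentile instead.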
New Oriental Technology
Sep 7, 2020 · Operations

Performance Optimization and Stability Enhancement of the Continuation Enrollment System

This article details the background, performance and stability requirements, strategic approach, and concrete initiatives (full‑chain load testing, chaos engineering, monitoring, and targeted optimization projects) undertaken to boost performance by over 300% and improve the high availability of the continuation enrollment platform.

Stability, backend optimization, chaos testing
0 likes · 7 min read
dbaplus Community
Sep 6, 2020 · Operations

Building a High‑Performance Monitoring Alert System with Akka, Dubbo, and Ignite

The article outlines G Bank’s transition from a single‑threaded commercial monitoring solution to a self‑developed, open‑source based alert system that leverages Akka for parallel collection, Apache Dubbo for distributed processing, and Apache Ignite for in‑memory storage, achieving million‑level alert capacity, sub‑100 ms latency, and linear scalability.

Akka, Apache Dubbo, Apache Ignite
0 likes · 17 min read
MaGe Linux Operations
Sep 4, 2020 · Operations

Master Prometheus: From Basics to Full-Scale Monitoring Deployment

This guide walks through Prometheus fundamentals, architecture, components, service discovery, Docker-based deployment, exporter integration, Alertmanager configuration, Grafana visualization, PromQL queries, and Consul service discovery, providing a complete end‑to‑end monitoring solution for cloud‑native environments.

Alertmanager, Consul, Docker
0 likes · 32 min read
Alibaba Cloud Native
Sep 1, 2020 · Cloud Native

CTrip’s CDubbo Journey: Scaling 10k Services with Registration, Monitoring, and Service Mesh

From early .NET ESB attempts to a Java‑based CDubbo framework, CTrip details its migration to Dubbo, covering registration, health checks, CAT monitoring, dynamic configuration, SOA compatibility, testing tools, thread‑less execution, performance gains, extensibility, ecosystem integration, and future service‑mesh standardization.

Registration, cloud-native, microservices
0 likes · 15 min read
Liangxu Linux
Aug 29, 2020 · Operations

Enforcing Clear Git Commit Messages with a Webhook‑Based Monitoring Service

This article explains why consistent Git commit messages matter, presents a detailed commit‑message format with type, scope and subject, shows how to enforce the standard using a webhook that validates messages, monitors large commits, and provides useful statistics for the development team.

code-quality, commit message, monitoring
0 likes · 11 min read
Amap Tech
Aug 28, 2020 · Fundamentals

Git Commit Message Standardization and Monitoring Service

The team introduced an Angular‑style Git commit‑message standard—type(scope): subject in Chinese—and built a webhook‑based monitoring service that validates pushes, alerts violations, tracks diff size and deletions, stores metrics, and visualizes compliance, improving traceability, readability, and automated changelog generation.

DevOps, best-practices, commit message
0 likes · 10 min read
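The `type(scope): subject` format named in the summary above is the kind of header a webhook could check on each push. The following is only a minimal sketch of such a check. The allowed type list, the optional scope, and the `parse_commit_header` name are assumptions borrowed from the common Angular convention, not the team's actual rules.

```python
import re

# Assumed type list (common Angular-style types); the article's list may differ.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional (scope), then ": " and a non-empty subject.
COMMIT_HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def parse_commit_header(header):
    """Return the header's parts as a dict, or None if it violates the format."""
    match = COMMIT_HEADER_RE.match(header)
    return match.groupdict() if match else None


print(parse_commit_header("fix(router): handle empty path"))
print(parse_commit_header("updated stuff"))
```

A validating service could run this on the first line of each pushed commit and raise an alert whenever it returns `None`.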
Java Architect Essentials
Aug 26, 2020 · Backend Development

A Comprehensive Guide to Evolving a Monolithic Online Store into a Robust Microservice Architecture

This article walks through the transformation of a simple online supermarket from a monolithic design to a fully fledged microservice system, explaining the motivations, architectural changes, component selection, common pitfalls, and best‑practice solutions such as service decomposition, database sharding, monitoring, tracing, service mesh, resilience patterns, and testing strategies.

Resilience, Tracing, architecture
0 likes · 22 min read
Architecture Digest
Aug 25, 2020 · Operations

Best Practices and Advanced Topics for Prometheus Monitoring in Kubernetes

This article provides a comprehensive guide on using Prometheus for Kubernetes monitoring, covering fundamental principles, exporter selection, Grafana dashboard creation, memory and storage optimization, high‑availability designs, query performance, cardinality management, and integration with alerting and logging systems.

Exporters, Grafana, Kubernetes
0 likes · 33 min read
Aikesheng Open Source Community
Aug 24, 2020 · Operations

Prometheus Data Query Basics and Practical Usage Guide

This article introduces Prometheus' query language PromQL, explains instant and range vector selectors, label matching, offset handling, storage design, common functions and aggregation operators, and provides practical advice for efficient querying and avoiding performance issues.

Operations, PromQL, Prometheus
0 likes · 13 min read
58 Tech
Aug 19, 2020 · Backend Development

Design and Implementation of a Testing Quality System for the 58.com SSP Advertising Platform

The article details the architecture of 58.com’s SSP advertising platform, identifies three key reliability challenges—data consistency, interface regression, and storage synchronization—and presents a three‑layer testing quality system comprising web‑layer validation, service‑layer automated testing, and data‑layer monitoring with concrete tools and future improvement plans.

MySQL, Redis, SSP
0 likes · 14 min read
Open Source Linux
Aug 17, 2020 · Operations

Step-by-Step Guide to Install and Configure Zabbix on CentOS 7

This tutorial walks you through installing Zabbix on CentOS 7: disabling SELinux and the firewall as prerequisites, adding repositories, installing the server, web, and database components, editing configuration files, securing MariaDB, starting services, and completing the web‑based setup with language customization.

CentOS, Installation, Linux
0 likes · 7 min read
Full-Stack DevOps & Kubernetes
Aug 16, 2020 · Cloud Native

How to Configure Alertmanager, Add WeChat Alerts, and Enable Automatic Service Discovery in Kubernetes

This guide walks through modifying Alertmanager to use a NodePort service, decoding and editing its secret to add custom receivers and a WeChat template, recreating the secret, and extending Prometheus Operator with additional scrape configs for automatic service discovery, including RBAC adjustments and verification steps.

Kubernetes, RBAC, ServiceDiscovery
0 likes · 10 min read
Tencent Cloud Developer
Aug 12, 2020 · Databases

How Autonomous Databases Evolve: From Stone Age to AI‑Driven Self‑Healing

This article traces the evolution of database autonomy from manual, knowledge‑driven operations through tool‑assisted and expert‑level stages to cloud‑native intelligent services, and details Tencent's DBbrain platform, its architecture, performance‑optimization, security, monitoring, cost‑based analysis, and future self‑healing capabilities.

AI Ops, Cloud Databases, DBbrain
0 likes · 29 min read
Java Architect Essentials
Aug 11, 2020 · Operations

Four Essential Linux Monitoring Tools for Operations Engineers

This article introduces four widely used Linux monitoring tools—iotop, htop, IPTraf, and Monit—explaining their features, usage scenarios, and how they help operations engineers diagnose performance issues without a GUI, including real‑time I/O tracking, visual CPU/memory graphs, network traffic analysis, and flexible alerting.

IPTraf, Linux, Monit
0 likes · 7 min read
MaGe Linux Operations
Aug 8, 2020 · Operations

Step-by-Step Guide to Installing and Configuring Zabbix on CentOS 7

This tutorial walks you through disabling SELinux and the firewall, adding Zabbix and EPEL repositories, installing Zabbix server, web, and database components, configuring files, securing MariaDB, starting services, and completing the web‑based setup to get a fully functional monitoring system.

CentOS, Installation, Open-source
0 likes · 7 min read
dbaplus Community
Aug 3, 2020 · Operations

How iQIYI Built a Full‑Link Automated Monitoring Platform for Microservices

iQIYI’s tech product team designed a unified full‑link automated monitoring platform that integrates link, metric, and log collection with deep analysis, enhancing fault localization, performance insight, and scalability across microservices, while addressing limitations of existing tools like ELK, Prometheus, and Dapper.

Metrics, Observability, full‑link
0 likes · 15 min read
Xianyu Technology
Jul 28, 2020 · Operations

ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services that quickly (under five seconds) pinpoints server‑side issues with developer‑level accuracy by aggregating real‑time metrics, applying a decision‑tree model enriched by expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system, while planning broader endpoint coverage and multi‑tenant support.

Big Data, Fault Localization, Operations
0 likes · 12 min read
Top Architect
Jul 27, 2020 · Operations

10 Practical Tips to Boost Web Application Performance Up to 10× with NGINX

This article presents ten actionable recommendations—including reverse‑proxy deployment, load balancing, caching, compression, SSL/TLS tuning, HTTP/2 adoption, software upgrades, Linux and web‑server tuning, and real‑time monitoring—to dramatically improve web application performance, often achieving tenfold speed gains.

Caching, Compression, Web Performance
0 likes · 22 min read
WecTeam
Jul 23, 2020 · Backend Development

How We Reduced WebMonitor Latency from Minutes to Seconds – Architecture & Performance Secrets

This article chronicles the evolution of the WebMonitor front‑end monitoring system, detailing its three‑tier stack, data pipeline upgrades from raw disk sampling to HDFS and Elasticsearch, extensive collector‑side optimizations, Jetty thread and timeout tuning, and the resulting performance gains that lowered response times from minutes to sub‑second levels.

Jetty, data pipeline, java
0 likes · 15 min read
dbaplus Community
Jul 20, 2020 · Operations

How to Build Reliable Monitoring for Low‑Frequency Financial Services

After two years transitioning from e‑commerce to finance, the team shares practical monitoring strategies for low‑frequency financial services, contrasting e‑commerce traffic‑based methods with finance‑specific challenges, and detailing point‑based metrics, hourly success‑rate alerts, aspect‑oriented exception handling, white‑list filtering, and Sentinel‑based circuit breaking.

Aspect Oriented Programming, Circuit Breaking, Financial Services
0 likes · 16 min read
Liangxu Linux
Jul 19, 2020 · Operations

How to Diagnose Linux Performance Issues with Flame Graphs and System Tools

This guide explains how to systematically analyze Linux performance problems—including CPU, memory, disk I/O, network, and load—using 5W2H methodology, built‑in monitoring commands, perf, flame‑graph visualizations, and a real‑world Nginx case study to pinpoint and resolve bottlenecks.

Performance, Troubleshooting, flamegraph
0 likes · 19 min read
Qunhe Technology Quality Tech
Jul 17, 2020 · Operations

How We Built a Robust Monitoring System for Construction Drawing Production

This article describes how our team designed and implemented a comprehensive online monitoring system for construction drawing generation, covering business background, technical architecture analysis, metric definition, monitoring methods, and the resulting dashboards that improve quality, stability, and rapid issue resolution.

Metrics, Operations, construction drawing
0 likes · 10 min read
Full-Stack Internet Architecture
Jul 12, 2020 · Operations

Monitoring Practices for Low‑Frequency Financial Services: Lessons from E‑commerce and Reliable Alerting Techniques

This article shares practical monitoring strategies for financial services with low‑frequency operations, contrasting e‑commerce monitoring methods, outlining the challenges of financial monitoring, and presenting reliable solutions such as success‑rate alerts, aspect‑oriented exception handling with whitelists, and circuit‑breaker degradation using Sentinel.

Aspect Oriented Programming, Circuit Breaker, Financial Services
0 likes · 14 min read
Full-Stack DevOps & Kubernetes
Jul 9, 2020 · Cloud Native

Deploy and Manage Prometheus Operator on Kubernetes: A Step‑by‑Step Guide

This article explains what the Prometheus Operator is, how it extends Kubernetes with custom resources, lists the CRDs it provides, and walks through a complete deployment—including cloning the source, creating a monitoring namespace, applying RBAC, installing the operator, creating a Prometheus instance, configuring ServiceMonitor, and troubleshooting common permission errors—using concrete YAML manifests and kubectl commands.

Kubernetes, Prometheus Operator, RBAC
0 likes · 18 min read
HaoDF Tech Team
Jul 8, 2020 · Operations

How We Rebuilt Our Monitoring System into a Scalable Alert Service

After two months of intensive development, the team launched a new monitoring and alerting platform that transforms a legacy system into a service‑oriented solution, addressing pain points such as inflexible escalation, noisy alerts, and poor ownership while introducing phone alerts, automated escalation, Prometheus integration, and a unified rule engine.

DevOps, Prometheus, System Design
0 likes · 16 min read
ITPUB
Jul 7, 2020 · Operations

Top 2020 DevOps Tools: A Complete Guide to Building Your CI/CD Stack

This article categorizes the most popular 2020 DevOps tools across development, testing, deployment, runtime, and collaboration, explains why each tool leads its class, lists key advantages and competitors, and offers a practical checklist for assembling a full CI/CD pipeline.

Collaboration, DevOps, automation
0 likes · 24 min read
ITPUB
Jul 5, 2020 · Operations

2020’s Best DevOps Tools by Category – From CI/CD to Collaboration

This article categorises the most popular 2020 DevOps tools—development/build, automated testing, deployment, runtime, and collaboration—explains why each tool topped its class, lists key advantages, and compares notable competitors to help teams build a complete CI/CD pipeline.

Collaboration, automation, monitoring
0 likes · 27 min read
dbaplus Community
Jul 2, 2020 · Information Security

How 58 Daojia Secures Data in the DT Era: Threats, Practices, and Lessons

This article summarizes Liu Huan's presentation on data security in the DT era, covering the current security landscape, internal and external threats to enterprise data, and 58 Daojia's practical approaches to data discovery, classification, authentication, monitoring, and incident response.

DT era, Data Security, enterprise security
0 likes · 14 min read
Full-Stack DevOps & Kubernetes
Jul 1, 2020 · Cloud Native

How to Install and Configure mysql_exporter on a Kubernetes Master Node

This guide walks through downloading the mysql_exporter package, extracting it on a Kubernetes master, installing the binary, creating a dedicated MySQL user with proper permissions, configuring a password‑less client file, launching the exporter, and updating Prometheus via kubectl so MySQL metrics are exposed on port 9104.

DevOps, Kubernetes, cloud-native
0 likes · 4 min read
Top Architect
Jul 1, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Key Components

Microservices, introduced in 2012 and popularized by Martin Fowler, decompose applications into small, independent services that communicate via lightweight protocols, enabling modular development, flexible technology choices, independent deployment, and improved scalability, while also introducing challenges such as distributed data consistency, testing complexity, and operational overhead.

Backend Architecture, Configuration Management, api-gateway
0 likes · 16 min read
dbaplus Community
Jun 28, 2020 · Databases

How to Build a Visual MongoDB Slow Query Dashboard with PHP

This guide explains how to set up a PHP‑based web platform that collects MongoDB slow‑query logs via remote profiling, stores them in MySQL, and visualizes the data, including installation of required PHP extensions, database preparation, configuration, cron scheduling, and enabling profiling on MongoDB.

MongoDB, PHP, monitoring
0 likes · 7 min read
Qunar Tech Salon
Jun 23, 2020 · Operations

A Simple Gray Release Solution for High‑Concurrency Flight Ticket Systems

This article presents a lightweight gray release approach for complex flight ticket services, comparing traditional hardware and soft‑routing isolation methods, describing the authors' traffic‑based gray identification, business‑focused monitoring, implementation details, and automated safeguards to enable safe incremental deployments.

Backend, Gray Release, Operations
0 likes · 8 min read
Aikesheng Open Source Community
Jun 22, 2020 · Operations

Introduction to the Prometheus Data Collection Process

This article explains the complete Prometheus data collection workflow, covering key concepts such as targets, samples, and meta labels, detailing the relabeling steps, configuration options, example use‑cases, and the final scrape and storage phases for effective monitoring.

Data Collection, Prometheus, configuration
0 likes · 8 min read
JD Retail Technology
Jun 17, 2020 · Operations

How JD’s Data Platforms Scaled for the 618 Mega‑Sale: Operations, Stress‑Testing, and Dual‑Stream Architecture

The article details JD’s data product teams’ systematic preparation for the 618 shopping festival, covering pressure estimation, capacity expansion, stress testing, emergency downgrade strategies, dual‑data‑center isolation, high‑fidelity end‑to‑end testing, and continuous monitoring to ensure stable, real‑time data services during massive traffic spikes.

Big Data, Data Platform, JD.com
0 likes · 10 min read
Xianyu Technology
Jun 17, 2020 · Backend Development

Lottery System Risk Management and SDK Integration

Xianyu mitigated lottery‑related financial loss by centralizing rights management, decoupling UI from business logic, and providing a unified SDK with simple draw APIs, while adding real‑time log backflow, comprehensive accounting and monitoring, cutting configuration time by over 50 % and eliminating UI‑only risk.

Backend, Lottery System, SDK
0 likes · 10 min read
Laravel Tech Community
Jun 16, 2020 · Mobile Development

Kuaishou’s APM Platform and Mobile Performance Optimization: Insights from Yang Kai

In a mobile‑first world where limited device resources and unstable networks threaten user retention, Kuaishou’s performance team built an APM monitoring platform and applied systematic memory, startup, and jank optimizations that cut startup time by 40%, reduced package size by 23 MB, and significantly improved key product metrics.

APM, Kuaishou, Performance Optimization
0 likes · 9 min read
Liangxu Linux
Jun 13, 2020 · Operations

Mastering Monitoring: From Basics to Advanced Zabbix Practices

This comprehensive guide explains why monitoring is essential for operations, outlines monitoring goals and methods, reviews a wide range of open‑source tools, details a Zabbix‑based workflow, enumerates key metrics across hardware, system, application, network, security and business layers, and offers practical alerting and interview tips.

Operations, alerting, log analysis
0 likes · 21 min read
JD Retail Technology
Jun 10, 2020 · Operations

Logistics R&D Preparation for the 618 Promotion: System Readiness, Stress Testing, and Real‑Time Monitoring

The logistics R&D team spent 62 days preparing for the 618 promotion by analyzing core processes, applying stress tests, implementing fault‑tolerant architectures, planning capacity, and deploying real‑time monitoring tools to ensure system stability and performance under peak traffic.

Operations, System Design, capacity planning
0 likes · 7 min read
Manbang Technology Team
Jun 8, 2020 · Cloud Native

Design and Implementation of a Zookeeper Operator for Kubernetes

This article outlines the design, functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault‑tolerance, and upgrade strategies of a Zookeeper operator on Kubernetes, including code examples, service configurations, and integration with Prometheus and OAM standards.

CRD, Kubernetes, Operator
0 likes · 18 min read
Efficient Ops
Jun 3, 2020 · Operations

Understanding Kubernetes vs VM Monitoring: CPU, Memory, Disk & Network

This article compares monitoring metrics for CPU, memory, disk, and network between traditional KVM-based servers and Kubernetes pods, explaining why their indicators differ, how resource isolation works, and what key metrics users should watch to diagnose performance bottlenecks.

CPU, Kubernetes, Memory
0 likes · 11 min read
iQIYI Technical Product Team
May 29, 2020 · Big Data

iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation

iQiyi’s full‑link automated monitoring platform unifies tracing, metric and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts and rapid root‑cause diagnosis that cut error‑lookup time by over 50 % and now serves as a core component of the company’s microservice reference architecture.

Big Data, Logging, Metrics
0 likes · 12 min read
FunTester
May 26, 2020 · Fundamentals

Understanding Load Testing: Key Strategies and Best Practices

This article clarifies common misconceptions about load testing, defines it within performance testing, and provides practical strategies for test volume, load generators, scripting, think time, ramp-up/down, monitoring, diagnosis, and data analysis to ensure reliable performance assessments.

Test Strategy, monitoring, software testing
0 likes · 11 min read
dbaplus Community
May 25, 2020 · Operations

Scaling CAT Monitoring at Ctrip: Thread Model, Client Computation & Memory Tweaks

This article details how Ctrip optimized the CAT monitoring system—covering its large‑scale deployment, thread‑model redesign, offloading calculations to clients, double‑buffered reporting, and string handling improvements—to dramatically cut CPU usage, GC pressure, and memory consumption while handling billions of messages daily.

Distributed Systems, Performance Optimization, Thread Model
0 likes · 25 min read
Programmer DD
May 22, 2020 · Operations

Grafana 7.0 Released: New UX, Plugin Platform, Transformations & CloudWatch Support

Grafana 7.0 introduces a revamped user experience, a unified data model, a new plugin platform, Jaeger tracing support, powerful data transformations, AWS CloudWatch Logs integration, and enterprise usage analytics, offering enhanced visualization and monitoring capabilities across major data sources.

Data visualization, Grafana, Observability
0 likes · 3 min read
Top Architect
May 21, 2020 · Backend Development

Comprehensive Guide to Java Application Performance Optimization and Diagnosis

This article provides an in‑depth overview of Java application performance optimization, covering a four‑layer model (application, database, framework, JVM), on‑site and post‑mortem analysis methods, OS and JVM diagnostic tools, common code and GC issues, database deadlock handling, and practical tuning recommendations.

Database Tuning, JVM, Performance Optimization
0 likes · 23 min read
Efficient Ops
May 20, 2020 · Operations

How to Build a Sustainable CMDB: Three Essential Phases for Reliable Operations

This article explains how to design, implement, and maintain a robust Configuration Management Database (CMDB) by focusing on simple modeling, establishing data closure loops, and efficiently handling existing inventory, while leveraging Kafka, Flink, Elasticsearch, and Neo4j for fast querying and topology visualization.

CMDB, Configuration Management, automation
0 likes · 19 min read
Efficient Ops
May 19, 2020 · Cloud Native

Mastering Prometheus on Kubernetes: Practical Tips, Exporter Guide, and Capacity Planning

This article explores the history and principles of Prometheus monitoring, offers guidance on version selection, highlights its limitations, details common Kubernetes exporters, shows Grafana dashboard setups, and provides in‑depth strategies for exporter aggregation, golden metrics, multi‑cluster scraping, GPU monitoring, timezone handling, memory optimization, capacity planning, and rate calculations.

Grafana, Kubernetes, Prometheus
0 likes · 19 min read
HomeTech
May 14, 2020 · Cloud Native

Design and Implementation of the Next‑Generation Cloud‑Native Monitoring System at Autohome

The article describes Autohome's third‑generation cloud‑native monitoring platform, detailing its background, strategic goals for R&D efficiency, mobile‑first design, Prometheus‑based architecture with multi‑replica storage and InfluxDB remote storage, its operational impact, and future directions such as AI‑driven noise reduction.

Containers, cloud-native, monitoring
0 likes · 7 min read
Programmer DD
May 12, 2020 · Operations

Boost RabbitMQ Reliability: Proven Strategies for Producers, Consumers, and Ops

This comprehensive guide explains how to enhance RabbitMQ reliability by covering confirmation mechanisms, producer and consumer configurations, queue mirroring, alerting, monitoring metrics, and health‑check commands, providing actionable steps for developers and operations teams to ensure stable message delivery.

Message queue · Operations · RabbitMQ
0 likes · 22 min read
MaGe Linux Operations
May 10, 2020 · Databases

How to Build a Complete MySQL Monitoring Dashboard with Prometheus and Grafana

This guide walks through deploying mysqld_exporter, configuring Prometheus and Grafana, and monitoring essential MySQL metrics such as replication health, query throughput, slow‑query counts, connection usage, and InnoDB buffer‑pool statistics, while also showing how to set up alert rules for proactive database operations.

Exporters · Grafana · MySQL
0 likes · 15 min read
ITPUB
May 3, 2020 · Operations

Mastering IT Monitoring: Goals, Methods, Tools, and Best Practices

This comprehensive guide explains why monitoring is essential for reliable operations, outlines clear monitoring objectives, walks through practical monitoring methods, compares popular open‑source tools, details a Zabbix‑based workflow, and lists key hardware, system, application, network, security, API, performance, and business metrics to track.

IT infrastructure · Operations · monitoring
0 likes · 19 min read
Laravel Tech Community
May 2, 2020 · Operations

Comprehensive MySQL and Linux Operations Interview Guide

This guide compiles essential MySQL security steps, master‑slave replication principles, backup scripts, a Linux boot overview, common port services, virus mitigation, monitoring tools, nginx optimization, InnoDB lock troubleshooting, replication lag reduction, high‑availability components, data migration utilities, and automated configuration management techniques for operations engineers.

Database · Linux · MySQL
0 likes · 13 min read
Top Architect
May 1, 2020 · Operations

Comprehensive Guide to Java Runtime Error Diagnosis: CPU, Memory, Disk, GC, and Network Troubleshooting

This article presents a systematic approach to diagnosing and resolving Java runtime problems by examining CPU usage, disk I/O, memory consumption, garbage‑collection behavior, and network anomalies, offering practical commands, analysis techniques, and visual aids to pinpoint root causes in production environments.

Operations · Performance · Troubleshooting
0 likes · 22 min read
Liangxu Linux
Apr 29, 2020 · Operations

How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices

This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.

Metrics · alerting · monitoring
0 likes · 22 min read
vivo Internet Technology
Apr 29, 2020 · Cloud Native

Prometheus Architecture and Design Principles: A Deep Dive into Cloud-Native Monitoring

Prometheus, a CNCF‑graduated, cloud‑native monitoring system, combines pull‑based target discovery, a label‑rich time‑series data model, and four core metric types—gauge, counter, histogram, and summary—to provide near‑real‑time visibility, short‑term retention, alerting via AlertManager, and integration with Grafana and remote storage for scalable observability.

Alertmanager · CNCF · DevOps
0 likes · 11 min read
Qunhe Technology Quality Tech
Apr 29, 2020 · Operations

How Our Team Built a Stable SIT Environment: Lessons in Test Environment Governance

This article documents the step‑by‑step practices of a six‑person test‑environment availability team that unified middleware, streamlined deployment pipelines, piloted business usage, introduced monitoring and recovery mechanisms, and created a comprehensive SIT environment handbook to improve integration testing stability and operational efficiency.

Operations · SIT · deployment
0 likes · 19 min read
UCloud Tech
Apr 28, 2020 · Cloud Native

How We Built a Highly Available Kubernetes Platform for Multi‑Cluster Deployments

This article explains why Kubernetes was chosen, describes the overall architecture, high‑availability master design, multi‑IDC cluster deployment, logging, monitoring, service exposure, image building, lifecycle hooks, CI/CD, multi‑cluster management, encountered challenges, and future plans for operators and automated scaling.

Kubernetes · Multi-Cluster · ci/cd
0 likes · 11 min read
dbaplus Community
Apr 22, 2020 · Operations

How 58 Daojia Built a Cloud‑Native Ops Platform to Streamline Migration and Cut Costs

This article recounts 58 Daojia’s four‑year journey from migrating its IDC infrastructure to public cloud, the challenges encountered, and how the team designed and evolved a multi‑generation operations platform that centralizes asset, cost, domain, and monitoring management, ultimately improving efficiency and reducing expenses.

Cost Management · asset management · cloud migration
0 likes · 14 min read
21CTO
Apr 16, 2020 · Backend Development

How JD’s API Gateway Handles Tens of Millions of Concurrent Requests

This article explains how JD Retail built a high‑performance, secure, and observable API gateway that supports massive traffic, implements asynchronous processing for high concurrency, provides fine‑grained traffic control, gray‑release capabilities, and automated operations to serve native, web, and mini‑program clients.

Gray Release · Security · api-gateway
0 likes · 10 min read
FunTester
Apr 14, 2020 · Operations

Spot Performance Problems Without Writing a Single Line of Code

Experienced developers can often identify performance bottlenecks simply by reviewing code implementations, configuration settings such as timeouts, intervals, database and Redis parameters, as well as service monitoring data, container and JVM configurations, allowing them to avoid unnecessary test scripts and code changes.

DevOps · Operations · Optimization
0 likes · 2 min read
Cloud Native Technology Community
Apr 8, 2020 · Operations

Decoding Thanos Architecture: From Query to Compact for Scalable Monitoring

This article provides a detailed analysis of Thanos' architecture, explaining each core component—Query, Sidecar, Store Gateway, Ruler, Compact, and the upcoming Receiver—how they enable global view, high availability, and long‑term storage for distributed Prometheus deployments, and discusses design trade‑offs and optimization strategies.

Long‑term Storage · Observability · Prometheus
0 likes · 12 min read
Ops Development Stories
Apr 8, 2020 · Operations

Deploy Zabbix Monitoring with Docker and Docker‑Compose on CentOS

This guide walks through preparing a CentOS 7 host, installing Docker, configuring a Zabbix server and MySQL containers, and optionally using docker‑compose to set up Zabbix components, including the web UI and agent, with detailed commands and volume mappings for persistent monitoring.

CentOS · Docker · docker-compose
0 likes · 18 min read