Tagged articles
983 articles
Baidu Geek Talk
Feb 20, 2023 · Operations

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines logging’s critical role in distributed systems, detailing its purpose, severity levels, and value for debugging, performance, security, and auditing, while highlighting challenges of inconsistent formats and traceability, and reviewing observability pillars, ELK and tracing tools, and practical implementation best practices.

APM · ELK · Logging
0 likes · 19 min read
Alibaba Cloud Native
Feb 8, 2023 · Cloud Native

Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.

Cloud Native · Metrics · Observability
0 likes · 30 min read
Cloud Native Technology Community
Feb 8, 2023 · Operations

FinOps Core Principles and the Rationale for Left‑Shift in Cloud Cost Management

The article explains how DevOps teams can adopt FinOps principles and a left‑shift approach—combining static and dynamic logging, fostering cross‑team collaboration, and integrating cost awareness into the software development lifecycle—to reduce cloud expenses, improve MTTR, and drive sustainable engineering productivity.

Cloud Cost · DevOps · FinOps
0 likes · 10 min read
dbaplus Community
Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

Observability · Operations · System architecture
0 likes · 18 min read
Tencent Cloud Developer
Feb 3, 2023 · Cloud Computing

Cloud Load Testing: Strategies, Scenarios, and Practice Cases for High‑Traffic Events

Tencent’s cloud load‑testing platform simulates massive Chinese‑New‑Year traffic by offering concurrency and RPS modes, multi‑language test authoring, realistic data generation, and unified OpenTelemetry reporting, enabling early bottleneck detection, proactive scaling, and successful high‑load drills for services such as Mobile QQ and video.

JavaScript · Observability · cloud testing
0 likes · 23 min read
Open Source Linux
Feb 3, 2023 · Cloud Native

Why eBPF Is the Secret Weapon Behind Modern Cloud‑Native Platforms

This article explains how eBPF extends kernel functionality to enable secure, high‑performance networking, observability, and programmable workloads in cloud‑native environments, detailing its architecture, use cases, market adoption, commercialization models, and the challenges and advantages that have led to its being described as JavaScript for the kernel.

Cloud Native · Linux · Observability
0 likes · 12 min read
Architects Research Society
Feb 2, 2023 · Backend Development

Medium’s Journey to Microservices: Principles, Strategies, and Lessons Learned

This article explains why Medium transitioned from a monolithic Node.js application to a microservice architecture, outlines the core design principles, shares practical strategies for building, deploying, and observing services, and warns about common pitfalls such as the microservice syndrome.

Backend Development · Deployment · Observability
0 likes · 23 min read
ITPUB
Jan 31, 2023 · Databases

How Pigsty Turns PostgreSQL into a Cost‑Effective Open‑Source RDS Alternative

Pigsty is an open‑source platform that upgrades PostgreSQL across six dimensions—observability, reliability, availability, maintainability, extensibility, and interoperability—delivering enterprise‑grade features, built‑in monitoring, automatic failover, backup, and performance tuning while cutting cloud database costs dramatically.

Observability · Open Source · PostgreSQL
0 likes · 22 min read
dbaplus Community
Jan 26, 2023 · Operations

Unified Metrics, Tracing, and Logging: A Financial Firm’s Path to Microservice Observability

Facing the challenges of distributed microservice architectures, a financial services company implemented a unified observability platform that combines metrics, tracing, and logging via OpenTelemetry and custom agents, enabling real‑time visualization, anomaly detection, and performance analysis across seven core business middle‑platforms.

Logging · Metrics · Observability
0 likes · 17 min read
MaGe Linux Operations
Jan 23, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins in Modern Environments?

This article compares Prometheus and Zabbix, detailing their histories, architectures, performance, community support, and suitability for different environments, and concludes with guidance on choosing the right monitoring solution for physical servers, cloud-native deployments, and large‑scale container clusters.

Cloud Native · Observability · zabbix
0 likes · 9 min read
NetEase Yanxuan Technology Product Team
Jan 16, 2023 · Backend Development

Design and Implementation of a Business‑Facing Message Center Management Platform

The platform centralizes message‑center management for e‑commerce by adding end‑to‑end tracing, real‑time metrics, and unified logging, enabling business users to query message links, view dashboards, and automate retries and approvals, which dramatically reduces manual monitoring, raises completion rates above 90%, and paves the way for cost‑optimized, data‑driven operations.

DevOps · Logging · Metrics
0 likes · 15 min read
Code Ape Tech Column
Jan 14, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Pros and Cons

This article provides a detailed comparison of nine popular log management solutions—including Filebeat, Graylog, LogDNA, the ELK stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing models, advantages, and disadvantages to help readers choose the right tool for their needs.

ELK · Log Management · Observability
0 likes · 16 min read
Su San Talks Tech
Jan 13, 2023 · Operations

How Distributed Tracing with SkyWalking Solves Microservice Performance Mysteries

This article explains the principles, architecture, and practical implementation of distributed tracing—especially SkyWalking—in microservice environments, showing how it identifies call chains, isolates performance bottlenecks, and integrates with existing monitoring systems while maintaining low overhead and non‑intrusive instrumentation.

JavaAgent · Observability · distributed tracing
0 likes · 20 min read
Top Architect
Jan 6, 2023 · Operations

Understanding Distributed Tracing and SkyWalking: Principles, Architecture, and Performance

This article explains the concept of distributed tracing, its importance in micro‑service architectures, the OpenTracing standard, and how SkyWalking implements automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance optimizations to provide low‑overhead observability for backend systems.

Observability · OpenTracing · distributed tracing
0 likes · 12 min read
Tencent Cloud Developer
Jan 5, 2023 · Cloud Native

QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

Distributed Systems · Observability · chaos engineering
0 likes · 24 min read
Architecture & Thinking
Jan 5, 2023 · Operations

How Critical Path Tracing Cuts Latency in Large Distributed Systems

This article explains why latency analysis is crucial for user experience in large distributed services, reviews common methods such as RPC monitoring, CPU profiling, and distributed tracing, and then dives deep into the principles, implementation, aggregation, storage, and visualization of critical path analysis, showcasing its practical impact in Baidu's App recommendation platform.

Latency analysis · Observability · critical path tracing
0 likes · 15 min read
Alibaba Terminal Technology
Jan 5, 2023 · Mobile Development

Why Mobile Trace Is Hard and How OpenTelemetry Solves It

This article explores the challenges of end‑to‑end tracing on mobile apps, explains why issues are hard to reproduce, and presents a four‑step solution using a unified OpenTelemetry standard, automated data linking, performance optimizations, and machine‑learning‑driven root‑cause analysis.

Android · Observability · OpenTelemetry
0 likes · 20 min read
Architecture Digest
Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AIOps‑enabled system that ensures service availability across massive infrastructure.

Observability · Platform · Reliability
0 likes · 16 min read
Efficient Ops
Dec 29, 2022 · Operations

How eBay Scales Its Event Platform with ClickHouse and Kubernetes

This article details eBay's event platform architecture, explaining why a dedicated event system is needed, how ClickHouse provides high‑performance storage, the use of Kubernetes CRDs for cross‑region high availability, data routing, read/write separation, and query optimizations with LogQL.

ClickHouse · Event Platform · Kubernetes
0 likes · 18 min read
Meituan Technology Team
Dec 29, 2022 · Artificial Intelligence

Top 20 Most Popular Meituan Tech Blog Articles of 2022

Meituan’s technology team highlights its twenty most‑read 2022 blog posts, spanning observability, system design, data governance, AI, cloud‑native engineering, and practical innovations such as visual log tracing, Kafka scaling, functional programming, Elasticsearch optimization, CI/CD pipelines, and advanced object‑detection frameworks.

2022 Highlights · Artificial Intelligence · Data Governance
0 likes · 13 min read
Tencent Cloud Developer
Dec 28, 2022 · Operations

Technical Architecture, Observability, and Operational Practices of Tencent Health Code System

The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.

Health Code · Observability · Operations
0 likes · 16 min read
IT Architects Alliance
Dec 24, 2022 · Operations

Unlocking Linux Observability: A Hands‑On Guide to eBPF with Real‑World Examples

This article introduces eBPF, explains its origins and how it extends BPF for kernel‑level observability, compares it with SystemTap and DTrace, outlines common use cases, details its loading‑compile‑execute workflow, and provides step‑by‑step Python/BCC examples with installation instructions and advanced latency measurement code.

BCC · Linux · Observability
0 likes · 21 min read
ITPUB
Dec 20, 2022 · Operations

How We Scaled SkyWalking to Billions of Segments: A Full‑Stack Monitoring Journey

This article recounts a year‑long, hands‑on experience of deploying and continuously optimizing Apache SkyWalking for full‑link monitoring in a large micro‑service environment, covering the motivations, architecture choices, pre‑research, POC integration, and a series of performance‑tuning steps that kept query latency at the millisecond level as stored segments grew into the billions.

APM · Full-Stack Monitoring · Observability
0 likes · 21 min read
Inke Technology
Dec 19, 2022 · Backend Development

How to Build a Highly Available, Stable, and Observable SMS Service

This article explains how to design a high‑availability SMS system by identifying stability bottlenecks, defining reliability goals, implementing failover strategies for Redis, MySQL and external services, establishing a comprehensive observability framework, and measuring key quality metrics to ensure 99.99% uptime.

Metrics · Observability · SMS
0 likes · 11 min read
DeWu Technology
Dec 5, 2022 · Operations

Evolution of Application Monitoring at DeWu: From CAT to OpenTelemetry

After rebuilding its transaction system in 2020, DeWu progressed from the basic CAT monitoring tool to OpenTracing with Prometheus, and finally adopted OpenTelemetry to unify metrics, traces, and logs via a custom vmagent‑Kafka‑Flink pipeline, dynamic sampling, and extensible Java agents, positioning the platform for a performance‑analysis‑driven future.

CAT · Observability · OpenTelemetry
0 likes · 18 min read
ITPUB
Dec 4, 2022 · Databases

Can National Standards Accelerate the Growth of China's Domestic Databases?

The article examines whether establishing national standards for Chinese domestic databases can foster industry development, weighing the risks of over‑regulation against the benefits of standardized observability, data‑dictionary, cloud‑integration, and programming interfaces, while sharing real‑world migration experiences.

Chinese databases · Database Standards · Observability
0 likes · 11 min read
Efficient Ops
Dec 1, 2022 · Operations

Why Choose Loki Over ELK? A Hands‑On Guide to Deploying and Using Grafana Loki

This article explains the motivations for selecting Grafana Loki instead of ELK/EFK, introduces its core concepts and features, provides step‑by‑step deployment instructions for Promtail and Loki, and demonstrates how to configure Grafana, query logs, and handle label indexing, dynamic tags, and high‑cardinality challenges.

Grafana · Kubernetes · Loki
0 likes · 15 min read
DataFunTalk
Nov 27, 2022 · Operations

Best Practices for Full‑Stack Operations Monitoring and Cost Reduction Using Alibaba Cloud Elasticsearch

This article presents a comprehensive, three‑part guide on the current state of full‑stack operations monitoring, common challenges and solutions, and a real‑world use case, illustrating how Alibaba Cloud Elasticsearch can improve observability, boost performance, and cut costs for complex distributed systems.

Elasticsearch · Observability · Operations
0 likes · 13 min read
Programmer DD
Nov 23, 2022 · Backend Development

Spring Boot 3.0.0: Key Updates and How to Get Ready

The article outlines the recent Spring 6.0 release, the cascade of updates across major Spring projects, and previews the upcoming Spring Boot 3.0.0, highlighting the first RC, the new aot.factories feature, and enhanced observability for Java developers.

Observability · Spring 6 · spring-boot
0 likes · 3 min read
ByteDance Terminal Technology
Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big Data · Graph Database · Observability
0 likes · 21 min read
Alibaba Cloud Native
Nov 17, 2022 · Cloud Native

How RocketMQ Harnesses Prometheus for Full‑Stack Observability

This article explains how RocketMQ integrates with Prometheus and Grafana to provide comprehensive metrics, tracing, and logging, detailing the exporter architecture, deployment choices, span topology, dashboard examples, and ARMS‑based alerting for cloud‑native message‑queue observability.

ARMS · Cloud Native · Metrics
0 likes · 14 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read
DevOps Cloud Academy
Nov 13, 2022 · Cloud Native

Grafana Phlare: Open‑Source Continuous Profiling Database – Architecture, Features, and Kubernetes Deployment Guide

Grafana Phlare is an open‑source, horizontally scalable continuous profiling database that integrates with Grafana, offering easy installation, multi‑tenant support, and object‑storage‑backed long‑term storage, with detailed deployment instructions for both monolithic and micro‑service modes on Kubernetes using Helm.

Continuous Profiling · Grafana · Kubernetes
0 likes · 11 min read
Open Source Linux
Nov 7, 2022 · Cloud Native

Unlock Scalable Cloud‑Native Alerting with Grafana Mimir: Architecture & Setup

This article explains the current state of cloud‑native alerting, introduces Grafana Mimir as a horizontally scalable, multi‑tenant storage for Prometheus, details its architecture and components, and provides step‑by‑step guidance for installing, configuring, and operating Mimir in Kubernetes environments.

Cloud Native · Kubernetes · Mimir
0 likes · 24 min read
政采云技术
Nov 7, 2022 · Cloud Native

Deployment Architecture of a Government Procurement Cloud Platform Based on Dragonfly OS

The article details Zhengcaiyun's government procurement cloud platform, its large‑scale architecture, migration to the domestically‑adapted Dragonfly operating system, integrated cloud‑native operations, observability built on OpenTelemetry, and ongoing efforts to enhance security, performance, and ecosystem collaboration.

Dragonfly OS · Observability · cloud-native
0 likes · 7 min read
政采云技术
Nov 7, 2022 · Cloud Native

Deployment Architecture of a Government Procurement Cloud Platform Based on the Longxi Operating System

The article details Zhengcaiyun's government procurement cloud platform, its large‑scale deployment architecture built on the Longxi OS, covering cloud‑native design, domestic adaptation, observability, and operational practices that enable high‑performance, secure, and scalable public procurement services.

Observability · cloud-native · eBPF
0 likes · 6 min read
Alibaba Cloud Native
Nov 3, 2022 · Cloud Native

How to Leverage Alibaba Cloud Prometheus for Fine‑Grained Cloud Product Monitoring

This guide explains why native cloud monitoring falls short, how building custom Prometheus exporters adds overhead, and how Alibaba Cloud's fully managed Prometheus service—through enterprise cloud‑monitoring and self‑monitoring integration modes—provides ready‑to‑use exporters, agents, Grafana dashboards, and alert templates for dozens of cloud products.

Alibaba Cloud · Cloud Native · Grafana
0 likes · 12 min read
Mike Chen's Internet Architecture
Oct 24, 2022 · Backend Development

Understanding Zipkin: Principles, Architecture, Core Components, and Deployment for Distributed Tracing

This article explains why Zipkin is needed for microservice observability, describes its architecture, core components, trace and span model, workflow, and provides step‑by‑step Docker and JAR deployment instructions, helping developers quickly locate service bottlenecks and failures.

Backend Development · Observability · distributed tracing
0 likes · 7 min read
Software Development Quality
Oct 23, 2022 · Operations

Top Observability Tools: Datadog, Grafana, Instana, New Relic, Prometheus

This article provides an overview of five leading observability solutions—Datadog, Grafana, Instana, New Relic, and Prometheus—detailing their core features, supported data sources, deployment models, and how they help teams monitor cloud‑native applications, infrastructure, and services to ensure reliability and performance.

DevOps · Observability · cloud-native
0 likes · 4 min read
Programmer DD
Oct 21, 2022 · Cloud Native

How Grafana Mimir Transforms Cloud‑Native Monitoring and Alerting

This article explains how Grafana Mimir provides a scalable, highly‑available, multi‑tenant long‑term storage for Prometheus, details its architecture and core components such as compactor, distributor, ingester, querier, query‑frontend and store‑gateway, and shows step‑by‑step installation, status checking, and Alertmanager configuration for cloud‑native environments.

Alertmanager · Cloud Native Monitoring · Grafana Mimir
0 likes · 22 min read
Alibaba Cloud Native
Oct 19, 2022 · Cloud Native

How to Monitor Non‑Kubernetes ECS Apps with Alibaba Cloud Managed Prometheus

This guide explains how to use Alibaba Cloud's fully managed Prometheus service to collect and visualize metrics from ECS‑based applications across pure VPC, hybrid VPC‑IDC, and multi‑cloud scenarios, detailing the pain points of self‑built solutions and providing step‑by‑step configuration instructions.

Alibaba Cloud · ECS · Observability
0 likes · 11 min read
Top Architect
Oct 18, 2022 · Operations

Apache SkyWalking APM: Concepts, Docker Installation, and UI Guide

This article introduces Application Performance Management (APM), explains the features of Apache SkyWalking for micro‑service and cloud‑native monitoring, and provides step‑by‑step Docker‑compose installation, agent configuration, and a detailed walkthrough of the SkyWalking UI components.

APM · Docker · Observability
0 likes · 13 min read
DeWu Technology
Oct 17, 2022 · Operations

High Availability: Principles and Practices for System Stability

High availability—measured in nines of uptime—requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, capacity planning, rapid scaling, traffic shaping, resource isolation, global protection, observability, and disciplined change management to achieve stable, resilient services.

Observability · capacity planning · change management
0 likes · 10 min read
Cloud Native Technology Community
Oct 17, 2022 · Cloud Native

A Three‑Step Approach to Understanding, Managing, and Preventing Kubernetes Failures

This article presents a practical three‑step methodology—understanding, managing, and preventing—to troubleshoot Kubernetes deployments, explains how to leverage monitoring, observability, and incident‑response tools, and offers guidance on fostering team collaboration and building resilient, self‑healing cloud‑native systems.

Cloud Native · Kubernetes · Observability
0 likes · 7 min read
Rare Earth Juejin Tech Community
Oct 8, 2022 · Operations

Complete Solution for Sentry Error and Performance Monitoring in Qiankun Micro‑Frontend Architecture

This article presents a complete solution for routing Sentry error and performance data to the correct micro‑frontend projects in a Qiankun architecture by intercepting transport, redistributing URLs, and distinguishing transaction types, with detailed code examples for both Sentry 6.x and 7.x versions.

JavaScript · Micro‑frontend · Observability
0 likes · 10 min read
Alibaba Cloud Native
Oct 4, 2022 · Cloud Native

How Service Mesh Redefines Cloud‑Native Networking, Security, and Observability

This article explains the fundamentals of service mesh as a cloud‑native infrastructure layer, covering its control‑plane and data‑plane architecture, sidecar and waypoint proxies, L4/L7 decoupling, eBPF acceleration, zero‑trust security, traffic management, observability, and real‑world deployment scenarios.

Cloud Native · Kubernetes · Observability
0 likes · 20 min read
Architects' Tech Alliance
Sep 29, 2022 · Databases

42 Lessons Learned from Building a Production Database (Translation)

This translated article shares 42 practical lessons from Mahesh Balakrishnan’s experience building a production database, covering customer focus, project management, design principles, code review, strategy, observability, and research practices for reliable infrastructure development.

Infrastructure · Observability · Project Management
0 likes · 10 min read
DataFunSummit
Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article presents a comprehensive overview of using Elasticsearch as a time series engine, covering its motivations, challenges, key features, Alibaba Cloud TimeStream optimizations such as columnar storage, LSM structures, downsampling, and integration with Prometheus and Grafana, while also discussing performance and cost considerations.

Big Data · Downsampling · Elasticsearch
0 likes · 15 min read
IT Architects Alliance
Sep 25, 2022 · Backend Development

12 Proven Strategies to Seamlessly Migrate Your Monolith to Microservices

This guide presents twelve practical steps—from understanding the trade‑offs and planning the transition to adopting monorepos, CI pipelines, API gateways, feature flags, and observability—that help teams safely decompose a large monolithic application into a robust microservices architecture.

Observability · Software Architecture · api-gateway
0 likes · 14 min read
Cloud Native Technology Community
Sep 23, 2022 · Cloud Native

What Cloud‑Native Networking Trends Kube‑OVN Reveals and How DeepFlow Enables Full‑Stack Observability

In this technical session, experts from Lingque Cloud and Yunshan Network discuss emerging cloud‑native networking trends through Kube‑OVN, demonstrate DeepFlow's full‑stack observability in Kube‑OVN environments, and answer a wide range of practical Q&A covering IP stability, underlay challenges, CNI support, and performance tuning.

CNI · Cloud Native Networking · DeepFlow
0 likes · 14 min read
IT Architects Alliance
Sep 23, 2022 · Operations

Which APM Tool Wins? A Deep Comparison of Zipkin, SkyWalking, and Pinpoint

This article analyzes full‑link monitoring in micro‑service architectures, outlines the goals and functional modules of tracing systems, explains core concepts such as Span, Trace, and Annotation, and then compares Zipkin, SkyWalking, and Pinpoint across performance impact, scalability, data analysis depth, developer transparency, and topology visualization.

APM · Observability · comparison
0 likes · 27 min read
Big Data Technology Architecture
Sep 17, 2022 · Databases

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

This article describes how Bilibili redesigned its log service by replacing Elasticsearch with ClickHouse, introducing OpenTelemetry‑based logging, optimizing storage, query, and alerting components, and enhancing ClickHouse features such as configuration tuning, Map types, and implicit columns to achieve higher performance, lower cost, and better observability.

ClickHouse · Database Optimization · Observability
0 likes · 28 min read
Bilibili Tech
Sep 16, 2022 · Big Data

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

Bilibili’s Log Service 2.0 replaces its Elastic‑Stack pipeline with an OpenTelemetry‑driven architecture that writes logs via high‑performance Go/Java SDKs to ClickHouse, delivering ten‑fold write throughput, two‑fold query speed, one‑third storage cost, a custom query gateway, visualization UI, and advanced alerting.

ClickHouse · Observability · OpenTelemetry
0 likes · 27 min read
Architect's Guide
Sep 14, 2022 · Backend Development

Architect’s Blueprint: Backend Architecture, Microservices, Message Queues, and Observability

This article presents a comprehensive backend architecture guide covering microservice fundamentals, domain‑driven design, gateway patterns, service registration, configuration centers, observability pillars, service mesh options, and a detailed comparison of major message‑queue technologies.

Backend Architecture · Observability · Service Mesh
0 likes · 27 min read
NetEase Yanxuan Technology Product Team
Sep 13, 2022 · Operations

How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

Full‑Link Tracing · Observability · Operations
0 likes · 23 min read
Alibaba Cloud Developer
Sep 9, 2022 · Information Security

How to Build a Comprehensive Cloud‑Native Kubernetes Security Monitoring System

This article examines the evolving security risks of cloud‑native architectures, explains why traditional perimeter defenses are insufficient, introduces zero‑trust principles for Kubernetes, outlines common K8s threat vectors, and presents a complete data‑collection and monitoring solution based on the open‑source iLogtail agent.

Kubernetes · Observability · Zero Trust
0 likes · 30 min read
Efficient Ops
Sep 7, 2022 · Operations

How DeepFlow Automates Full‑Stack Observability for Cloud‑Native Environments

This article presents DeepFlow, an open‑source, highly automated observability platform that uses eBPF to provide zero‑code AutoMetrics and AutoTracing, integrates with Prometheus, OpenTelemetry and SkyWalking, and enables SRE, DevOps and NewOps teams to build full‑stack metrics and blind‑spot‑free tracing for cloud‑native applications.

DevOps · Metrics · Observability
0 likes · 20 min read
Tencent Cloud Developer
Sep 7, 2022 · Cloud Native

Why Build Probe Capabilities Based on OpenTelemetry for Cloud‑Native Observability

Building probe capabilities on OpenTelemetry gives cloud‑native teams a vendor‑neutral, standardized way to extend monitoring into full observability—supporting large‑scale, language‑specific instrumentation, plug‑and‑play plugins, and seamless integration with APM backends—so developers and operators can detect, debug, and predict faults across distributed containers.

APM · Cloud Native · Node.js
0 likes · 15 min read
Alibaba Cloud Native
Sep 6, 2022 · Cloud Native

What’s New in KubeVela 1.5? Deep Dive into Plugins, Observability, and Cloud Shell

Version 1.5 of the open‑source Cloud Native application delivery platform KubeVela introduces enhanced plugin specifications, built‑in observability with Prometheus‑Grafana, a browser‑based Cloud Shell, advanced Canary rollouts via OpenKruise, multi‑environment UI improvements, and performance optimizations, while moving toward CNCF incubation.

Cloud Native · KubeVela · Multi-Environment
0 likes · 16 min read
dbaplus Community
Sep 5, 2022 · Operations

How EyesTSDB Evolved into a Cloud‑Native, Second‑Level Monitoring Platform

This article details the evolution of NetEase's self‑built time‑series database EyesTSDB into a cloud‑native, second‑level monitoring solution, covering its architecture, core features, integration with VictoriaMetrics, custom plugin workflow, CMDB linkage, real‑world use cases, and future challenges.

CMDB integration · Metrics · Observability
0 likes · 21 min read
DeWu Technology
Sep 2, 2022 · Operations

Design and Implementation of Trace2.0 Distributed Tracing Platform

Trace2.0 is an OpenTelemetry‑based distributed tracing platform that collects billions of spans daily, routes data through a control plane, OTel Server, and Kafka to ClickHouse hot‑cold storage with tail sampling, achieving 66% cost reduction, 12× compression, sub‑second query latency, and plans to offload raw spans to object storage.

Backend Architecture · ClickHouse · Observability
0 likes · 12 min read
Efficient Ops
Aug 31, 2022 · Operations

How Intelligent Operations and Observability Transform Cloud‑Native Environments

In this talk, Wu Yakun from Guance Cloud explains the shortcomings of traditional operations, introduces intelligent, data‑driven approaches for the cloud‑native era, and outlines how unified data collection, observability, and SLO‑based monitoring can dramatically improve fault detection and system reliability.

Intelligent Operations · Observability · SLO
0 likes · 16 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Cloud Computing · Error Budget · Observability
0 likes · 12 min read
Baidu Geek Talk
Aug 22, 2022 · Mobile Development

How Baidu Optimized Low‑End Device Startup Performance: A Deep Dive

This article explains how Baidu's performance team improved startup on low‑end devices as mobile internet growth slowed: defining what counts as a low‑end device, building observability, creating high‑efficiency tooling, redesigning key components such as KV storage and locks, and introducing a smart scheduling framework. Together these measures reduced Android cold‑start TTI by over 50% and iOS cold‑start TTI by more than 40%, and established a continuous anti‑degradation pipeline.

Mobile Development · Observability · Performance Optimization
0 likes · 20 min read
Architect's Guide
Aug 18, 2022 · Databases

42 Lessons Learned from Building a Production Database

This article translates and summarizes Mahesh Balakrishnan’s 42 practical insights on building a production database, covering customer focus, project management, design principles, code review, observability, research, and cultural practices for engineering teams.

Infrastructure · Observability · databases
0 likes · 11 min read
Efficient Ops
Aug 17, 2022 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to build a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, full‑link tracing, and ELK for observability and performance troubleshooting.

Full‑Link Tracing · Observability · Prometheus
0 likes · 13 min read
IT Architects Alliance
Aug 15, 2022 · R&D Management

Essential Practices for Effective Engineering Projects and R&D Management

This article outlines comprehensive guidelines for keeping customers happy, managing projects, designing robust APIs, conducting thorough code reviews, shaping strategic direction, ensuring observability, and fostering research, all aimed at building resilient and high‑performing engineering teams.

Code Review · Observability · Project Management
0 likes · 12 min read
DevOps
Aug 12, 2022 · Operations

9 DevOps Best Practices and Common Anti‑Patterns

This article explains what DevOps is, why it matters, and presents nine practical best‑practice recommendations—including culture, CI/CD, testing, observability, automation, security, and IaC—while also highlighting common anti‑patterns to avoid for successful DevOps adoption.

Anti-Patterns · Automation · Best Practices
0 likes · 13 min read
Huolala Tech
Aug 11, 2022 · Operations

How Huolala Built an AI‑Powered Intelligent Monitoring Platform at Scale

This article details Huolala's journey from basic monitoring to an AI‑driven intelligent observability platform, covering AIOps concepts, a comprehensive monitoring framework, practical implementations, automated alert analysis, lessons learned, and future directions for large‑scale operations.

DevOps · Huolala · Observability
0 likes · 18 min read
Java Architecture Diary
Aug 8, 2022 · Operations

How to Integrate Jaeger Tracing with Rainbond Using OpenTelemetry

This guide explains why distributed tracing is essential for micro‑service architectures, introduces Jaeger as an open‑source APM solution, and provides step‑by‑step instructions for deploying and configuring Jaeger on Rainbond with OpenTelemetry, including environment variables, service naming, and topology generation.

APM · Observability · OpenTelemetry
0 likes · 11 min read
Architecture Digest
Aug 2, 2022 · Cloud Native

Microservice Architecture and Design Patterns: Goals, Principles, and Decomposition Strategies

This article explains the four primary goals of microservice architecture, outlines essential design principles, and details a comprehensive set of decomposition and integration patterns—including business‑function, sub‑domain, transaction, Strangler, Bulkhead, Sidecar, API‑gateway, Aggregator, CQRS, Saga, observability, and deployment patterns—providing practical guidance for building resilient cloud‑native systems.

Cloud Native · Observability · architecture
0 likes · 18 min read
DevOps Cloud Academy
Jul 26, 2022 · Operations

9 DevOps Best Practices: What You Should Do and Not Do

This article outlines nine essential DevOps best practices, from fostering a collaborative, blameless culture to adopting CI/CD, automated testing, observability, and IaC, and highlights common anti‑patterns such as isolated DevOps teams, hero reliance, and unchecked tool sprawl.

Automation · DevOps · Observability
0 likes · 13 min read
dbaplus Community
Jul 24, 2022 · Fundamentals

Meta’s Secret to Near‑Zero Cache Inconsistency

Meta’s engineering team describes how they raised cache consistency from six‑nines to ten‑nines by defining precise invalidation semantics, building the Polaris observability service, and implementing systematic tracking of cache mutations, offering practical strategies that apply to any distributed cache such as Redis or TAO.

Consistency · Meta · Observability
0 likes · 17 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve high‑availability, and build confidence in production by defining stable states, injecting realistic failures, and measuring impact through observability metrics, with practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Observability
0 likes · 11 min read
dbaplus Community
Jul 21, 2022 · Operations

How Huolala Built an AI‑Powered End‑to‑End Monitoring Platform

This article details Huolala's journey from a fragmented monitoring stack to a unified, AI‑enhanced observability platform, covering AIOps concepts, the design of a comprehensive monitoring framework, concrete implementation of metrics, tracing, logging, alerting, and lessons learned for large‑scale operations.

DevOps · Observability · aiops
0 likes · 19 min read
Meituan Technology Team
Jul 21, 2022 · Backend Development

Visualized Full‑Chain Log Tracing for Complex Business Systems

The article analyzes the shortcomings of traditional ELK and distributed tracing for complex business systems, proposes a visualized full‑chain log tracing solution that organizes and dynamically links logs by business chain, and demonstrates its implementation and performance gains at Meituan’s content platform.

DSL · Distributed Systems · Meituan
0 likes · 26 min read
Baidu Geek Talk
Jul 19, 2022 · Cloud Native

How OpenTelemetry and Jaeger Power Cloud‑Native Tracing

This article explains cloud‑native observability, defines its three pillars—metrics, tracing, and logging—details the OpenTelemetry tracing data model and Span structure, reviews industry implementations such as Jaeger and Alibaba Eagle Eye, and shares practical challenges and solutions from real‑world production use.

Alibaba Eagle Eye · Cloud Native · Distributed Systems
0 likes · 11 min read
IT Architects Alliance
Jul 18, 2022 · Operations

Comparison of Prometheus and Zabbix Monitoring Solutions

This article compares Prometheus and Zabbix, outlining their histories, architectures, storage models, configuration complexity, community activity, and suitability for different environments, and concludes with recommendations on when to choose each monitoring system.

Observability · Operations · Prometheus
0 likes · 9 min read
Top Architect
Jul 8, 2022 · Cloud Native

Understanding Service Mesh and Istio: Architecture, Features, and Hands‑On Deployment

This tutorial explains the fundamentals of service mesh, outlines Istio's architecture and core components, walks through installing Istio on Kubernetes, demonstrates a sample microservice deployment with traffic‑management, security, and observability features, and discusses when to adopt a service mesh and its alternatives.

Cloud Native · Istio · Observability
0 likes · 20 min read
Selected Java Interview Questions
Jul 6, 2022 · Operations

Grafana 9.0 New Features and Improvements Overview

Grafana 9.0 introduces a suite of usability enhancements—including a visual Prometheus query builder, a visual Loki LogQL generator, improved Explore‑to‑dashboard workflow, revamped heatmap panel, command palette, panel search, trace panel, navigation upgrades, and alerting refinements—aimed at simplifying observability, data visualization, and operational efficiency.

Grafana · Loki · Observability
0 likes · 7 min read
Alibaba Cloud Native
Jul 5, 2022 · Cloud Native

Unlocking eBPF: How Kernel‑Level Observability Powers Modern Cloud‑Native Apps

This article explains what eBPF is, why it was created, its core characteristics, common use cases such as network optimization, fault diagnosis, security control and performance monitoring, and provides practical step‑by‑step guidance, tooling commands, program types, and ecosystem resources for leveraging eBPF in cloud‑native environments.

Cloud Native · Kubernetes · Linux
0 likes · 20 min read