Tagged articles

983 articles

Oct 28, 2025 · Backend Development

Why Rewriting a Java Microservice in Rust Cut Costs and Boosted Performance

A senior engineer recounts how replacing a noisy Java billing microservice with a lean Rust implementation slashed latency, reduced CPU and memory usage, lowered infrastructure bills, and exposed cultural and organizational challenges, offering a practical roadmap for teams considering similar migrations.

Observability · Rust · Service Migration

0 likes · 11 min read

Rare Earth Juejin Tech Community

Oct 28, 2025 · Backend Development

Mastering Log Practices: From Rookie Mistakes to Expert Observability

This article walks developers through common logging pitfalls, explains three maturity levels of log implementation, and provides concrete Java examples and best‑practice techniques such as structured JSON logs, MDC trace IDs, and log‑bomb avoidance to turn logs into a powerful observability tool.

MDC · Observability · Trace ID

0 likes · 14 min read

Alibaba Cloud Observability

Oct 27, 2025 · Operations

From Data Silos to Intelligent Insights: Building Future‑Ready Operation Intelligence

This article explains how enterprises can transform massive, fragmented operation data—technical, business, and security—into high‑value intelligent signals by unifying storage, enriching context, applying AI, and delivering a single, observable platform that enables proactive, data‑driven decision making.

AI · Data Platform · Observability

0 likes · 18 min read

DevOps Coach

Oct 22, 2025 · Cloud Native

Simplify Scalable Kubernetes Pod Logging with Grafana podLogs

This guide explains how Grafana's podLogs feature, powered by Vector.dev, transforms raw Kubernetes pod logs into enriched, searchable, cluster‑wide observability data, covering why pod‑level logs matter, configuration steps, advanced custom log paths, and practical examples.

Cloud Native · Grafana · Kubernetes

0 likes · 14 min read

IT Architects Alliance

Oct 22, 2025 · Cloud Native

Avoid the Top 5 Cloud Migration Mistakes: Proven Cloud‑Native Strategies

This article analyzes the five most common cloud‑migration pitfalls—lift‑and‑shift, network latency, incomplete data‑architecture transformation, weak security models, and poor observability—offering concrete cloud‑native solutions, migration matrices, code examples, and best‑practice guidelines for successful architectural evolution.

Cloud Native · DevOps · Observability

0 likes · 12 min read

Linux Kernel Journey

Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability

0 likes · 23 min read

Alibaba Cloud Observability

Oct 20, 2025 · Cloud Native

How ‘泡姆泡姆’ Leverages Cloud‑Native Architecture for Global Low‑Latency Gaming

The multiplayer party game 泡姆泡姆 combines colorful shooting, match‑3, physics puzzles and arcade mini‑games, and uses a cloud‑native stack on Alibaba Cloud Container Service with OpenKruiseGame, KEDA‑driven auto‑scaling, multi‑region deployment, zero‑downtime updates and a three‑layer observability platform to deliver seamless low‑latency experiences worldwide.

Game Development · Observability · Scalability

0 likes · 10 min read

JavaGuide

Oct 17, 2025 · Artificial Intelligence

Alibaba Open‑Sources Spring AI Alibaba Admin: A Full‑Lifecycle AI Agent Platform

Spring AI Alibaba extends Spring AI with multi‑agent and enterprise features, but faces three engineering hurdles—inefficient prompt debugging, unguaranteed AI quality, and opaque operations—so Alibaba released Spring AI Alibaba Admin, offering prompt templating, dataset versioning, evaluator configuration, experiment management, and deep observability to streamline AI agent development and deployment.

AI agent · Dataset Versioning · Evaluator

0 likes · 8 min read

Alibaba Cloud Native

Oct 16, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Powers Data‑Centric AI Agent Development and Ops

This article outlines the industry shift toward large‑scale AI Agent deployment, identifies key engineering challenges such as prompt management, quality assessment, and observability, and presents Spring AI Alibaba Admin—a cloud‑native platform that offers prompt, dataset, evaluator, and tracing capabilities, complete with setup instructions and future roadmap.

AI agent · Java · Nacos

0 likes · 15 min read

Linux Ops Smart Journey

Oct 16, 2025 · Operations

Master Nightingale Monitoring: Add Data Sources, Query Metrics, Build Dashboards

This guide walks you through setting up the open‑source Nightingale monitoring platform—adding Prometheus as a data source, performing metric queries with PromQL, and creating visual dashboards—providing practical steps for building an observable, reliable operations environment.

Monitoring · Observability · Prometheus

0 likes · 5 min read

Huawei Cloud Developer Alliance

Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability

0 likes · 17 min read

MaGe Linux Operations

Oct 14, 2025 · Cloud Native

How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale

This article explains how the cloud‑native Loki logging system combined with S3 object storage can reduce PB‑level log storage expenses by 80‑90%, while simplifying operations, improving query performance, and meeting compliance requirements through detailed architecture, configuration, deployment, and real‑world case studies.

Log Management · Loki · Observability

0 likes · 23 min read

MaGe Linux Operations

Oct 12, 2025 · Operations

How to Balance Loki Tag Design and Chunk Compression to Tame Log Floods

Learn how to design low‑cardinality Loki tags, fine‑tune Chunk compression settings, and implement best‑practice configurations, pipelines, and monitoring to prevent memory overload, improve query performance, and efficiently manage massive log volumes in cloud‑native environments.

Log Management · Loki · Observability

0 likes · 38 min read

Cognitive Technology Team

Oct 12, 2025 · Backend Development

Resilient Microservices: Practical Patterns to Keep Your Services Alive

Learn how to tame chaotic microservices with practical resilience patterns—circuit breakers, bulkheads, smart retries, timeouts with fallbacks, and event‑driven messaging—plus tool recommendations and observability tips that ensure your system stays responsive even when individual services fail.

Observability · Resilience · Retry

0 likes · 9 min read

Su San Talks Tech

Oct 10, 2025 · Operations

How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and canary (gray‑release) deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Circuit Breaker · Deployment Strategies · Observability

0 likes · 19 min read

Linux Code Review Hub

Oct 9, 2025 · Operations

Non‑Intrusive MCP Observability with eBPF: Introducing MCPSpy

The article explains how the emerging Model Context Protocol (MCP) for AI tools lacks visibility, outlines security and monitoring challenges, compares alternative tracing methods, and presents MCPSpy—a Linux‑only eBPF‑based, non‑intrusive solution that captures MCP stdio traffic, parses JSON‑RPC messages, and outputs human‑readable or JSON logs.

AI security · Go · MCP

0 likes · 17 min read

Radish, Keep Going!

Oct 9, 2025 · Operations

Add Observability to Legacy Java Apps with OpenTelemetry Agent (Zero Code)

This guide shows how to use the OpenTelemetry Java Agent to instantly add observability—metrics, traces, and error reporting—to long‑standing legacy Java applications without modifying a single line of code, covering setup, environment configuration, health monitoring, performance tracing, and visualizing data in Grafana.

Java · Monitoring · Observability

0 likes · 7 min read

MaGe Linux Operations

Oct 7, 2025 · Operations

7 Fatal Monitoring Alert Mistakes That Keep You Up at 3 AM—and How to Fix Them

This article examines why ops engineers are repeatedly woken by false alerts, outlines seven common monitoring alert pitfalls—from over‑alerting to static thresholds—and provides practical solutions such as golden‑signal rules, dynamic baselines, alert enrichment, routing, suppression, and continuous quality audits.

DevOps · Monitoring · Observability

0 likes · 27 min read

Architect's Guide

Oct 7, 2025 · Backend Development

Mastering Backend Architecture: From Microservices to Service Mesh and Message Queues

This article presents a comprehensive roadmap for backend architects, covering microservice fundamentals, design principles, gateway patterns, communication protocols, service registration, configuration management, observability pillars, service mesh options, and a detailed comparison of modern message‑queue technologies.

Cloud Native · Message Queue · Microservices

0 likes · 29 min read

IT Architects Alliance

Oct 6, 2025 · Cloud Native

Mastering Cloud‑Native Observability: From Metrics to Tracing

The article explains why enterprises struggle with cloud‑native observability, outlines the exponential complexity and dynamic nature of modern microservice environments, and presents a comprehensive three‑pillar approach—metrics, logging, tracing—along with practical Prometheus, OpenTelemetry, and sidecar configurations, storage choices, sampling, alerting, cost‑control, team upskilling, and future trends such as AIOps and eBPF.

Cloud Native · Observability · OpenTelemetry

0 likes · 12 min read

MaGe Linux Operations

Oct 6, 2025 · Cloud Native

Prometheus vs Cloud Provider Monitoring: Which Is the Most Cost‑Effective Choice for 2025?

This article compares open‑source Prometheus + Grafana with managed cloud monitoring services, evaluating deployment complexity, functionality, scalability, security, and total cost of ownership across small, medium, and large workloads, and provides practical decision‑making guidance for teams of different sizes and requirements.

Monitoring · Observability · Prometheus

0 likes · 56 min read

MaGe Linux Operations

Oct 5, 2025 · Operations

ELK vs EFK vs Loki: Which Log Solution Saves Money and Boosts Performance?

This in‑depth technical guide compares ELK, EFK, and Loki across cost, performance, deployment complexity, feature completeness, and suitability for small‑to‑large teams, providing real‑world case studies, decision trees, migration steps, and cost‑optimization tips to help you choose the most efficient logging stack for your organization.

EFK · ELK · Log Management

0 likes · 39 min read

IT Architects Alliance

Oct 2, 2025 · Cloud Native

Mastering Cloud‑Native Architecture: 6 Core Principles Every Engineer Should Know

This article outlines six fundamental cloud‑native architecture principles—immutable infrastructure, service mesh, observability, declarative APIs, resilient design, and shift‑left security—explaining their purpose, key practices, code examples, and how they interrelate to build scalable, reliable, and secure distributed systems.

Cloud Native · Declarative API · Observability

0 likes · 11 min read

Alibaba Cloud Observability

Sep 29, 2025 · Cloud Native

How Bull Group Boosted Observability by Migrating from SkyWalking to Alibaba Cloud ARMS

This article details Bull Group's journey from an open‑source SkyWalking monitoring setup to Alibaba Cloud ARMS, outlining the architectural challenges, technical selection criteria, migration steps, and the resulting improvements in observability, AI‑IoT integration, and operational efficiency.

AI · APM · Alibaba Cloud

0 likes · 19 min read

Linux Ops Smart Journey

Sep 25, 2025 · Cloud Native

How to Monitor Envoy Metrics with Prometheus, Grafana, and Nacos

This guide explains how to enable Envoy's admin interface, register the service with Nacos, scrape metrics using Prometheus, and visualize them in Grafana, providing a complete observability pipeline for cloud‑native deployments.

Cloud Native · Envoy · Grafana

0 likes · 4 min read

Tech Freedom Circle

Sep 25, 2025 · Operations

RAGFlow Link Tracing: GPS‑Style Observability for LLM‑Powered Applications

The article explains why RAGFlow needs end‑to‑end link tracing, introduces OpenTelemetry’s core concepts, shows how custom tracing utilities are implemented in Python, describes the layered architecture, provides concrete Docker and YAML configurations, and offers best‑practice guidelines for performance monitoring and fault diagnosis.

Distributed Systems · LLM · Observability

0 likes · 24 min read

IT Architects Alliance

Sep 20, 2025 · Operations

Mastering Microservice Governance: Tracing, Config, and Monitoring Strategies

This article explores the three core challenges of microservice governance—distributed tracing, centralized configuration management, and comprehensive monitoring—offering practical solutions, tool comparisons, and best‑practice guidelines to help architects build reliable, observable, and maintainable systems.

Cloud Native · Configuration Management · Monitoring

0 likes · 12 min read

MaGe Linux Operations

Sep 18, 2025 · Cloud Native

Master Helm: Proven Best Practices for Kubernetes Deployments

This comprehensive guide walks you through Helm's architecture, chart structuring, template development, dependency management, production deployment strategies, security hardening, observability integration, testing, performance tuning, and enterprise governance, providing actionable examples and code snippets to help you become a Helm expert in cloud‑native environments.

Deployment · Observability · chart

0 likes · 22 min read

Ops Community

Sep 15, 2025 · Cloud Native

Master Kubernetes Log Collection: From Basics to Advanced EFK & Loki Solutions

This comprehensive guide explains why log management is critical for large Kubernetes clusters, outlines common pain points, presents full‑stack architectures, details EFK and Loki implementations with code samples, and offers performance, security, cost‑optimization, and future‑trend recommendations.

Cloud Native · EFK · Kubernetes

0 likes · 16 min read

Alibaba Cloud Developer

Sep 12, 2025 · Operations

How to Build End‑to‑End Observability for Large‑Model Applications on Alibaba Cloud

This guide explains how to design and implement a complete observability solution for large‑model AI services on Alibaba Cloud, covering architecture, core metrics, logging standards, demo code, log collection, dashboard design, alerting, monitoring tools, troubleshooting SOPs, and recovery procedures.

AI Operations · Alibaba Cloud · Observability

0 likes · 21 min read

dbaplus Community

Sep 11, 2025 · Cloud Native

Building a Scalable Kubernetes Monitoring Architecture and Alert Management

This guide presents a comprehensive, layered Kubernetes monitoring architecture—including control plane, node, resource, and extension layers—detailing high‑availability Prometheus deployment, alert grouping strategies, custom CRD metrics, visualization dashboards, and practical best‑practice recommendations for reliable observability in cloud‑native environments.

Cloud Native · Kubernetes · Observability

0 likes · 11 min read

Ops Community

Sep 8, 2025 · Operations

Mastering Distributed Log Architecture: From Flume to ELK and Beyond

This comprehensive guide walks you through the challenges of large‑scale log collection, real‑time processing, storage optimization, and visualization, detailing practical configurations for Flume, Logstash, Elasticsearch, Kibana, Filebeat, Kafka, Kubernetes, and future AIOps integrations to build a reliable, cost‑effective distributed logging system.

ELK · Flume · Kafka

0 likes · 24 min read

Tech Freedom Circle

Sep 4, 2025 · Backend Development

How to Solve ES Latency in MySQL‑Canal Sync and Indexing Scenarios?

The article dissects the interview question about ES latency in a MySQL‑Canal‑to‑Elasticsearch pipeline, explains the root causes across four system layers, and presents a four‑layer optimization plan, end‑to‑end observability, routing‑based degradation, and a Java‑based LatencyProbe component to measure and control delay.

Canal · Elasticsearch · Kafka

0 likes · 17 min read

Java One

Sep 3, 2025 · Operations

How to Install, Configure, and Run Prometheus: A Step‑by‑Step Guide

This guide walks you through installing Prometheus via binary download, configuring global scrape settings and job definitions, running the server with command‑line options, and using the web UI and PromQL to verify target health and query metrics, illustrated with screenshots and example queries.

Installation · Observability · PromQL

0 likes · 6 min read

Java One

Sep 1, 2025 · Cloud Native

How Prometheus Transforms Cloud‑Native Monitoring: Architecture, Data Model, and PromQL Basics

This article explains Prometheus' origins, open‑source development, CNCF graduation, core components, time‑series data model, text‑based metric protocol, powerful PromQL queries, service discovery mechanisms, and alerting practices, providing a comprehensive guide for cloud‑native observability.

Cloud Native · Observability · PromQL

0 likes · 8 min read

Architect's Guide

Sep 1, 2025 · Operations

How Does Distributed Link Tracing Work? Inside SkyWalking’s Architecture

This article explains the concept of distributed link tracing, its principles, metrics, and implementation details—including monolithic and microservice approaches, OpenTracing standards, and how SkyWalking solves challenges like automatic span collection, context propagation, unique trace IDs, and sampling performance.

Microservices · Observability · OpenTracing

0 likes · 12 min read

Alibaba Cloud Native

Aug 31, 2025 · Cloud Native

How Ctrip Scaled AI Model Access with Higress: Architecture, Challenges, and Solutions

Ctrip’s R&D team built an AI gateway using Higress to unify access to diverse large‑model services, addressing authentication, traffic control, fault tolerance, monitoring, and integration with internal MCP platforms, while sharing practical lessons and future plans.

Higress · MCP integration · Observability

0 likes · 14 min read

php Courses

Aug 29, 2025 · Operations

How to Build a Real‑Time PHP Log Event Pipeline for Instant Insights

Learn how to transform PHP logs into real‑time, structured events by implementing a log event pipeline that includes JSON logging, lightweight collectors like Filebeat, streaming platforms such as Kafka or Flink, enrichment, and visualization with Grafana, enabling instant monitoring, alerting, and data‑driven decisions.

Flink · Grafana · Kafka

0 likes · 7 min read

Nightwalker Tech

Aug 28, 2025 · Operations

How to Diagnose and Fix E‑commerce Order Failures with Observability, APM, and Distributed Tracing

This article explains the hierarchical relationship between APM, distributed tracing, and observability, walks through a real Double‑11 e‑commerce incident, and demonstrates how a well‑designed observability stack can pinpoint the root cause, apply emergency fixes, and restore system performance within minutes.

APM · Fault Diagnosis · Microservices

0 likes · 16 min read

Xiaohongshu Tech REDtech

Aug 27, 2025 · Databases

How RedHub Revolutionizes Database Access for Billion‑User Scale

RedHub is a next‑generation database proxy built by Xiaohongshu that unifies fragmented access methods, leverages PolarDB‑X for distributed SQL execution, and delivers high‑performance, highly available, and easily observable database connectivity, enabling seamless migration and advanced features for massive‑scale workloads.

Database Proxy · Distributed SQL · Observability

0 likes · 15 min read

Su San Talks Tech

Aug 27, 2025 · Backend Development

Master Distributed Tracing with SkyWalking: Principles, Architecture & Practices

This article explains the fundamentals of distributed tracing in microservice architectures, details the OpenTracing standard, examines SkyWalking’s design, sampling strategies, context propagation, and plugin development, and shares practical implementation experiences and performance comparisons, helping engineers choose and integrate effective tracing solutions.

Java · Microservices · Observability

0 likes · 19 min read

Tencent Cloud Developer

Aug 26, 2025 · Artificial Intelligence

Building a Scalable, Observable Recommendation Scheduling Engine from Scratch

This article explains how recommendation systems work, distinguishes online services from offline computation, outlines a typical recommendation flow, and presents a three‑stage evolution (1.0, 2.0, 3.0) with design principles for stability, observability, and efficiency, culminating in a DAG‑based orchestration and traceable execution.

AI · Observability · Scalability

0 likes · 13 min read

Wuming AI

Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI Agents · Infrastructure · LLM

0 likes · 5 min read

Kuaishou Tech

Aug 20, 2025 · Frontend Development

How AI Is Transforming Frontend Development: Highlights from Kuaishou’s Tech Salon

The Kuaishou AI‑driven Frontend Technology Evolution salon gathered over 300 engineers and 46,000 online viewers to showcase how AI is reshaping large‑scale front‑end development across business, R&D, and infrastructure, with deep dives into AI‑native platforms, AIDevOps, intelligent agents, AI‑powered D2C, and observability.

AI · AIDevOps · Agent

0 likes · 11 min read

dbaplus Community

Aug 19, 2025 · Operations

Avoid These 10 System Architecture Sins That Sabotage Scaling

The article enumerates ten deadly system‑architecture mistakes—such as assuming natural scaling, treating microservices as monoliths, ignoring eventual consistency, over‑relying on a single database, lacking observability, over‑designing, mixing stateful logic, skipping chaos testing, underestimating third‑party risk, and ignoring human cost—providing concrete code examples, diagrams, and actionable lessons to prevent costly failures at scale.

Microservices · Observability · Performance

0 likes · 10 min read

360 Zhihui Cloud Developer

Aug 8, 2025 · Operations

Quickly Deploy Prometheus Nginx Log Exporter for Deep Nginx Monitoring

This guide explains how to install and configure the prometheus-nginxlog-exporter in the Yunzhou Observability platform, covering its core features, metric types, one‑click deployment steps, chart visualization, alert rule setup, and common troubleshooting tips for comprehensive Nginx monitoring.

Exporter · Observability · Prometheus

0 likes · 9 min read

Didi Tech

Aug 7, 2025 · Cloud Native

How HUATUO Revolutionizes Cloud‑Native Observability with Zero‑Impact BPF Tracing

HUATUO, Didi's open‑source cloud‑native observability project, leverages BPF‑based low‑overhead kernel tracing, unified metric and event frameworks, automatic flame‑graph generation, and seamless integration with Prometheus, Grafana and Elasticsearch to provide panoramic, zero‑intrusive monitoring and continuous performance profiling for complex production environments.

BPF · Cloud Native · Distributed Systems

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Aug 6, 2025 · Operations

How Alibaba Cloud’s Serverless Elasticsearch Powers Data‑Driven Operations

Alibaba Cloud’s Serverless Elasticsearch service, combined with the SREWorks data‑driven operations platform, offers a cloud‑native, real‑time search and analytics engine that integrates metric and log collection, cost management, and health monitoring to enhance scalability, performance, and operational efficiency for enterprise applications.

Cloud Native · DataOps · Elasticsearch

0 likes · 11 min read

StarRocks

Aug 6, 2025 · Databases

How Qunar Migrated to StarRocks: Architecture, Performance Gains & Best Practices

This article details Qunar's transition to StarRocks as a unified OLAP engine, covering the business background, engine evaluation, architecture redesign, observability, high‑availability strategies, query‑performance optimizations, real‑world application cases, community contributions, and future plans.

Data Platform · Migration · OLAP

0 likes · 21 min read

Alibaba Cloud Observability

Aug 4, 2025 · Cloud Native

How LoongCollector Redefines Observability for Cloud‑Native AI Workloads

LoongCollector, the core component of Alibaba Cloud's LoongSuite, delivers zero‑intrusion, multi‑tenant, high‑performance data collection and processing for AI services, enabling full‑stack observability across logs, metrics, traces, events and profiles in cloud‑native environments.

AI · Kubernetes · Observability

0 likes · 16 min read

Qunar Tech Salon

Jul 22, 2025 · Databases

Quark’s Data Platform Upgrade with StarRocks: Architecture, Performance, Roadmap

This article details how Quark’s data platform consolidated multiple analytics engines into a unified StarRocks‑based OLAP solution, covering business background, engine selection, architecture redesign, performance tuning, operational practices, and future plans for scalability and reliability.

Data Platform · Kubernetes · OLAP

0 likes · 19 min read

DevOps Operations Practice

Jul 22, 2025 · Operations

Top 7 DevOps Best Practices to Accelerate Delivery and Boost Reliability

These seven essential DevOps best practices—from cultural transformation and full automation to continuous integration, observability, security, cloud-native microservices, and performance optimization—guide teams in accelerating software delivery, enhancing quality, ensuring reliability, and reducing costs through collaborative, automated, and measurable processes.

Cloud Native · DevOps · Observability

0 likes · 4 min read

Alibaba Cloud Native

Jul 18, 2025 · Artificial Intelligence

How AI Agent Architecture Is Evolving to Redefine Software Engineering

The article outlines the rapid evolution of AI Agent technology stacks, detailing multi‑dimensional development across perception, decision, memory, and tool integration, while highlighting cloud‑native deployment models, observability challenges, and the open‑source LoongSuite, which provides high‑performance, low‑cost monitoring for AI workloads.

AI agent · LoongSuite · Observability

0 likes · 19 min read

Efficient Ops

Jul 15, 2025 · Operations

Top Open‑Source Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, and More

This article reviews the most popular log‑management solutions, summarizing each tool's core features, pricing model, advantages, and drawbacks to help readers choose the right logging stack for their observability needs.

ELK · Grafana Loki · Log Management

0 likes · 16 min read

Ops Development & AI Practice

Jul 12, 2025 · Cloud Native

Mastering Observability: A Deep Dive into OpenTelemetry’s Architecture

This article explains OpenTelemetry’s purpose, three‑layer architecture (instrumentation, collector, backend), practical Go instrumentation code, and how the collector processes and exports telemetry to both open‑source and SaaS backends, helping developers avoid vendor lock‑in and achieve unified observability.

Collector · Instrumentation · Observability

0 likes · 9 min read

DeWu Technology

Jul 7, 2025 · Cloud Native

How to Achieve Service‑Level NAS Traffic Tracing with eBPF and Kubernetes

This article explains how to design and implement a service‑level NAS traffic tracing solution using Linux eBPF, NFS kernel hooks, and Kubernetes metadata to correlate container processes with NAS devices, generate real‑time metrics, and visualize them in Prometheus dashboards.

Kubernetes · Metrics · NAS

0 likes · 18 min read

Java Architect Essentials

Jul 6, 2025 · Operations

How Logback, MDC, and ELK Can Rescue Your Nighttime Log Chaos

This article explains how chaotic, multi‑framework logging in Java microservices leads to debugging nightmares, and demonstrates a three‑step solution—standardizing on Logback, adding traceable MDC identifiers, and visualizing logs with ELK—to achieve unified log formats, sensitive data masking, and dramatically faster issue resolution.

ELK · MDC · Observability

0 likes · 10 min read

Alibaba Cloud Native

Jul 1, 2025 · Cloud Native

How Alibaba Cloud Function Compute Uses OpenTelemetry for Full‑Stack Tracing

The article explains how Alibaba Cloud Function Compute upgraded its tracing capabilities from Jaeger 2.0 to the OpenTelemetry W3C standard, delivering end‑to‑end observability, transparent cold‑start analysis, cross‑environment context propagation, dynamic sampling, and AI‑assisted debugging for serverless workloads.

Function Compute · Observability · OpenTelemetry

0 likes · 6 min read

macrozheng

Jul 1, 2025 · Operations

Best Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, Datadog & More

This article provides a comprehensive comparison of popular log management solutions—including Filebeat, Graylog, the Elastic (ELK) stack, Grafana Loki, LogDNA, Datadog, Logstash, Fluentd, and Splunk—detailing their main features, pricing models, advantages, and drawbacks to help you choose the right tool for your needs.

ELK Stack · Log Management · Observability

0 likes · 16 min read

Alibaba Cloud Native

Jun 28, 2025 · Cloud Native

Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster using llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and demonstrates end‑to‑end request testing.

Fallback · Higress · Observability

0 likes · 15 min read

AI Algorithm Path

Jun 26, 2025 · Artificial Intelligence

The 10 Essential Components of a Retrieval‑Augmented Generation (RAG) System

This guide breaks down the ten core building blocks of a production‑ready RAG pipeline—from input handling and vector stores to prompt engineering, LLM inference, observability, and evaluation—showing why each piece matters, common pitfalls, and practical best‑practice recommendations.

LLM · Observability · RAG

0 likes · 9 min read

Alibaba Cloud Observability

Jun 24, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Observability Reliable

This article examines common log‑management anti‑patterns—such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits—explains why they cause data loss or duplicate collection, and offers practical best‑practice recommendations for reliable log handling in cloud‑native environments.

Anti-Patterns · Best Practices · Observability

0 likes · 8 min read

AI Large Model Application Practice

Jun 23, 2025 · Databases

How Google’s MCP Toolbox Simplifies Enterprise Database Access for LLM Agents

This guide explains Google’s open‑source MCP Toolbox for Databases, covering its core concepts, installation, configuration, two usage modes (native SDK and MCP), example LangGraph agent integration, security features, observability, and practical code snippets for building reliable LLM‑driven database tools.

LLM agents · MCP Toolbox · Observability

0 likes · 11 min read

Tencent Technical Engineering

Jun 20, 2025 · Artificial Intelligence

Mastering AI Agents: Core Concepts, Protocols, and Golang Frameworks for Multi‑Agent Collaboration

This comprehensive article explores the evolution of AI agents, explains key protocols like MCP and A2A, compares reasoning frameworks such as CoT, ReAct, and Plan‑and‑Execute, and demonstrates how Golang frameworks Eino and tRPC‑A2A‑Go enable elegant development, orchestration, and observability of complex multi‑agent systems with practical code examples and visual diagrams.

A2A · AI agent · Eino

0 likes · 55 min read

Alibaba Cloud Developer

Jun 17, 2025 · Artificial Intelligence

Why AI Agent Engineering Is the Missing Link to Scalable, Usable AI

This article dissects AI Agent engineering into product and technical dimensions, explaining how demand modeling, UI/UX design, prompt engineering, multi‑agent architecture, feedback loops, security, and observability together determine whether an AI assistant is usable, reliable, and ready for large‑scale deployment.

AI agent · Engineering · Observability

0 likes · 22 min read

Alibaba Cloud Native

Jun 12, 2025 · Artificial Intelligence

Why AI Agent Engineering Matters: From Product Design to Technical Architecture

This article breaks down AI agent engineering into product and technical engineering, explains how demand modeling, UI/UX design, prompt engineering, multi‑agent coordination, and observability combine to make AI agents usable, scalable, and trustworthy, and shows concrete examples and implementation patterns.

AI · Observability · agent engineering

0 likes · 23 min read

vivo Internet Technology

Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP (Kafka‑on‑Pulsar) lag reporting issues.

Big Data · Metrics · Monitoring

0 likes · 12 min read

Liangxu Linux

Jun 10, 2025 · Cloud Native

Why Loki Is the Ideal Cloud‑Native Log Aggregator for Prometheus & Grafana

Loki, an open‑source log aggregation system from Grafana Labs, integrates tightly with Prometheus and Grafana, stores logs efficiently using object storage, offers a simple label‑based model, and provides cost‑effective, high‑performance logging for cloud‑native environments. The article outlines its architecture, usage, configuration, advantages, limitations, and retention policies.

Cloud Native · Grafana · Loki

0 likes · 10 min read

Big Data Technology Tribe

Jun 10, 2025 · Cloud Native

Mastering eBPF Maps: Design, Implementation, and Real‑World Use Cases

This article provides an in‑depth analysis of BPF maps—explaining their design principles, core features, various map types with code examples, and the macro expansion process that turns high‑level BCC helpers into native kernel map definitions for cloud‑native observability.

BCC · BPF maps · Linux kernel

0 likes · 12 min read

JakartaEE China Community

Jun 9, 2025 · Cloud Native

How to Choose the Right Cloud‑Native Microservice Framework (MicroProfile vs Spring)

This article explains why cloud‑native microservices are beneficial, defines their key characteristics, compares the MicroProfile and Spring frameworks, and provides detailed code examples for REST APIs, configuration, fault tolerance, security, health checks, metrics, and distributed tracing to help developers select the most suitable technology stack.

Cloud Native · Kubernetes · MicroProfile

0 likes · 26 min read

Alibaba Cloud Developer

Jun 6, 2025 · Big Data

Why Observability 2.0 and SLS Data Pipelines Are Revolutionizing Log Analytics

This article explains how Observability 2.0 reshapes log, metric and trace management by unifying health views, introduces the evolution of Alibaba Cloud's SLS data pipeline, compares its three service modes, and demonstrates performance, cost and integration benefits for large‑scale, real‑time log processing.

Big Data · Observability · SLS

0 likes · 11 min read

JavaEdge

Jun 5, 2025 · Artificial Intelligence

How Amazon’s Strands Agents SDK Simplifies Building AI Agents

Amazon’s newly open‑sourced Strands Agents SDK lets developers create AI agents with minimal code by defining prompts, tools, and models, offering a lightweight, production‑ready framework that supports multiple model providers, observability, multi‑agent collaboration, and extensible tooling via dedicated packages.

AI Agents · Amazon · LLM

0 likes · 7 min read

Linux Ops Smart Journey

May 29, 2025 · Cloud Native

Master Kubernetes Monitoring with kube-state-metrics and Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus scrape jobs, verifying metric collection, and adding Grafana dashboards to achieve a visible, manageable, and reliable Kubernetes monitoring solution for large‑scale clusters.

Kubernetes · Monitoring · Observability

0 likes · 7 min read

Java Architecture Diary

May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Micrometer · Monitoring · Observability

0 likes · 12 min read

Programmer DD

May 21, 2025 · Artificial Intelligence

What’s New in Spring AI 1.0 GA? A Deep Dive into Java AI Features

Spring AI 1.0 GA introduces a comprehensive suite of AI capabilities for Java developers, including a ChatClient supporting 20 models, vector‑store integrations, RAG pipelines, advanced chat memory, @Tool function calling, model evaluation, observability, Model Context Protocol, and autonomous agents, with examples for major cloud providers.

AI models · Java · MCP

0 likes · 6 min read

dbaplus Community

May 20, 2025 · Operations

How to Build a Production‑Ready, High‑Availability Kubernetes Cluster from Scratch

This guide walks through designing, deploying, securing, monitoring, backing up, and maintaining a production‑grade Kubernetes cluster, sharing real‑world pitfalls, configuration snippets, and best‑practice recommendations for high availability, security, observability, and upgrade strategies.

Kubernetes · Observability · backup

0 likes · 11 min read

Alibaba Cloud Native

May 20, 2025 · Cloud Native

How Observability 2.0 Redefines Cloud‑Native Log Pipelines and Cuts Costs by 66%

Observability 2.0 unifies logs, metrics and traces into a single platform, introduces event‑centric Wide Events, and drives a complete redesign of Alibaba Cloud's SLS data pipeline that delivers higher performance, lower latency, richer low‑code SPL processing, and up to a 66.7% reduction in processing costs.

Observability · Performance · SPL

0 likes · 12 min read

Alibaba Cloud Observability

May 19, 2025 · Information Security

How Tool‑Poisoning Attacks Exploit MCP and What to Do About It

This article analyzes the security risks of the Model Context Protocol (MCP), demonstrates a tool‑poisoning attack that steals private keys via malicious tool descriptions, explores client‑side and server‑side threat vectors, and presents observability‑based mitigation using eBPF and LoongCollector.

AI model security · MCP · Observability

0 likes · 23 min read

Alibaba Cloud Observability

May 19, 2025 · Cloud Native

How LoongCollector Transforms Log Collection with High‑Performance Pipelines

LoongCollector, the 2025 evolution of iLogtail, introduces a fully redesigned pipeline architecture, hot‑reload isolation, significant CPU and memory reductions, and advanced monitoring, delivering up to 80% higher log‑collection throughput for cloud‑native environments while ensuring seamless upgrades and multi‑region fault tolerance.

Observability · log collection · pipeline

0 likes · 14 min read

Alibaba Cloud Developer

May 16, 2025 · Artificial Intelligence

Designing Robust MCP Servers for Alibaba Cloud Observability 2.0 – Lessons & Best Practices

This article explains the Model Context Protocol (MCP), its components, and how to integrate MCP servers with Alibaba Cloud Observability 2.0, offering practical design experiences, tool simplification tips, default parameter strategies, output size control, and future AI‑driven observability insights.

LLM · MCP · Observability

0 likes · 17 min read

dbaplus Community

May 11, 2025 · Operations

Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

Golden Signals · Observability · Prometheus

0 likes · 12 min read

Bilibili Tech

May 9, 2025 · Artificial Intelligence

How an AI Gateway Scales LLM Services: Architecture, Auth, Quotas, and Load Balancing

This article explains the design of an AI gateway that centralizes LLM access, detailing its background, overall architecture, authentication, quota management, multi‑model routing, load‑balancing strategies, multi‑tenant isolation, observability features, and the supported API protocols for enterprise integration.

AI gateway · Authentication · LLM

0 likes · 17 min read

StarRocks

May 8, 2025 · Backend Development

How Grab Supercharged Spark Observability 10× with StarRocks – Inside the Iris Architecture

Grab replaced its fragmented Grafana‑Superset stack with a StarRocks‑backed Iris platform, achieving over ten‑fold query speedups, 40% lower resource usage, and a unified real‑time and historical data store for Spark observability across its Southeast Asian super‑app ecosystem.

Data Platform · Kafka · Materialized Views

0 likes · 16 min read

Liangxu Linux

May 7, 2025 · Operations

How to Install and Configure Loki, Promtail, and Grafana for Log Aggregation on Rocky Linux

This step‑by‑step guide shows how to install Loki, configure its YAML file, set up Promtail to ship logs, install Grafana, add Loki as a data source, and use LogQL to query logs—including collecting Nginx JSON logs—on a Rocky Linux system.

Grafana · LogQL · Loki

0 likes · 10 min read

Efficient Ops

May 7, 2025 · Operations

Why Choose SigNoz for Open‑Source Observability? A Deep Dive

This article introduces SigNoz, a self‑hosted open‑source observability platform that unifies metrics, logs, and traces, outlines its core capabilities, shows how to install it with Docker, and compares its resource efficiency to commercial solutions like DataDog and Elastic.

Metrics · Observability · OpenTelemetry

0 likes · 4 min read

macrozheng

May 7, 2025 · Backend Development

What’s New in Spring Boot 3.5? 13 Must‑Know Features for Java Backend Developers

Spring Boot 3.5 introduces a suite of enhancements—including task decorator support, the Vibur connection pool, SSL health metrics, flexible configuration loading, automatic Trace‑ID headers, richer Actuator capabilities, functional programming hooks, and many more—each explained with code examples and practical usage tips for modern Java backend development.

Backend Development · DevOps · Microservices

0 likes · 10 min read

Java Architecture Diary

May 6, 2025 · Backend Development

Spring Boot 3.5 Release: Top 13 New Features You Must Know

Spring Boot 3.5 introduces a suite of powerful enhancements—including task decorator support, a new Vibur connection pool, SSL monitoring, flexible environment variable loading, Actuator-triggered Quartz jobs, automatic Trace ID headers, structured log customization, functional routing insights, expanded SSL client support, OpenTelemetry upgrades, Spring Batch tweaks, OAuth 2.0 JWT profiling, and functional bean registration—providing developers with richer capabilities for modern Java backend applications.

Backend Development · Observability · Spring Boot

0 likes · 11 min read

Linux Kernel Journey

May 5, 2025 · Operations

Reflections on the 3rd eBPF Developer Conference: Harnessing eBPF for AI

The article recaps the 3rd eBPF Developer Conference in Xi'an, highlighting talks on BPF‑on‑MPTCP, system‑wide PGO, bperf, autonomous‑driving use cases, and AI‑driven observability, while sharing the author's insights on continuous profiling, SysOM, and future challenges of scaling eBPF with large models.

AI · Linux · Observability

0 likes · 10 min read

Raymond Ops

Apr 30, 2025 · Cloud Native

Master Loki Logging: Step-by-Step Kubernetes Deployment & Troubleshooting Guide

This comprehensive guide explains Loki's lightweight log aggregation architecture, compares it with ELK, details AllInOne, Helm, Kubernetes, and bare‑metal deployment methods, shows Promtail and Logstash integration, and provides practical troubleshooting tips for common issues.

Loki · Observability · Promtail

0 likes · 23 min read

Efficient Ops

Apr 29, 2025 · Operations

Master Linux Performance: Essential Monitoring Tools & Commands

This guide compiles the most important Linux performance analysis utilities—such as vmstat, iostat, dstat, iotop, pidstat, top, htop, mpstat, netstat, ps, strace, uptime, lsof, and perf—explaining their usage, output fields, and how they fit into a comprehensive system observability workflow.

Linux · Observability · System Administration

0 likes · 15 min read

Efficient Ops

Apr 25, 2025 · Operations

How Changan Auto Earned Top‑Tier DevOps Certification with a Full‑Link Observability Platform

Changan Automobile’s full‑link observability platform passed both ITU DevOps international and domestic standards assessments, showcasing its advanced monitoring capabilities, improved system stability, and strategic role in the company’s digital transformation, while the interview reveals implementation challenges, benefits, and future AI‑driven enhancements.

DevOps · Full‑Link Monitoring · Observability

0 likes · 21 min read

Alibaba Cloud Native

Apr 23, 2025 · Cloud Native

Diagnosing Slow Deployments in Alibaba Cloud SAE: A Visualized, Step‑by‑Step Guide

This article analyzes the common pain points of Alibaba Cloud Serverless App Engine (SAE) deployments—slow release times, opaque status details, and black‑box instance startup—then presents a visualized, observable, and explainable solution that pinpoints bottlenecks, offers concrete optimizations, and demonstrates the resulting performance improvements.

Alibaba Cloud · Deployment Optimization · Observability

0 likes · 11 min read

Baidu Geek Talk

Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining, cutting degradation cases by ~40% and shifting operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Cloud Native · Digital Immunity

0 likes · 14 min read

Linux Kernel Journey

Apr 23, 2025 · Industry Insights

Highlights from the 3rd eBPF Developer Conference: A Technical Recap

The 3rd eBPF Developer Conference, held on April 19, 2025 at Xi'an University of Posts and Telecommunications, featured 36 expert talks on eBPF advancements, network and security innovations, observability, and performance optimization, along with a vibrant project marketplace and student projects. Video and PPT resources are available to the community.

Linux kernel · Observability · Open Source

0 likes · 7 min read

dbaplus Community

Apr 22, 2025 · Backend Development

Explore Elasticsearch 9.0: Performance Boosts, AI Features & Security Upgrades

Elasticsearch 9.0, released on April 15, 2025, builds on Lucene 10.1.0 to deliver major performance gains, introduces Better Binary Quantization, Elastic Distributions of OpenTelemetry, LLM observability, AI‑driven attack discovery, enhanced ES|QL, and is available via Elastic Cloud with deployment tips and examples.

AI · Elasticsearch · Observability

0 likes · 7 min read

Zhuanzhuan Tech

Apr 16, 2025 · Backend Development

Analyzing Log4j2 Asynchronous Logging Blocking and Strategies for Fine-Grained Log Control

This article examines the causes of Log4j2 asynchronous logging blockage in high‑throughput Java services, explains the underlying Disruptor mechanics, and proposes a dual‑track logging architecture with compile‑time bytecode enhancement and IDE plugins for line‑level log activation.

Java · Logging Strategy · Observability

0 likes · 15 min read

21CTO

Apr 9, 2025 · Operations

9 Must‑Have Container Monitoring Tools and Best Practices for Modern Cloud‑Native Environments

This article reviews nine practical container‑monitoring solutions—from Last9 and Prometheus to Dynatrace and Elastic Observability—detailing their key features, pricing, and why developers prefer them, and then offers comprehensive best‑practice guidance for metrics, tagging, alerts, and advanced observability strategies in Kubernetes‑driven cloud‑native deployments.

Cloud Native · DevOps · Kubernetes

0 likes · 25 min read

Liangxu Linux

Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability

0 likes · 13 min read

ByteDance Cloud Native

Apr 3, 2025 · Operations

How to Seamlessly Integrate CloudWeGo with APMPlus for Full‑Stack Observability

This article explains the challenges of observability in distributed microservice and LLM architectures, introduces CloudWeGo and APMPlus, and provides step‑by‑step integration guides for the Kitex, Hertz, and Eino frameworks, covering code samples, data reporting methods, advanced monitoring features (RED metrics, LLM‑specific indicators, and service topology), and the future roadmap.

APM · APMPlus · CloudWeGo

0 likes · 13 min read

Volcano Engine Developer Services

Apr 1, 2025 · Artificial Intelligence

Taming High Cardinality in AI Model & Autonomous Driving Monitoring with Prometheus

This article explores how high cardinality in Prometheus metrics impacts AI large‑model and autonomous‑driving observability, explains the underlying concepts, outlines the performance and cost challenges, and presents practical design, collection, and query‑side solutions—including metric modeling, pre‑aggregation, and remote‑read pushdown—to keep monitoring efficient and scalable.

AI Monitoring · Autonomous Driving · Cardinality

0 likes · 12 min read
