Tagged articles
983 articles
Java Tech Enthusiast
Oct 28, 2025 · Backend Development

Why Rewriting a Java Microservice in Rust Cut Costs and Boosted Performance

A senior engineer recounts how replacing a noisy Java billing microservice with a lean Rust implementation slashed latency, reduced CPU and memory usage, lowered infrastructure bills, and exposed cultural and organizational challenges, offering a practical roadmap for teams considering similar migrations.

Observability · Rust · Service Migration
0 likes · 11 min read
Alibaba Cloud Observability
Oct 27, 2025 · Operations

From Data Silos to Intelligent Insights: Building Future‑Ready Operation Intelligence

This article explains how enterprises can transform massive, fragmented operation data—technical, business, and security—into high‑value intelligent signals by unifying storage, enriching context, applying AI, and delivering a single, observable platform that enables proactive, data‑driven decision making.

AI · Data Platform · Observability
0 likes · 18 min read
DevOps Coach
Oct 22, 2025 · Cloud Native

Simplify Scalable Kubernetes Pod Logging with Grafana podLogs

This guide explains how Grafana's podLogs feature, powered by Vector.dev, transforms raw Kubernetes pod logs into enriched, searchable, cluster‑wide observability data, covering why pod‑level logs matter, configuration steps, advanced custom log paths, and practical examples.

Cloud Native · Grafana · Kubernetes
0 likes · 14 min read
IT Architects Alliance
Oct 22, 2025 · Cloud Native

Avoid the Top 5 Cloud Migration Mistakes: Proven Cloud‑Native Strategies

This article analyzes the five most common cloud‑migration pitfalls—lift‑and‑shift, network latency, incomplete data‑architecture transformation, weak security models, and poor observability—offering concrete cloud‑native solutions, migration matrices, code examples, and best‑practice guidelines for successful architectural evolution.

Cloud Native · DevOps · Observability
0 likes · 12 min read
Linux Kernel Journey
Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability
0 likes · 23 min read
Alibaba Cloud Observability
Oct 20, 2025 · Cloud Native

How ‘泡姆泡姆’ Leverages Cloud‑Native Architecture for Global Low‑Latency Gaming

The multiplayer party game 泡姆泡姆 combines colorful shooting, match‑3, physics puzzles and arcade mini‑games, and uses a cloud‑native stack on Alibaba Cloud Container Service with OpenKruiseGame, KEDA‑driven auto‑scaling, multi‑region deployment, zero‑downtime updates and a three‑layer observability platform to deliver seamless low‑latency experiences worldwide.

Game Development · Observability · Scalability
0 likes · 10 min read
JavaGuide
Oct 17, 2025 · Artificial Intelligence

Alibaba Open‑Sources Spring AI Alibaba Admin: A Full‑Lifecycle AI Agent Platform

Spring AI Alibaba extends Spring AI with multi‑agent and enterprise features, but faces three engineering hurdles—inefficient prompt debugging, unguaranteed AI quality, and opaque operations—so Alibaba released Spring AI Alibaba Admin, offering prompt templating, dataset versioning, evaluator configuration, experiment management, and deep observability to streamline AI agent development and deployment.

AI agent · Dataset Versioning · Evaluator
0 likes · 8 min read
Alibaba Cloud Native
Oct 16, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Powers Data‑Centric AI Agent Development and Ops

This article outlines the industry shift toward large‑scale AI Agent deployment, identifies key engineering challenges such as prompt management, quality assessment, and observability, and presents Spring AI Alibaba Admin—a cloud‑native platform that offers prompt, dataset, evaluator, and tracing capabilities, complete with setup instructions and future roadmap.

AI agent · Java · Nacos
0 likes · 15 min read
Huawei Cloud Developer Alliance
Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability
0 likes · 17 min read
MaGe Linux Operations
Oct 14, 2025 · Cloud Native

How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale

This article explains how the cloud‑native Loki logging system combined with S3 object storage can reduce PB‑level log storage expenses by 80‑90%, while simplifying operations, improving query performance, and meeting compliance requirements through detailed architecture, configuration, deployment, and real‑world case studies.

Log Management · Loki · Observability
0 likes · 23 min read
MaGe Linux Operations
Oct 12, 2025 · Operations

How to Balance Loki Tag Design and Chunk Compression to Tame Log Floods

Learn how to design low‑cardinality Loki tags, fine‑tune Chunk compression settings, and implement best‑practice configurations, pipelines, and monitoring to prevent memory overload, improve query performance, and efficiently manage massive log volumes in cloud‑native environments.

Log Management · Loki · Observability
0 likes · 38 min read
Cognitive Technology Team
Oct 12, 2025 · Backend Development

Resilient Microservices: Practical Patterns to Keep Your Services Alive

Learn how to tame chaotic microservices with practical resilience patterns—circuit breakers, bulkheads, smart retries, timeouts with fallbacks, and event‑driven messaging—plus tool recommendations and observability tips that ensure your system stays responsive even when individual services fail.

Observability · Resilience · Retry
0 likes · 9 min read
Su San Talks Tech
Oct 10, 2025 · Operations

How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and canary (gray) deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Circuit Breaker · Deployment Strategies · Observability
0 likes · 19 min read
Linux Code Review Hub
Oct 9, 2025 · Operations

Non‑Intrusive MCP Observability with eBPF: Introducing MCPSpy

The article explains how the emerging Model Context Protocol (MCP) for AI tools lacks visibility, outlines security and monitoring challenges, compares alternative tracing methods, and presents MCPSpy—a Linux‑only eBPF‑based, non‑intrusive solution that captures MCP stdio traffic, parses JSON‑RPC messages, and outputs human‑readable or JSON logs.

AI security · Go · MCP
0 likes · 17 min read
Radish, Keep Going!
Oct 9, 2025 · Operations

Add Observability to Legacy Java Apps with OpenTelemetry Agent (Zero Code)

This guide shows how to use the OpenTelemetry Java Agent to instantly add observability—metrics, traces, and error reporting—to long‑standing legacy Java applications without modifying a single line of code, covering setup, environment configuration, health monitoring, performance tracing, and visualizing data in Grafana.

Java · Monitoring · Observability
0 likes · 7 min read
MaGe Linux Operations
Oct 7, 2025 · Operations

7 Fatal Monitoring Alert Mistakes That Keep You Up at 3 AM—and How to Fix Them

This article examines why ops engineers are repeatedly woken by false alerts, outlines seven common monitoring alert pitfalls—from over‑alerting to static thresholds—and provides practical solutions such as golden‑signal rules, dynamic baselines, alert enrichment, routing, suppression, and continuous quality audits.

DevOps · Monitoring · Observability
0 likes · 27 min read
Architect's Guide
Oct 7, 2025 · Backend Development

Mastering Backend Architecture: From Microservices to Service Mesh and Message Queues

This article presents a comprehensive roadmap for backend architects, covering microservice fundamentals, design principles, gateway patterns, communication protocols, service registration, configuration management, observability pillars, service mesh options, and a detailed comparison of modern message‑queue technologies.

Cloud Native · Message Queue · Microservices
0 likes · 29 min read
IT Architects Alliance
Oct 6, 2025 · Cloud Native

Mastering Cloud‑Native Observability: From Metrics to Tracing

The article explains why enterprises struggle with cloud‑native observability, outlines the exponential complexity and dynamic nature of modern microservice environments, and presents a comprehensive three‑pillar approach—metrics, logging, tracing—along with practical Prometheus, OpenTelemetry, and sidecar configurations, storage choices, sampling, alerting, cost‑control, team upskilling, and future trends such as AIOps and eBPF.

Cloud Native · Observability · OpenTelemetry
0 likes · 12 min read
MaGe Linux Operations
Oct 6, 2025 · Cloud Native

Prometheus vs Cloud Provider Monitoring: Which Is the Most Cost‑Effective Choice for 2025?

This article compares open‑source Prometheus + Grafana with managed cloud monitoring services, evaluating deployment complexity, functionality, scalability, security, and total cost of ownership across small, medium, and large workloads, and provides practical decision‑making guidance for teams of different sizes and requirements.

Monitoring · Observability · Prometheus
0 likes · 56 min read
MaGe Linux Operations
Oct 5, 2025 · Operations

ELK vs EFK vs Loki: Which Log Solution Saves Money and Boosts Performance?

This in‑depth technical guide compares ELK, EFK, and Loki across cost, performance, deployment complexity, feature completeness, and suitability for small‑to‑large teams, providing real‑world case studies, decision trees, migration steps, and cost‑optimization tips to help you choose the most efficient logging stack for your organization.

EFK · ELK · Log Management
0 likes · 39 min read
IT Architects Alliance
Oct 2, 2025 · Cloud Native

Mastering Cloud‑Native Architecture: 6 Core Principles Every Engineer Should Know

This article outlines six fundamental cloud‑native architecture principles—immutable infrastructure, service mesh, observability, declarative APIs, resilient design, and shift‑left security—explaining their purpose, key practices, code examples, and how they interrelate to build scalable, reliable, and secure distributed systems.

Cloud Native · Declarative API · Observability
0 likes · 11 min read
Tech Freedom Circle
Sep 25, 2025 · Operations

RAGFlow Link Tracing: GPS‑Style Observability for LLM‑Powered Applications

The article explains why RAGFlow needs end‑to‑end link tracing, introduces OpenTelemetry’s core concepts, shows how custom tracing utilities are implemented in Python, describes the layered architecture, provides concrete Docker and YAML configurations, and offers best‑practice guidelines for performance monitoring and fault diagnosis.

Distributed Systems · LLM · Observability
0 likes · 24 min read
IT Architects Alliance
Sep 20, 2025 · Operations

Mastering Microservice Governance: Tracing, Config, and Monitoring Strategies

This article explores the three core challenges of microservice governance—distributed tracing, centralized configuration management, and comprehensive monitoring—offering practical solutions, tool comparisons, and best‑practice guidelines to help architects build reliable, observable, and maintainable systems.

Cloud Native · Configuration Management · Monitoring
0 likes · 12 min read
MaGe Linux Operations
Sep 18, 2025 · Cloud Native

Master Helm: Proven Best Practices for Kubernetes Deployments

This comprehensive guide walks you through Helm's architecture, chart structuring, template development, dependency management, production deployment strategies, security hardening, observability integration, testing, performance tuning, and enterprise governance, providing actionable examples and code snippets to help you become a Helm expert in cloud‑native environments.

Deployment · Observability · chart
0 likes · 22 min read
Ops Community
Sep 15, 2025 · Cloud Native

Master Kubernetes Log Collection: From Basics to Advanced EFK & Loki Solutions

This comprehensive guide explains why log management is critical for large Kubernetes clusters, outlines common pain points, presents full‑stack architectures, details EFK and Loki implementations with code samples, and offers performance, security, cost‑optimization, and future‑trend recommendations.

Cloud Native · EFK · Kubernetes
0 likes · 16 min read
Alibaba Cloud Developer
Sep 12, 2025 · Operations

How to Build End‑to‑End Observability for Large‑Model Applications on Alibaba Cloud

This guide explains how to design and implement a complete observability solution for large‑model AI services on Alibaba Cloud, covering architecture, core metrics, logging standards, demo code, log collection, dashboard design, alerting, monitoring tools, troubleshooting SOPs, and recovery procedures.

AI Operations · Alibaba Cloud · Observability
0 likes · 21 min read
dbaplus Community
Sep 11, 2025 · Cloud Native

Building a Scalable Kubernetes Monitoring Architecture and Alert Management

This guide presents a comprehensive, layered Kubernetes monitoring architecture—including control plane, node, resource, and extension layers—detailing high‑availability Prometheus deployment, alert grouping strategies, custom CRD metrics, visualization dashboards, and practical best‑practice recommendations for reliable observability in cloud‑native environments.

Cloud Native · Kubernetes · Observability
0 likes · 11 min read
Ops Community
Sep 8, 2025 · Operations

Mastering Distributed Log Architecture: From Flume to ELK and Beyond

This comprehensive guide walks you through the challenges of large‑scale log collection, real‑time processing, storage optimization, and visualization, detailing practical configurations for Flume, Logstash, Elasticsearch, Kibana, Filebeat, Kafka, Kubernetes, and future AIOps integrations to build a reliable, cost‑effective distributed logging system.

ELK · Flume · Kafka
0 likes · 24 min read
Tech Freedom Circle
Sep 4, 2025 · Backend Development

How to Solve ES Latency in MySQL‑Canal Sync and Indexing Scenarios?

The article dissects the interview question about ES latency in a MySQL‑Canal‑to‑Elasticsearch pipeline, explains the root causes across four system layers, and presents a comprehensive four‑layer optimization, end‑to‑end observability, routing‑based degradation, and a Java‑based LatencyProbe component to measure and control delay.

Canal · Elasticsearch · Kafka
0 likes · 17 min read
Java One
Sep 3, 2025 · Operations

How to Install, Configure, and Run Prometheus: A Step‑by‑Step Guide

This guide walks you through installing Prometheus via binary download, configuring global scrape settings and job definitions, running the server with command‑line options, and using the web UI and PromQL to verify target health and query metrics, illustrated with screenshots and example queries.

Installation · Observability · PromQL
0 likes · 6 min read
Architect's Guide
Sep 1, 2025 · Operations

How Does Distributed Link Tracing Work? Inside SkyWalking’s Architecture

This article explains the concept of distributed link tracing, its principles, metrics, and implementation details—including monolithic and microservice approaches, OpenTracing standards, and how SkyWalking solves challenges like automatic span collection, context propagation, unique trace IDs, and sampling performance.

Microservices · Observability · OpenTracing
0 likes · 12 min read
php Courses
Aug 29, 2025 · Operations

How to Build a Real‑Time PHP Log Event Pipeline for Instant Insights

Learn how to transform PHP logs into real‑time, structured events by implementing a log event pipeline that includes JSON logging, lightweight collectors like Filebeat, streaming platforms such as Kafka or Flink, enrichment, and visualization with Grafana, enabling instant monitoring, alerting, and data‑driven decisions.

Flink · Grafana · Kafka
0 likes · 7 min read
Nightwalker Tech
Aug 28, 2025 · Operations

How to Diagnose and Fix E‑commerce Order Failures with Observability, APM, and Distributed Tracing

This article explains the hierarchical relationship between APM, distributed tracing, and observability, walks through a real Double‑11 e‑commerce incident, and demonstrates how a well‑designed observability stack can pinpoint the root cause, apply emergency fixes, and restore system performance within minutes.

APM · Fault Diagnosis · Microservices
0 likes · 16 min read
Xiaohongshu Tech REDtech
Aug 27, 2025 · Databases

How RedHub Revolutionizes Database Access for Billion‑User Scale

RedHub is a next‑generation database proxy built by Xiaohongshu that unifies fragmented access methods, leverages PolarDB‑X for distributed SQL execution, and delivers high‑performance, highly available, and easily observable database connectivity, enabling seamless migration and advanced features for massive‑scale workloads.

Database Proxy · Distributed SQL · Observability
0 likes · 15 min read
Su San Talks Tech
Aug 27, 2025 · Backend Development

Master Distributed Tracing with SkyWalking: Principles, Architecture & Practices

This article explains the fundamentals of distributed tracing in microservice architectures, details the OpenTracing standard, examines SkyWalking’s design, sampling strategies, context propagation, and plugin development, and shares practical implementation experiences and performance comparisons, helping engineers choose and integrate effective tracing solutions.

Java · Microservices · Observability
0 likes · 19 min read
Tencent Cloud Developer
Aug 26, 2025 · Artificial Intelligence

Building a Scalable, Observable Recommendation Scheduling Engine from Scratch

This article explains how recommendation systems work, distinguishes online services from offline computation, outlines a typical recommendation flow, and presents a three‑stage evolution (1.0, 2.0, 3.0) with design principles for stability, observability, and efficiency, culminating in a DAG‑based orchestration and traceable execution.

AI · Observability · Scalability
0 likes · 13 min read
Wuming AI
Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI Agents · Infrastructure · LLM
0 likes · 5 min read
Kuaishou Tech
Aug 20, 2025 · Frontend Development

How AI Is Transforming Frontend Development: Highlights from Kuaishou’s Tech Salon

The Kuaishou AI‑driven Frontend Technology Evolution salon gathered over 300 engineers and 46,000 online viewers to showcase how AI is reshaping large‑scale front‑end development across business, R&D, and infrastructure, with deep dives into AI‑native platforms, AIDevOps, intelligent agents, AI‑powered D2C, and observability.

AI · AIDevOps · Agent
0 likes · 11 min read
dbaplus Community
Aug 19, 2025 · Operations

Avoid These 10 System Architecture Sins That Sabotage Scaling

The article enumerates ten deadly system‑architecture mistakes—such as assuming natural scaling, treating microservices as monoliths, ignoring eventual consistency, over‑relying on a single database, lacking observability, over‑designing, mixing stateful logic, skipping chaos testing, underestimating third‑party risk, and ignoring human cost—providing concrete code examples, diagrams, and actionable lessons to prevent costly failures at scale.

Microservices · Observability · Performance
0 likes · 10 min read
Didi Tech
Aug 7, 2025 · Cloud Native

How HUATUO Revolutionizes Cloud‑Native Observability with Zero‑Impact BPF Tracing

HUATUO, Didi's open‑source cloud‑native observability project, leverages BPF‑based low‑overhead kernel tracing, unified metric and event frameworks, automatic flame‑graph generation, and seamless integration with Prometheus, Grafana and Elasticsearch to provide panoramic, zero‑intrusive monitoring and continuous performance profiling for complex production environments.

BPF · Cloud Native · Distributed Systems
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Aug 6, 2025 · Operations

How Alibaba Cloud’s Serverless Elasticsearch Powers Data‑Driven Operations

Alibaba Cloud’s Serverless Elasticsearch service, combined with the SREWorks data‑driven operations platform, offers a cloud‑native, real‑time search and analytics engine that integrates metric and log collection, cost management, and health monitoring to enhance scalability, performance, and operational efficiency for enterprise applications.

Cloud Native · DataOps · Elasticsearch
0 likes · 11 min read
StarRocks
Aug 6, 2025 · Databases

How Qunar Migrated to StarRocks: Architecture, Performance Gains & Best Practices

This article details Qunar's transition to StarRocks as a unified OLAP engine, covering the business background, engine evaluation, architecture redesign, observability, high‑availability strategies, query‑performance optimizations, real‑world application cases, community contributions, and future plans.

Data Platform · Migration · OLAP
0 likes · 21 min read
DevOps Operations Practice
Jul 22, 2025 · Operations

Top 7 DevOps Best Practices to Accelerate Delivery and Boost Reliability

These seven essential DevOps best practices—from cultural transformation and full automation to continuous integration, observability, security, cloud-native microservices, and performance optimization—guide teams in accelerating software delivery, enhancing quality, ensuring reliability, and reducing costs through collaborative, automated, and measurable processes.

Cloud Native · DevOps · Observability
0 likes · 4 min read
Alibaba Cloud Native
Jul 18, 2025 · Artificial Intelligence

How AI Agent Architecture Is Evolving to Redefine Software Engineering

The article outlines the rapid evolution of AI Agent technology stacks, detailing multi‑dimensional development across perception, decision, memory, and tool integration, while highlighting cloud‑native deployment models, observability challenges, and the open‑source LoongSuite suite that provides high‑performance, low‑cost monitoring for AI workloads.

AI agent · LoongSuite · Observability
0 likes · 19 min read
Ops Development & AI Practice
Jul 12, 2025 · Cloud Native

Mastering Observability: A Deep Dive into OpenTelemetry’s Architecture

This article explains OpenTelemetry’s purpose, three‑layer architecture (instrumentation, collector, backend), practical Go instrumentation code, and how the collector processes and exports telemetry to both open‑source and SaaS backends, helping developers avoid vendor lock‑in and achieve unified observability.

Collector · Instrumentation · Observability
0 likes · 9 min read
Java Architect Essentials
Jul 6, 2025 · Operations

How Logback, MDC, and ELK Can Rescue Your Nighttime Log Chaos

This article explains how chaotic, multi‑framework logging in Java microservices leads to debugging nightmares, and demonstrates a three‑step solution—standardizing on Logback, adding traceable MDC identifiers, and visualizing logs with ELK—to achieve unified log formats, sensitive data masking, and dramatically faster issue resolution.

ELK · MDC · Observability
0 likes · 10 min read
Alibaba Cloud Native
Jul 1, 2025 · Cloud Native

How Alibaba Cloud Function Compute Uses OpenTelemetry for Full‑Stack Tracing

The article explains how Alibaba Cloud Function Compute upgraded its tracing capabilities from Jaeger 2.0 to the OpenTelemetry W3C standard, delivering end‑to‑end observability, transparent cold‑start analysis, cross‑environment context propagation, dynamic sampling, and AI‑assisted debugging for serverless workloads.

Function Compute · Observability · OpenTelemetry
0 likes · 6 min read
macrozheng
Jul 1, 2025 · Operations

Best Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, Datadog & More

This article provides a comprehensive comparison of popular log management solutions—including Filebeat, Graylog, the Elastic (ELK) stack, Grafana Loki, LogDNA, Datadog, Logstash, Fluentd, and Splunk—detailing their main features, pricing models, advantages, and drawbacks to help you choose the right tool for your needs.

ELK Stack · Log Management · Observability
0 likes · 16 min read
Alibaba Cloud Observability
Jun 24, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Observability Reliable

This article examines common log‑management anti‑patterns—such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits—explains why they cause data loss or duplicate collection, and offers practical best‑practice recommendations for reliable log handling in cloud‑native environments.

Anti-Patterns · Best Practices · Observability
0 likes · 8 min read
AI Large Model Application Practice
Jun 23, 2025 · Databases

How Google’s MCP Toolbox Simplifies Enterprise Database Access for LLM Agents

This guide explains Google’s open‑source MCP Toolbox for Databases, covering its core concepts, installation, configuration, two usage modes (native SDK and MCP), example LangGraph agent integration, security features, observability, and practical code snippets for building reliable LLM‑driven database tools.

LLM agents · MCP Toolbox · Observability
0 likes · 11 min read
Tencent Technical Engineering
Jun 20, 2025 · Artificial Intelligence

Mastering AI Agents: Core Concepts, Protocols, and Golang Frameworks for Multi‑Agent Collaboration

This comprehensive article explores the evolution of AI agents, explains key protocols like MCP and A2A, compares reasoning frameworks such as CoT, ReAct, and Plan‑and‑Execute, and demonstrates how Golang frameworks Eino and tRPC‑A2A‑Go enable elegant development, orchestration, and observability of complex multi‑agent systems with practical code examples and visual diagrams.

A2A · AI agent · Eino
0 likes · 55 min read
Alibaba Cloud Developer
Jun 17, 2025 · Artificial Intelligence

Why AI Agent Engineering Is the Missing Link to Scalable, Usable AI

This article dissects AI Agent engineering into product and technical dimensions, explaining how demand modeling, UI/UX design, prompt engineering, multi‑agent architecture, feedback loops, security, and observability together determine whether an AI assistant is usable, reliable, and ready for large‑scale deployment.

AI agentEngineeringObservability
0 likes · 22 min read
Why AI Agent Engineering Is the Missing Link to Scalable, Usable AI
Alibaba Cloud Native
Jun 12, 2025 · Artificial Intelligence

Why AI Agent Engineering Matters: From Product Design to Technical Architecture

This article breaks down AI agent engineering into product and technical engineering, explains how demand modeling, UI/UX design, prompt engineering, multi‑agent coordination, and observability combine to make AI agents usable, scalable, and trustworthy, and shows concrete examples and implementation patterns.

AI · Observability · agent engineering
0 likes · 23 min read
Liangxu Linux
Jun 10, 2025 · Cloud Native

Why Loki Is the Ideal Cloud‑Native Log Aggregator for Prometheus & Grafana

Loki, an open‑source log aggregation system from Grafana Labs, integrates tightly with Prometheus and Grafana, stores logs efficiently in object storage, and uses a simple label‑based model to provide cost‑effective, high‑performance logging for cloud‑native environments. The article outlines its architecture, usage, configuration, advantages, limitations, and retention policies.

Cloud Native · Grafana · Loki
0 likes · 10 min read
JakartaEE China Community
Jun 9, 2025 · Cloud Native

How to Choose the Right Cloud‑Native Microservice Framework (MicroProfile vs Spring)

This article explains why cloud‑native microservices are beneficial, defines their key characteristics, compares the MicroProfile and Spring frameworks, and provides detailed code examples for REST APIs, configuration, fault tolerance, security, health checks, metrics, and distributed tracing to help developers select the most suitable technology stack.

Cloud Native · Kubernetes · MicroProfile
0 likes · 26 min read
JavaEdge
Jun 5, 2025 · Artificial Intelligence

How Amazon’s Strands Agents SDK Simplifies Building AI Agents

Amazon’s newly open‑sourced Strands Agents SDK lets developers create AI agents with minimal code by defining prompts, tools, and models, offering a lightweight, production‑ready framework that supports multiple model providers, observability, multi‑agent collaboration, and extensible tooling via dedicated packages.

AI Agents · Amazon · LLM
0 likes · 7 min read
Java Architecture Diary
May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Micrometer · Monitoring · Observability
0 likes · 12 min read
Programmer DD
May 21, 2025 · Artificial Intelligence

What’s New in Spring AI 1.0 GA? A Deep Dive into Java AI Features

Spring AI 1.0 GA introduces a comprehensive suite of AI capabilities for Java developers, including a ChatClient supporting 20 models, vector‑store integrations, RAG pipelines, advanced chat memory, @Tool function calling, model evaluation, observability, Model Context Protocol, and autonomous agents, with examples for major cloud providers.

AI models · Java · MCP
0 likes · 6 min read
Alibaba Cloud Observability
May 19, 2025 · Information Security

How Tool‑Poisoning Attacks Exploit MCP and What to Do About It

This article analyzes the security risks of the Model Context Protocol (MCP), demonstrates a tool‑poisoning attack that steals private keys via malicious tool descriptions, explores client‑side and server‑side threat vectors, and presents observability‑based mitigation using eBPF and LoongCollector.

AI model security · MCP · Observability
0 likes · 23 min read
Alibaba Cloud Observability
May 19, 2025 · Cloud Native

How LoongCollector Transforms Log Collection with High‑Performance Pipelines

LoongCollector, the 2025 evolution of iLogtail, introduces a fully redesigned pipeline architecture, hot‑reload isolation, significant CPU and memory reductions, and advanced monitoring, delivering up to 80% higher log‑collection throughput for cloud‑native environments while ensuring seamless upgrades and multi‑region fault tolerance.

Observability · log collection · pipeline
0 likes · 14 min read
Alibaba Cloud Developer
May 16, 2025 · Artificial Intelligence

Designing Robust MCP Servers for Alibaba Cloud Observability 2.0 – Lessons & Best Practices

This article explains the Model Context Protocol (MCP), its components, and how to integrate MCP servers with Alibaba Cloud Observability 2.0, offering practical design experiences, tool simplification tips, default parameter strategies, output size control, and future AI‑driven observability insights.

LLM · MCP · Observability
0 likes · 17 min read
dbaplus Community
May 11, 2025 · Operations

Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

Golden Signals · Observability · Prometheus
0 likes · 12 min read
Bilibili Tech
May 9, 2025 · Artificial Intelligence

How an AI Gateway Scales LLM Services: Architecture, Auth, Quotas, and Load Balancing

This article explains the design of an AI gateway that centralizes LLM access, detailing its background, overall architecture, authentication, quota management, multi‑model routing, load‑balancing strategies, multi‑tenant isolation, observability features, and the supported API protocols for enterprise integration.

AI gateway · Authentication · LLM
0 likes · 17 min read
Efficient Ops
May 7, 2025 · Operations

Why Choose SigNoz for Open‑Source Observability? A Deep Dive

This article introduces SigNoz, a self‑hosted open‑source observability platform that unifies metrics, logs, and traces, outlines its core capabilities, shows how to install it with Docker, and compares its resource efficiency to commercial solutions like DataDog and Elastic.

Metrics · Observability · OpenTelemetry
0 likes · 4 min read
macrozheng
May 7, 2025 · Backend Development

What’s New in Spring Boot 3.5? 13 Must‑Know Features for Java Backend Developers

Spring Boot 3.5 introduces a suite of enhancements—including task decorator support, the Vibur connection pool, SSL health metrics, flexible configuration loading, automatic Trace‑ID headers, richer Actuator capabilities, functional programming hooks, and many more—each explained with code examples and practical usage tips for modern Java backend development.

Backend Development · DevOps · Microservices
0 likes · 10 min read
Java Architecture Diary
May 6, 2025 · Backend Development

Spring Boot 3.5 Release: Top 13 New Features You Must Know

Spring Boot 3.5 introduces a suite of powerful enhancements—including task decorator support, a new Vibur connection pool, SSL monitoring, flexible environment variable loading, Actuator-triggered Quartz jobs, automatic Trace ID headers, structured log customization, functional routing insights, expanded SSL client support, OpenTelemetry upgrades, Spring Batch tweaks, OAuth 2.0 JWT profiling, and functional bean registration—providing developers with richer capabilities for modern Java backend applications.

Backend Development · Observability · Spring Boot
0 likes · 11 min read
Linux Kernel Journey
May 5, 2025 · Operations

Reflections on the 3rd eBPF Developer Conference: Harnessing eBPF for AI

The article recaps the 3rd eBPF Developer Conference in Xi'an, highlighting talks on BPF‑on‑MPTCP, system‑wide PGO, bperf, autonomous‑driving use cases, and AI‑driven observability, while sharing the author's insights on continuous profiling, SysOM, and future challenges of scaling eBPF with large models.

AI · Linux · Observability
0 likes · 10 min read
Efficient Ops
Apr 29, 2025 · Operations

Master Linux Performance: Essential Monitoring Tools & Commands

This guide compiles the most important Linux performance analysis utilities—such as vmstat, iostat, dstat, iotop, pidstat, top, htop, mpstat, netstat, ps, strace, uptime, lsof, and perf—explaining their usage, output fields, and how they fit into a comprehensive system observability workflow.

Linux · Observability · System Administration
0 likes · 15 min read
Efficient Ops
Apr 25, 2025 · Operations

How Changan Auto Earned Top‑Tier DevOps Certification with a Full‑Link Observability Platform

Changan Automobile’s full‑link observability platform passed both ITU DevOps international and domestic standards assessments, showcasing its advanced monitoring capabilities, improved system stability, and strategic role in the company’s digital transformation, while the interview reveals implementation challenges, benefits, and future AI‑driven enhancements.

DevOps · Full‑Link Monitoring · Observability
0 likes · 21 min read
Alibaba Cloud Native
Apr 23, 2025 · Cloud Native

Diagnosing Slow Deployments in Alibaba Cloud SAE: A Visualized, Step‑by‑Step Guide

This article analyzes the common pain points of Alibaba Cloud Serverless App Engine (SAE) deployments—slow release times, opaque status details, and black‑box instance startup—then presents a visualized, observable, and explainable solution that pinpoints bottlenecks, offers concrete optimizations, and demonstrates the resulting performance improvements.

Alibaba Cloud · Deployment Optimization · Observability
0 likes · 11 min read
Baidu Geek Talk
Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining, cutting degradation cases by ~40% and shifting operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Cloud Native · Digital Immunity
0 likes · 14 min read
Linux Kernel Journey
Apr 23, 2025 · Industry Insights

Highlights from the 3rd eBPF Developer Conference: A Technical Recap

The 3rd eBPF Developer Conference, held on April 19, 2025 at Xi'an University of Posts and Telecommunications, featured 36 expert talks on eBPF advancements, network and security innovations, observability, and performance optimization, along with a vibrant project marketplace and student projects. Video and PPT resources are available to the community.

Linux kernel · Observability · Open Source
0 likes · 7 min read
dbaplus Community
Apr 22, 2025 · Backend Development

Explore Elasticsearch 9.0: Performance Boosts, AI Features & Security Upgrades

Elasticsearch 9.0, released on April 15, 2025, builds on Lucene 10.1.0 to deliver major performance gains and introduces Better Binary Quantization, Elastic Distributions of OpenTelemetry, LLM observability, AI‑driven attack discovery, and enhanced ES|QL. It is available via Elastic Cloud, and the article includes deployment tips and examples.

AI · Elasticsearch · Observability
0 likes · 7 min read
21CTO
Apr 9, 2025 · Operations

9 Must‑Have Container Monitoring Tools and Best Practices for Modern Cloud‑Native Environments

This article reviews nine practical container‑monitoring solutions—from Last9 and Prometheus to Dynatrace and Elastic Observability—detailing their key features, pricing, and why developers prefer them, and then offers comprehensive best‑practice guidance for metrics, tagging, alerts, and advanced observability strategies in Kubernetes‑driven cloud‑native deployments.

Cloud Native · DevOps · Kubernetes
0 likes · 25 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability
0 likes · 13 min read
ByteDance Cloud Native
Apr 3, 2025 · Operations

How to Seamlessly Integrate CloudWeGo with APMPlus for Full‑Stack Observability

This article explains the challenges of observability in distributed microservice and LLM architectures, introduces CloudWeGo and APMPlus, and provides step‑by‑step integration guides for Kitex, Hertz, and Eino frameworks, including code samples, data reporting methods, and advanced monitoring features such as RED metrics, LLM‑specific indicators, service topology, and future roadmap.

APM · APMPlus · CloudWeGo
0 likes · 13 min read
Volcano Engine Developer Services
Apr 1, 2025 · Artificial Intelligence

Taming High Cardinality in AI Model & Autonomous Driving Monitoring with Prometheus

This article explores how high cardinality in Prometheus metrics impacts AI large‑model and autonomous‑driving observability, explains the underlying concepts, outlines the performance and cost challenges, and presents practical design, collection, and query‑side solutions—including metric modeling, pre‑aggregation, and remote‑read pushdown—to keep monitoring efficient and scalable.

AI Monitoring · Autonomous Driving · Cardinality
0 likes · 12 min read