Tagged articles

33 articles


Dec 17, 2025 · Operations

Build a Production‑Ready Prometheus HA Architecture with Federation and Remote Storage

Learn how to design and implement a robust, production‑grade Prometheus high‑availability solution using a federated global cluster, multiple business‑level instances, remote storage with Thanos or VictoriaMetrics, Docker‑Compose deployment, health‑check scripts, performance metrics, alerting rules, and best‑practice operational guidelines.

Docker-Compose · Federation · Remote Storage

17 min read

MaGe Linux Operations

Jul 22, 2025 · Operations

Build a Production-Ready Prometheus HA Architecture with Federation & Remote Storage

This guide walks through designing and implementing a robust, enterprise‑grade Prometheus high‑availability solution using federation clusters, remote storage back‑ends, Docker‑Compose deployments, health‑check scripts, and best‑practice recommendations for monitoring, security, and performance.

Docker-Compose · Federation · Remote Storage

17 min read

Kuaishou Tech

Oct 31, 2024 · Cloud Native

Stateful Service Cloud‑Native Practices: Kuaishou’s Redis on Kubernetes

This article examines the challenges and benefits of running stateful services such as Redis on Kubernetes, presents Kuaishou’s practical experience with cloud‑native migration, evaluates risks and performance impacts, and details the custom workloads, operators, federation and KubeBlocks solutions that enable large‑scale, reliable stateful service orchestration.

Cloud Native · Federation · KubeBlocks

12 min read

DataFunSummit

Oct 17, 2024 · Big Data

Waggle Dance Based Metadata Solution at Tongcheng Travel: Architecture, Migration Strategies, and Future Outlook

This article presents Tongcheng Travel's metadata solution built on the open‑source Waggle Dance project, detailing the three‑layer architecture, challenges of a monolithic Hive Metastore, evaluated migration plans, federation implementation, migration workflow, and future directions for unified metadata governance.

Data Migration · Federation · Hive Metastore

11 min read

Ops Development Stories

Jun 28, 2024 · Cloud Native

Multi-Cluster Kubernetes: Benefits, Federation, Karmada, and Practical Tips

This article explains why organizations adopt multi‑cluster Kubernetes for high availability, hybrid‑cloud scaling, and fault isolation, outlines the preparatory steps, compares Federation v1 and v2, introduces Karmada as a CNCF project, and shares practical non‑federated deployment, monitoring, traffic management, and migration techniques with code examples.

Cloud Native · DevOps · Federation

18 min read

Alibaba Cloud Native

Apr 8, 2024 · Cloud Native

How to Build a Global View for Multiple Prometheus Instances – Community and Alibaba Cloud Solutions

This article explains why a global view is needed when Prometheus metrics are scattered across many instances, compares community approaches such as Federation, Thanos, and Remote Write, and details Alibaba Cloud's Global Aggregation Instance and Remote Write solutions with configuration examples and a real‑world case study.

Federation · Global View · Monitoring

25 min read

DevOps Operations Practice

Mar 14, 2024 · Operations

Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

Federation · Monitoring · Prometheus

6 min read

Volcano Engine Developer Services

Jul 7, 2023 · Cloud Native

How KubeAdmiral Redefines Multi-Cluster Kubernetes Federation for Scale and Efficiency

Since Kubernetes became the de‑facto standard, ByteDance faced scaling limits with single‑cluster setups, prompting the adoption of KubeFed V2 and later the development of KubeAdmiral, a next‑generation multi‑cluster federation system that enhances scheduling, resource efficiency, native API support, and dynamic scaling across clouds.

Federation · KubeAdmiral · Kubernetes

15 min read

Cloud Native Technology Community

Feb 28, 2023 · Cloud Native

Mastering Kubernetes Federation: Step‑by‑Step Installation and Multi‑Cluster Management

This guide explains the purpose of Kubernetes Federation, walks through installing Helm, the kubefed control plane, and kubefedctl, then details how to join clusters, enable resource federation, and deploy sample workloads, and provides a handy command reference for multi‑cluster operations.

Federation · KubeFed · Kubernetes

9 min read

Efficient Ops

Aug 28, 2022 · Cloud Native

Mastering Kubernetes Federation: Install, Join Clusters, and Sync Resources

This guide explains the purpose of Kubernetes Federation, its benefits for multi‑cluster management, step‑by‑step installation using Helm and kubefedctl, how to join and unjoin clusters, enable resource federation, and provides a cheat sheet of common commands for reliable cross‑cluster deployments.

Federation · KubeFed · Kubernetes

8 min read

MaGe Linux Operations

Jul 9, 2022 · Cloud Native

Mastering Multi‑Cluster Management: From Kubernetes Federation v1/v2 to Karmada

This article explains why Kubernetes federation is needed for managing multiple clusters, compares the deprecated Federation v1 with the improved v2 architecture, and introduces Karmada as a modern multi‑cloud orchestration solution, complete with configuration examples, scheduling strategies, and CRD definitions.

CRD · Federation · Karmada

16 min read

Architect's Guide

Jun 26, 2022 · Backend Development

Building a Million‑Message‑Per‑Second RabbitMQ Service: Architecture, Scaling, and High Availability

This article explains how to design and operate a RabbitMQ cluster capable of handling millions of messages per second by describing RabbitMQ fundamentals, Google‑scale deployment, sharding and consistent‑hash plugins, high‑availability mirroring, federation, and integration with Spring AMQP, while also covering practical deployment scenarios and performance trade‑offs.

Federation · Message Queue · RabbitMQ

23 min read

ITPUB

May 7, 2022 · Big Data

How eBay Scaled HDFS to 800 PB Using Federation and Router‑Based Architecture

This article details eBay's evolution of its massive HDFS storage—from a single‑cluster design to ViewFS Federation, then to Router‑Based Federation—highlighting the performance bottlenecks, optimization techniques, FastCopy integration, and future plans for further scaling and automation.

Federation · HDFS · Performance Optimization

11 min read

IT Services Circle

Apr 3, 2022 · Cloud Native

Understanding Kubernetes Federation: kubefed and Karmada Multi‑Cluster Management

This article explains why Kubernetes single‑cluster scalability is limited to about 5,000 nodes, introduces the concept of multi‑cluster federation, compares the legacy kubefed project with the actively maintained Karmada solution, and shows how policies and replica‑scheduling enable flexible cross‑AZ deployments and failover.

Cloud Native · Cluster Management · Federation

13 min read

IT Architects Alliance

Jan 14, 2022 · Operations

Scaling RabbitMQ to Million‑Message Throughput: Architecture, Plugins, and High‑Availability Practices

This article explains how to horizontally scale RabbitMQ clusters, use sharding and federation plugins, configure mirror queues and other high‑availability features, and apply practical patterns such as confirms, retries, and delayed delivery to achieve million‑level message throughput in production environments.

Federation · Message Queue · RabbitMQ

23 min read

Architecture Digest

Jan 13, 2022 · Backend Development

Scaling RabbitMQ to Million‑Message Throughput: Architecture, Sharding, Federation, and High‑Availability Practices

This article explains how to horizontally scale RabbitMQ clusters to handle millions of messages per second by leveraging cluster modes, mirror queues, sharding plugins, consistent‑hash exchanges, federation, and high‑availability configurations, while also covering practical scenarios such as retries, delayed tasks, and Spring AMQP integration.

Federation · Message Queue · RabbitMQ

22 min read

Open Source Linux

Jan 5, 2022 · Operations

Designing Scalable High‑Availability Prometheus Architectures

This article explains how to build both small‑scale and large‑scale high‑availability Prometheus setups using local and remote storage, federation, keepalived, and PostgreSQL + TimescaleDB adapters to ensure reliable monitoring and alerting across growing infrastructures.

Federation · Prometheus · Remote Storage

6 min read

MaGe Linux Operations

Dec 1, 2021 · Operations

Scalable High‑Availability Prometheus: Small‑Scale to Massive Deployments

This article explains how Prometheus’s local storage limits scalability and how Remote Storage, federation, and high‑availability setups—using dual instances, keepalived, and adapters with PostgreSQL + TimescaleDB—can overcome data persistence and performance challenges for both small‑scale and large‑scale monitoring environments.

Federation · Prometheus · Remote Storage

5 min read

Ops Development Stories

Nov 8, 2021 · Cloud Native

How to Manually Deploy Prometheus Federation on Kubernetes – Step‑by‑Step Guide

This guide walks through manually deploying a Prometheus federation on Kubernetes, covering environment setup with sealos, creating storage classes, persistent volumes, ConfigMaps, StatefulSets, services, applying manifests, and verifying the federation to aggregate metrics across multiple clusters.

Cloud Native · Federation · Kubernetes

10 min read

Qingyun Technology Community

Sep 15, 2021 · Cloud Native

Why Enterprises Embrace Hybrid Multi‑Cloud and Kubernetes Multi‑Cluster Strategies

Enterprises adopt hybrid multi‑cloud architectures driven by high‑profile security incidents and regulatory demands, leveraging Kubernetes multi‑cluster capabilities such as disaster recovery, latency reduction, isolation, fault containment, and vendor‑lock‑in avoidance, with solutions like Federation v1/v2 and KubeSphere illustrated through real‑world case studies.

Federation · KubeSphere · Kubernetes

13 min read

Efficient Ops

Jul 25, 2021 · Cloud Native

Why Enterprises Need Multi‑Cluster Kubernetes and How to Implement It

This article explains why modern enterprises adopt multiple Kubernetes clusters, covering single‑cluster capacity limits, hybrid‑cloud requirements, fault‑tolerance concerns, the benefits of multi‑cluster setups, architectural models, and community‑driven implementation patterns.

Cloud Native · Federation · Multi-Cluster

9 min read

Programmer DD

Jun 13, 2021 · Operations

How to Build a High‑Availability Prometheus Setup Using Federation and Multi‑Remote‑Read

This article examines common misuse of Prometheus federation, explains its limitations, and presents a pure‑Prometheus solution using multi_remote_read to achieve high‑availability monitoring, including configuration examples, code analysis, and best‑practice recommendations for proper data aggregation and query merging.

Federation · Prometheus · multi_remote_read

11 min read

Programmer DD

Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Federation · HDFS

17 min read

Big Data Technology Architecture

Mar 11, 2021 · Big Data

Challenges and Optimizations of Hive MetaStore at Kuaishou

This article details how Kuaishou tackled performance, scalability, and stability challenges of Hive MetaStore by introducing a BeaconServer hook architecture, read‑write separation, API refinements, traffic control, and federation designs, resulting in significant query efficiency and service reliability improvements.

Federation · Hive · Read-Write Separation

14 min read

Cloud Native Technology Community

Mar 30, 2020 · Cloud Native

Building a Cloud‑Native Large‑Scale Distributed Monitoring System with Prometheus

This article explains how to design and implement a cloud‑native, large‑scale distributed monitoring system using Prometheus, covering its limitations, service‑level sharding, centralized storage, federation, and high‑availability strategies to overcome scaling challenges in Kubernetes environments.

Cloud Native · Federation · Prometheus

12 min read

DataFunTalk

Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation

18 min read

Alibaba Cloud Native

Aug 6, 2019 · Cloud Native

Why Multi-Cluster Architecture Is the Future of Cloud‑Native Applications

This article explains the rise of multi‑cluster designs, outlines three common scenarios—cloud burst, disaster recovery, and active‑active—examines the complexities of application delivery across clusters, and details how Kubernetes and Alibaba Cloud’s ACK implement unified APIs, tunnel mechanisms, and high‑availability to enable true multi‑cloud operations.

ACK · Cluster Tunnel · Federation

19 min read

MaGe Linux Operations

Jul 5, 2019 · Cloud Native

Building a Scalable, High‑Availability Kubernetes Monitoring System with Prometheus and OpenTSDB

This article details Xiaomi's end‑to‑end, highly available Kubernetes monitoring solution that combines Prometheus, OpenTSDB, and Falcon to handle massive dynamic metrics, ensure persistent storage, and support seamless scaling across multiple clusters.

Cloud Native · Federation · Kubernetes

16 min read

Beike Product & Technology

Jun 28, 2019 · Big Data

Hadoop NameNode Performance Bottlenecks and Solutions: Federation, ViewFS, FastCopy, Balance & Mover

This article analyzes the performance and stability bottlenecks of a Hadoop 2.7.3 NameNode caused by memory limits, RPC QPS, and long restart times, and presents a comprehensive solution stack—including HDFS federation, ViewFS, FastCopy, and tuned Balance/Mover tools—to improve scalability and reduce downtime.

Balance · FastCopy · Federation

11 min read

Qunar Tech Salon

May 16, 2019 · Big Data

Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar

This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that reduces multi‑petabyte migration time from hours to a few.

Big Data · Data Migration · FastCopy

8 min read

dbaplus Community

May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS

13 min read

Meituan Technology Team

Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation

22 min read

Art of Distributed System Architecture Design

Nov 20, 2015 · Big Data

Design and Implementation of Alibaba Cloud's Cross‑Data‑Center Hadoop Cluster

In 2013 Alibaba Cloud faced full rack capacity in a single IDC, prompting the development of a multi‑NameNode, cross‑data‑center Hadoop solution that overcomes NameNode scalability, inter‑site bandwidth limits, data placement, job scheduling, massive data migration, and user transparency challenges.

Cross‑Data‑Center · Federation · Hadoop

14 min read