HBase Cloud Migration: Architecture, Challenges, and Solutions
This technical report details the background, architecture, construction, core issues, migration plans, and future roadmap of moving 58's HBase clusters to a cloud‑native environment, highlighting cost reduction, operational automation, and performance optimizations.
1. Background Introduction
58's big data team has achieved good results with offline mixed‑mode projects and plans to gradually integrate big‑data components with cloud technology. In the first half of 2024, the team intends to cloud‑ify the HBase clusters to further lower business costs and reduce operational maintenance overhead.
1.1 HBase Cloud Background
In 58's business scenarios, HBase plays a crucial role by storing posts, user information, and other core data that are periodically synchronized for random queries and deep analytics. It also supports batch queries for user profiling, trend analysis, recommendation, search, and time‑series storage. The overall architecture is shown below.
With the continuous growth of business, several problems have been observed in the long‑term operation of the HBase cluster:
Resource waste due to uniform specifications across diverse business types.
Complex group management leading to high operational cost and poor scalability.
Complicated version upgrades increasing iteration and maintenance costs.
To address these issues, a cloud‑native deployment with containerization is considered to improve resource utilization and reduce user costs. The overall cloud migration goals are:
Fully cloud‑ify the existing physical HBase clusters.
Achieve a 30% cost reduction while maintaining the same read/write performance as the physical clusters.
Integrate development and operations to converge overall operational capabilities and lower maintenance costs, supporting efficient iteration.
2. HBase Cloud Cluster Construction
2.1 HBase Architecture Introduction
HBase is a highly reliable, high‑performance, column‑oriented, scalable distributed KV storage system that can build large‑scale structured storage clusters on commodity servers. Its architecture follows the classic Master‑Slave model: a Master node manages the cluster while many RegionServer nodes handle user reads and writes. All data is ultimately stored in HDFS, and a ZooKeeper ensemble assists the Master with cluster management. The overall architecture is illustrated below.
Key features: the Master maintains metadata and cluster management, while RegionServers store data and can be horizontally scaled. RegionServers support multi‑tenant rsgroup partitioning and robust permission management, making the native HBase architecture well‑suited for cloud platforms.
2.2 58 Cloud Platform Introduction
The 58 Cloud Platform provides a one‑stop cloud solution integrating common cloud services, service governance, middleware, and storage. It consolidates all foundational capabilities and is divided into three major dimensions: cluster management, middleware, and storage.
The platform offers image management, release publishing, configuration management, container management, automatic scaling, group management, rollback, permission quotas, resource pool management, log management, and monitoring. It fully supports cloud‑native capabilities, including micro‑service architecture, Docker containers, Kubernetes scheduling and orchestration, declarative APIs, and immutable infrastructure. HBase cloud migration can leverage these capabilities for automated deployment, upgrades, and DevOps integration, while exploiting elastic resources for cost optimization.
2.3 Cluster Construction
The overall HBase cloud cluster construction leverages HBase's inherent scalability and high availability, combined with the 58 Cloud Platform's integrated management, to achieve rapid cloud integration and cost optimization while ensuring business stability. Core work includes adapting to a newer Hadoop version, integrating 58 Cloud Platform capabilities, and developing stability, operations, and resource‑optimization features.
Adopt independent Master cloud cluster + independent RegionServer cloud clusters with multiple cloud groups.
Configure different container specifications per business type to maximize resource utilization.
Adapt to Hadoop 3, unify underlying storage architecture, improve service stability, and reduce operational costs.
Leverage the Cloud Platform's automated deployment, upgrade, and management capabilities to lower maintenance overhead.
Enhance operational capabilities with one‑click business onboarding, automatic scaling, table claiming, and handover features.
Construction 1: Evolving Cloud Cluster Architecture
Two architectures were explored. The first is a multi‑cluster design where Master and RegionServer clusters are deployed separately, each RegionServer cloud cluster mapping to a physical business group. The second is a single‑cluster multi‑group design where a single Master and RegionServer cloud cluster host multiple groups. The multi‑group architecture was chosen for its maintenance simplicity and cost efficiency.
Construction 2: Adapting Cloud Platform – Enhancing Basic Capabilities
Integrated deployment, configuration management, monitoring, alarm, fault handling, log management, and custom resource scheduling to ensure cluster stability while optimizing costs.
Construction 3: Cluster Capability Evolution
Adapt to higher‑version Hadoop clusters to improve stability and maintenance efficiency.
Enable HBase replication for online migration, ensuring transparent data migration across all business scenarios.
Upgrade routing service architecture to be cloud‑compatible and enhance auditing, allowing transparent migration for all callers.
Construction 4: Operations Management Adaptation and Capability Enhancement
Standardize operational processes (e.g., work‑order scaling) to improve efficiency and service safety.
Optimize onboarding with one‑click access, metadata claiming, and handover to enhance business entry efficiency and usability.
2.4 Core Issues & Solutions
(1) Frequent eviction of RegionServer containers: cloud‑native scheduling based on actual memory usage can over‑allocate physical nodes, leading to unstable memory utilization and frequent evictions of memory‑intensive RegionServers.
Solution: Adopt spec‑based scheduling where container resources are defined at creation, preventing over‑commitment and stabilizing resource usage.
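The difference between usage‑based and spec‑based admission can be sketched as a simple placement check. This is a minimal illustration, not the 58 Cloud Platform's actual scheduler; all names and figures are assumptions.

```python
def can_place_by_spec(node_capacity_gb: float, placed_specs_gb: list,
                      new_spec_gb: float) -> bool:
    """Admit a container only if the sum of *declared* specs fits the node.

    Usage-based scheduling admits containers while current RSS is still low,
    which over-commits memory once RegionServer heaps fill up; reserving by
    declared spec keeps the node's worst case within physical capacity.
    """
    return sum(placed_specs_gb) + new_spec_gb <= node_capacity_gb

# A hypothetical 128 GB node already hosting two 48 GB RegionServer containers:
assert can_place_by_spec(128, [48, 48], 32)      # 48 + 48 + 32 = 128, fits
assert not can_place_by_spec(128, [48, 48], 48)  # would over-commit the node
```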
(2) Group node migration difficulty: manual RsGroup management becomes impractical in a dynamic cloud environment with fluctuating IPs.
Solution: Automate group migration by detecting the cloud group at container startup and invoking HBase APIs to move the node to the target RsGroup.
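The startup step above can be sketched as building the rsgroup move for the new container. The environment‑variable name and host names are assumptions; the real system invokes HBase's admin API directly, while this sketch emits the equivalent `move_servers_rsgroup` hbase shell command.

```python
import os

def rsgroup_move_command(server: str, port: int = 16020) -> str:
    """Build the hbase shell command that moves this container's RegionServer
    into its target RsGroup. The target group is read from an environment
    variable injected by the cloud platform at container creation
    (variable name CLOUD_RSGROUP is an assumption for illustration).
    """
    group = os.environ.get("CLOUD_RSGROUP", "default")
    return f"move_servers_rsgroup '{group}', ['{server}:{port}']"

os.environ["CLOUD_RSGROUP"] = "feed_group"
cmd = rsgroup_move_command("rs-feed-0.cloud.58", 16020)
assert cmd == "move_servers_rsgroup 'feed_group', ['rs-feed-0.cloud.58:16020']"
```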
(3) Service fault recovery delay: strict IP constraints prevent containers from restarting quickly, causing prolonged downtime.
Solution: Reserve a pool of container IPs per group and allow flexible IP assignment during restarts, enabling rapid recovery.
(4) Configuration management: Puppet relies on known IPs, which is unsuitable for cloud containers.
Solution: Pull configuration files at container startup, ensuring each restart obtains the latest configuration.
(5) ZGC memory statistics inaccuracy: ZGC's use of shared memory (shmem) is counted as full usage, while the cloud platform monitors only RSS, producing misleadingly low utilization metrics.
Solution: Enable file_zeropage=1 in kernel parameters to correct top command statistics, and adjust container memory accounting to include RSS plus proportional shmem.
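The adjusted accounting rule (RSS plus a proportional share of shmem) reduces to a one‑line formula; the sketch below is an illustration of that rule, with parameter names and figures invented for the example.

```python
def container_memory_gb(rss_gb: float, node_shmem_gb: float,
                        container_shmem_share: float) -> float:
    """Charge a container its RSS plus its proportional share of the node's
    shared memory, so ZGC-backed shmem pages are no longer invisible to the
    platform's utilization metrics (the accounting rule described above)."""
    return rss_gb + node_shmem_gb * container_shmem_share

# Hypothetical RegionServer: 8 GB RSS on a node where ZGC pages account for
# 40 GB of shmem, of which this container owns half:
assert container_memory_gb(8.0, 40.0, 0.5) == 28.0
```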
(6) Smooth node restart: container termination abruptly kills RegionServer processes, causing client retries and cache misses.
Solution: Integrate HBase graceful shutdown scripts into container termination: disable auto‑balancing, evict regions, re‑enable balancing, then destroy the container, minimizing impact on clients.
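The termination hook's ordering matters: the balancer must be paused before regions are drained, or it will fight the eviction. A minimal sketch of the sequence described above, with the step strings purely illustrative stand‑ins for the real hbase shell and platform calls:

```python
def graceful_restart_plan(regionserver: str) -> list:
    """Ordered steps of the graceful container teardown described above,
    mirroring the flow of HBase's graceful-stop tooling. Each entry stands
    in for a real hbase shell or cloud-platform operation."""
    return [
        "balance_switch false",                 # pause automatic balancing
        f"unload regions from {regionserver}",  # drain regions to peers
        "balance_switch true",                  # restore balancing for group
        f"destroy container {regionserver}",    # now safe to terminate
    ]

plan = graceful_restart_plan("rs-feed-0.cloud.58")
assert plan[0] == "balance_switch false"
assert plan.index("balance_switch true") < plan.index(
    "destroy container rs-feed-0.cloud.58")
```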
3. HBase Business Migration
3.1 Migration Plans
Three migration modes are defined:
Offline mode: used for batch‑import scenarios; migration is scheduled outside write windows.
Routing dual‑write mode: the routing layer writes to both clusters transparently, invisible to the business.
Replication dual‑write mode: supports online migration via HBase replication APIs and routing, ensuring data safety and business transparency.
3.2 Migration Issues and Handling
(1) Increased load during migration affecting stability: copying large tables consumes significant network and disk I/O.
Solution: Apply bandwidth throttling and perform migration in phased batches to prioritize service quality.
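Phased batching can be sketched as a greedy planner that caps each phase's copy volume so throttled bandwidth is not exhausted. This is a toy planner under assumed sizes and budget, not 58's actual tooling.

```python
def plan_migration_batches(table_sizes_gb: dict, phase_budget_gb: float) -> list:
    """Greedy phased batching: largest tables first, each phase capped so the
    copy traffic stays within the per-phase budget. A table larger than the
    budget still gets a phase of its own."""
    batches, current, used = [], [], 0.0
    for name, size in sorted(table_sizes_gb.items(), key=lambda kv: -kv[1]):
        if current and used + size > phase_budget_gb:
            batches.append(current)          # close the full phase
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches

# Hypothetical tables and a 1000 GB-per-phase budget:
batches = plan_migration_batches({"posts": 900, "users": 300, "logs": 500}, 1000)
assert batches == [["posts"], ["logs", "users"]]
```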
(2) Degraded access efficiency after migration: hot data may miss the cache, causing higher latency.
Solution: Pre‑warm hot data and perform gradual, transparent migration based on p99 latency thresholds.
(3) Incomplete routing caller migration: weak auditing can leave callers unmigrated, so some calls break after migration.
Solution: Enhance auditing and use a DAG algorithm to compute migration groups, ensuring no caller is left inaccessible.
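One reading of the graph‑based grouping is connected components over the audited caller→table access graph: any caller and table linked directly or transitively must land in the same migration group. The sketch below is that interpretation, with invented service and table names; the article's actual DAG algorithm is not shown.

```python
from collections import defaultdict

def migration_groups(access_log: list) -> list:
    """Group callers and tables that must migrate together, via union-find
    over audited (caller, table) access pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for caller, table in access_log:
        parent[find(f"caller:{caller}")] = find(f"table:{table}")

    comps = defaultdict(set)
    for node in list(parent):
        comps[find(node)].add(node)
    return list(comps.values())

# svc_a and svc_b share table t1, so they migrate together; svc_c is separate:
groups = migration_groups([("svc_a", "t1"), ("svc_b", "t1"), ("svc_c", "t2")])
assert len(groups) == 2
```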
(4) Major compaction resource risk: major compaction on large tables can strain resources in a container‑dense cloud environment.
Solution: Execute major compaction per table per RegionServer with controlled concurrency and resource limits.
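The concurrency control amounts to a semaphore over RegionServers: each server compacts its regions sequentially, and only a bounded number of servers compact at once. A simulation of that policy, where `compact` stands in for the real admin call:

```python
import threading

def run_compactions(regions_by_server: dict, max_concurrent_servers: int) -> list:
    """Simulate per-server sequential major compaction with at most
    `max_concurrent_servers` RegionServers compacting in parallel."""
    gate = threading.Semaphore(max_concurrent_servers)
    done, lock = [], threading.Lock()

    def compact(server):
        with gate:  # bound cluster-wide compaction load
            for region in regions_by_server[server]:
                with lock:
                    done.append(f"{server}/{region}")  # sequential per server

    threads = [threading.Thread(target=compact, args=(s,))
               for s in regions_by_server]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

done = run_compactions({"rs1": ["r1", "r2"], "rs2": ["r3"]}, 1)
assert sorted(done) == ["rs1/r1", "rs1/r2", "rs2/r3"]
```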
4. Cloud‑Native Summary
4.1 Project Benefits
The project delivers three major benefits: cost optimization and higher resource utilization from HBase cloudification, operational efficiency through automated DevOps, and architectural advancement via unified Hadoop version migration.
4.2 Future Plans
Future work focuses on capability enhancement and further cost optimization, such as integrating middleware configurations and improving diagnostic capabilities.
5. Author Introduction
Li Ying, Fan Le, Shen Chenhang – R&D staff in the Data Architecture Department focusing on storage. Thanks to the Cloud Platform and Operations teams for their strong support.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.