
Refactoring the Hybrid‑Cloud Account Login System: Challenges, Strategies, and Implementation

While holding a four-nines stability target, the team unified its divergent on-premise and public-cloud login services by refactoring the hybrid-cloud account system: consolidating the two codebases, standardizing the DAO layer, and shipping through gray releases across three active-active sites. The result was roughly half the development time, about 31% lower core-link latency, and reduced technical debt.

Bilibili Tech

The account login system is a core component of a game distribution platform, responsible for user registration, login, real‑name verification, anti‑addiction, privacy compliance, and risk control. Because it is the first online conversion point, its stability requirements are extremely high.

To meet these requirements, the platform adopted a two‑region three‑center active‑active architecture early on, centering on the company’s own data center and extending to public‑cloud regions in East China and South China. The hybrid‑cloud model brings flexibility and cost efficiency but also introduces two major challenges:

Data architecture: Storing data in the cloud raises leakage risks and creates API capability gaps (e.g., Bilibili‑authorized login cannot be fully supported in the cloud data center).

PaaS differences: While the company’s on‑premise infrastructure (DB management, KV store, message queue) is mature, the public‑cloud deployments must rely on cloud‑native services, leading to inconsistencies that were partially mitigated by a brittle anti‑corrosion layer.

These challenges resulted in two separate code repositories (login‑idc‑api and login‑cloud‑api) and increased development, testing, and deployment cycles by 30‑40%.

Challenges

Importance: Supports tens of millions of MAU and 113 SDK versions.

Stability: Must maintain four‑nines SLO during refactoring.

Complexity: Seven years of rapid evolution have accumulated substantial technical debt.
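
To put the four-nines target in perspective, the arithmetic below converts an availability target into an allowed-downtime budget. It is only a sketch: it assumes the SLO is measured as time-based availability, which the article does not specify, and the function name is illustrative.

```python
def allowed_downtime_minutes(availability: float, period_days: float) -> float:
    """Minutes of full outage that still satisfy the availability target."""
    return (1.0 - availability) * period_days * 24 * 60


FOUR_NINES = 0.9999  # 99.99%

monthly_budget = allowed_downtime_minutes(FOUR_NINES, 30)
yearly_budget = allowed_downtime_minutes(FOUR_NINES, 365)

print(f"30-day budget:  {monthly_budget:.2f} minutes")   # about 4.32
print(f"365-day budget: {yearly_budget:.2f} minutes")    # about 52.56
```

Under this assumption, a single botched release that takes login down for five minutes would already exhaust a 30-day budget, which is why the refactoring had to be released incrementally.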

Value of Refactoring

Efficiency: Development speed improves by ~50%, iteration speed by 30‑40%.

Cost: Reduces the manpower needed over a projected five-year lifecycle.

Quality: Code cyclomatic complexity drops by 40%; core‑link latency improves by 31.5%.

Culture: Reinforces a “relentless execution” mindset.

Strategic Directions

Three possible solutions were evaluated:

Unify all login services to the company data center (long‑term, but not feasible short‑term due to heavy cloud dependencies).

Bridge data architecture between the company and public clouds (risk of data leakage, lower ROI).

Refactor the divergent codebases to achieve compatibility and merge them (chosen approach).

Thought Process

The refactoring follows Martin Fowler’s definition: restructuring software without changing its observable behavior. Observable behavior is interpreted as API contracts, external service dependencies (Bilibili account service, third‑party verification, PaaS components), and cross‑service interactions.

Key Insight

Production traffic is routinely switched between sites for high availability, which suggests that the most complex parts (the Controller and Service layers) already behave equivalently across sites and can be abstracted away. The focus can therefore shift to the DAO layer, where the differences stem from hybrid-cloud data-store variations.

Implementation Details

• Use the company data-center repository as the baseline.
• Deploy the same DB schema across all three sites; reuse existing DB instances.
• Provision isolated Redis clusters for the public-cloud zones to avoid data contamination during gray releases.
• Align DAO implementations to select the appropriate zone-specific Bilibili API at runtime, following a hexagonal-architecture style.
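
The last point, runtime selection of a zone-specific API behind the DAO, can be sketched as a port with one adapter per zone. This is a minimal illustration, not the team's code: the class names, the zone names (`idc`, `cloud-east`, `cloud-south`), the `DEPLOY_ZONE` environment variable, and the returned payloads are all assumptions.

```python
import os
from typing import Callable, Dict, Protocol


class BilibiliAccountPort(Protocol):
    """Port: the only view of the Bilibili account API that the DAO depends on."""

    def fetch_profile(self, uid: int) -> Dict[str, object]: ...


class IdcAccountAdapter:
    """Adapter for the company data center (hypothetical internal call)."""

    def fetch_profile(self, uid: int) -> Dict[str, object]:
        return {"uid": uid, "via": "idc-internal"}


class CloudAccountAdapter:
    """Adapter for a public-cloud zone (hypothetical gateway call)."""

    def __init__(self, zone: str) -> None:
        self.zone = zone

    def fetch_profile(self, uid: int) -> Dict[str, object]:
        return {"uid": uid, "via": f"{self.zone}-gateway"}


# Zone name -> adapter factory. The zone names are placeholders.
ADAPTERS: Dict[str, Callable[[], BilibiliAccountPort]] = {
    "idc": IdcAccountAdapter,
    "cloud-east": lambda: CloudAccountAdapter("cloud-east"),
    "cloud-south": lambda: CloudAccountAdapter("cloud-south"),
}


def adapter_for(zone: str) -> BilibiliAccountPort:
    """Pick the adapter for a zone; fail loudly on an unknown zone."""
    try:
        return ADAPTERS[zone]()
    except KeyError:
        raise ValueError(f"unknown zone: {zone!r}") from None


class AccountDao:
    """Single DAO shared by all sites; zone differences live in the adapter."""

    def __init__(self, api: BilibiliAccountPort) -> None:
        self._api = api

    def load_profile(self, uid: int) -> Dict[str, object]:
        return self._api.fetch_profile(uid)


def dao_from_env() -> AccountDao:
    """Build the DAO from DEPLOY_ZONE (an assumed variable name)."""
    return AccountDao(adapter_for(os.environ.get("DEPLOY_ZONE", "idc")))
```

With this shape, the Controller and Service layers call `AccountDao` identically at every site, and only the adapter wiring differs per deployment.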

Data Validation – Product Perspective

Compare write/read behavior for user‑level data between old and new clusters.

Compare write/read behavior for game‑level data.

Run full SDK regression tests against a test domain.

Execute full API unit‑test coverage for non‑SDK endpoints.

Data Validation – Data Perspective

Database: Use scheduled jobs to compare source and target data sets.

Cache: Deploy isolated cache clusters during the gray rollout, so that a rollback does not affect business logic.

Telemetry/Reports: Monitor game‑level reporting trends and compare with historical data.
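
The scheduled database comparison described above could look roughly like the following. It is a sketch under assumptions: rows are modeled as in-memory tuples keyed by their first column, the column layout is invented for illustration, and a real job would read both data sets in batches from the databases.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (uid, status, updated_at)
Row = Tuple[int, str, str]


def row_digest(row: Row) -> str:
    """Stable fingerprint of a row, so whole rows can be compared cheaply."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def diff_tables(source: Iterable[Row], target: Iterable[Row]) -> Dict[str, List[int]]:
    """Compare two data sets by primary key (first column) and row digest."""
    src = {row[0]: row_digest(row) for row in source}
    tgt = {row[0]: row_digest(row) for row in target}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


source_rows = [(1, "active", "2024-01-01"), (2, "banned", "2024-01-02"), (3, "active", "2024-01-03")]
target_rows = [(1, "active", "2024-01-01"), (2, "active", "2024-01-02"), (4, "active", "2024-01-04")]

report = diff_tables(source_rows, target_rows)
print(report)
```

An empty report on every scheduled run is the signal that the old and new clusters write equivalent data; any non-empty bucket names the keys to investigate.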

Release Plan

The release follows strict safety guidelines: it must be gray‑able, observable, and recoverable.

Gray Deployment

Step 1: Gray rollout in the company data center: shift traffic in batches and roll back immediately on any anomaly.

Step 2: Deploy the full release to public cloud A (and likewise to B) while it receives no traffic, then gradually shift traffic to it using SLB rules based on domain, importance, and API call volume.
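
The ordering logic behind step 2 can be sketched as follows: move the least important, lowest-volume domains first, check health after each move, and send everything back to the old cluster on the first failure. This models only the decision logic, not a real SLB API; the domain names, the importance scale, and the `healthy` callback are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Domain:
    name: str
    importance: int   # 1 = least critical (hypothetical scale)
    daily_calls: int


def plan_order(domains: List[Domain]) -> List[Domain]:
    """Least important and lowest-volume domains are shifted first."""
    return sorted(domains, key=lambda d: (d.importance, d.daily_calls))


def shift_traffic(domains: List[Domain], healthy: Callable[[str], bool]) -> Dict[str, str]:
    """Shift domains one by one to the new cluster; revert all on any anomaly."""
    routing = {d.name: "old" for d in domains}
    for domain in plan_order(domains):
        routing[domain.name] = "new"
        if not healthy(domain.name):
            # Recoverable by construction: every rule points back to the old cluster.
            return {name: "old" for name in routing}
    return routing


domains = [
    Domain("login.example.test", importance=3, daily_calls=9_000_000),
    Domain("report.example.test", importance=1, daily_calls=50_000),
    Domain("verify.example.test", importance=2, daily_calls=400_000),
]

print([d.name for d in plan_order(domains)])
print(shift_traffic(domains, healthy=lambda name: True))
print(shift_traffic(domains, healthy=lambda name: name != "verify.example.test"))
```

Because the new cluster is fully deployed before any rule changes, each step is a pure routing change, which keeps rollback as cheap as the shift itself.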

Observability

Business & SLO metrics.

Log aggregation.

Performance monitoring (API success rates, error codes, API latency, JVM metrics).
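
As an illustration of the performance metrics listed above, the sketch below derives a success rate, an error-code breakdown, and a p99 latency from a batch of call records. The record shape and the convention that code 0 means success are assumptions; in practice these figures would come from the monitoring platform rather than hand-rolled code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Call:
    code: int          # 0 = success (assumed convention)
    latency_ms: float


def summarize(calls: List[Call]) -> Dict[str, object]:
    """Success rate, error-code counts, and nearest-rank p99 latency."""
    if not calls:
        raise ValueError("no calls to summarize")
    latencies = sorted(c.latency_ms for c in calls)
    n = len(latencies)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return {
        "success_rate": sum(1 for c in calls if c.code == 0) / n,
        "error_codes": Counter(c.code for c in calls if c.code != 0),
        "p99_ms": latencies[rank - 1],
    }


# 100 calls with latencies 1..100 ms; two of them fail with code 500.
calls = [Call(500 if i in (10, 20) else 0, float(i)) for i in range(1, 101)]
summary = summarize(calls)
print(summary)
```

Comparing such a summary between the old and new clusters during each gray stage gives a concrete rollback trigger, for example a success rate below the SLO or a p99 regression.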

Recoverability

Immediate rollback in the company data center without dirty data.

SLB‑based rollback for public‑cloud zones, redirecting traffic back to original clusters.

Separate Redis instances per zone to isolate key namespaces and prevent cross‑zone contamination.

References

“Bilibili Security Production Practice” – https://mp.weixin.qq.com/s/tj1PEUWAyRZ1QzeW_oCWfg

Martin Fowler, “Refactoring: Improving the Design of Existing Code”

“Complexity Has to Live Somewhere” – https://ferd.ca/complexity-has-to-live-somewhere.html
