Evolution and Architecture of Vivo's Game Recommendation System
This article chronicles the development, architectural challenges, and engineering solutions of a large‑scale game recommendation platform, covering background, initial models, business growth, caching strategies, GC optimization, rate‑limiting, fine‑grained operations, multi‑path recall, A/B testing, and future intelligent enhancements.
Authors: Vivo Internet Server Team – Ke Jiachen, Wei Ling
The article introduces the development history of a game recommendation project, discusses business and architectural challenges encountered in large‑scale systems, and presents solutions implemented by engineers, offering valuable references for similar projects.
1. Background and Significance of Game Recommendation
Search and recommendation are the two main ways users acquire information. Search requires users to state what they want, whereas recommendation delivers content that users receive passively, saving them time and effort. This is the motivation for building a game recommendation system.
The system distributes games across major traffic entrances (Game Center, App Store, Browser, Jovi, etc.) using various recommendation algorithms and strategies to recommend high‑intent, commercially valuable games, later extending to content and material recommendations.
2. Initial Model of Game Recommendation
The goal is to recommend games that users want while preserving commercial value. Commercial value is controlled by operational rules, and user intent is estimated by algorithmic ranking over feature data and user feedback.
The model consists of four parts:
Operational recommendation rule configuration
Algorithm model training
Recommendation strategy activation
Data point reporting
Before a strategy takes effect, operations staff create configuration rules, which are stored in a cache for the high-concurrency recommendation interface. When a user opens a specific page, the app backend calls the recommendation service with scene information. The service maps the scene to its configuration (recall, tags, expiration, algorithms, etc.), calls the algorithm service for ranking, and returns the results to the app. User behavior and recommendation data are reported as tracking data.
3. Business Growth and Architecture Evolution
As more business lines adopted the recommendation system, its functionality expanded to cover scenarios such as categories, topics, rankings, homepages, and search; strategies such as intervention, dispersion, resource allocation, and guaranteed volume; and several recommendation types (co-operated games, mini-games, content, and recommendation reasons).
These diverse scenarios increase complexity, posing new challenges in performance, scalability, and availability, driving architectural changes.
3.1 General Composite Strategy in an Entropic Environment
During the 0‑to‑1 phase, the focus was on increasing distribution volume using a layered architecture. In the 1‑to‑2 phase, the system also recommends content and materials, requiring flexible strategy invocation, dynamic rules, and user‑personalized logic, leading to the need for a highly reusable, extensible, low‑code strategy framework.
The composite strategy involves two roles—acceptor and executor—communicating via a recommendation context. Matchers, listeners, and processes implement various logic. The acceptor selects a strategy template, the listener performs preprocessing, the executor runs processes based on preconditions, and results are logged.
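The acceptor/executor split described above can be sketched roughly as follows. This is a minimal illustration, not the team's framework: every class name here (RecommendContext, StrategyProcess, StrategyTemplate, StrategyAcceptor, StrategyExecutor) is hypothetical, and listeners are modeled as plain callbacks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Shared state passed from acceptor to executor (hypothetical shape). */
class RecommendContext {
    final String scene;
    final Map<String, Object> attributes = new HashMap<>();
    final List<String> log = new ArrayList<>();

    RecommendContext(String scene) { this.scene = scene; }
}

/** One unit of strategy logic, guarded by a precondition. */
class StrategyProcess {
    final String name;
    final Predicate<RecommendContext> precondition;
    final Consumer<RecommendContext> action;

    StrategyProcess(String name, Predicate<RecommendContext> precondition,
                    Consumer<RecommendContext> action) {
        this.name = name;
        this.precondition = precondition;
        this.action = action;
    }
}

/** A template ties a matcher to its listeners and ordered processes. */
class StrategyTemplate {
    final Predicate<RecommendContext> matcher;
    final List<Consumer<RecommendContext>> listeners;
    final List<StrategyProcess> processes;

    StrategyTemplate(Predicate<RecommendContext> matcher,
                     List<Consumer<RecommendContext>> listeners,
                     List<StrategyProcess> processes) {
        this.matcher = matcher;
        this.listeners = listeners;
        this.processes = processes;
    }
}

/** Acceptor: picks the first template whose matcher accepts the context. */
class StrategyAcceptor {
    private final List<StrategyTemplate> templates;

    StrategyAcceptor(List<StrategyTemplate> templates) { this.templates = templates; }

    StrategyTemplate select(RecommendContext ctx) {
        for (StrategyTemplate t : templates) {
            if (t.matcher.test(ctx)) return t;
        }
        throw new IllegalArgumentException("No strategy template for scene: " + ctx.scene);
    }
}

/** Executor: runs listeners first, then each process whose precondition holds. */
class StrategyExecutor {
    void execute(StrategyTemplate template, RecommendContext ctx) {
        for (Consumer<RecommendContext> listener : template.listeners) {
            listener.accept(ctx); // preprocessing
        }
        for (StrategyProcess p : template.processes) {
            if (p.precondition.test(ctx)) {
                p.action.accept(ctx);
                ctx.log.add("ran:" + p.name);
            } else {
                ctx.log.add("skipped:" + p.name);
            }
        }
    }
}
```

Because all state travels in the context object, new processes can be added to a template without changing the acceptor or executor, which is one plausible way to get the reuse the section describes.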
3.2 Multi‑Level Cache and Near‑Real‑Time Strategy
The system handles peak traffic of ~30k TPS, with read‑heavy workloads. To ensure read performance, a Redis + local cache design is used: configuration updates write to MySQL, then Redis; local cache expires lazily, loading from Redis as needed, guaranteeing eventual consistency.
As the node count grew, consistency delays increased. To address stale local caches after configuration changes, a message queue notification and version comparison were added so that nodes refresh their caches in near real time.
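A minimal sketch of the version-comparison idea, under stated assumptions: the remote loader stands in for a Redis read, onConfigChanged stands in for the message queue consumer callback, and the names VersionedValue and TwoLevelConfigCache are hypothetical. Lazy expiry of local entries is omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** A cached configuration value tagged with a monotonically increasing version. */
class VersionedValue {
    final long version;
    final String value;

    VersionedValue(long version, String value) {
        this.version = version;
        this.value = value;
    }
}

/** Local cache in front of a remote store (assumed to be Redis in the article). */
class TwoLevelConfigCache {
    private final Map<String, VersionedValue> local = new ConcurrentHashMap<>();
    private final Function<String, VersionedValue> remoteLoader;

    TwoLevelConfigCache(Function<String, VersionedValue> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    /** Read path: serve from local memory, loading from the remote store on a miss. */
    VersionedValue get(String key) {
        return local.computeIfAbsent(key, remoteLoader);
    }

    /**
     * Hypothetical message queue callback for a configuration change.
     * Returns true if a refresh was attempted, false if the local copy
     * is absent or already at least as new as the announced version.
     */
    boolean onConfigChanged(String key, long newVersion) {
        VersionedValue cached = local.get(key);
        if (cached == null || cached.version >= newVersion) {
            return false; // nothing cached here, or a duplicate/out-of-order message
        }
        VersionedValue fresh = remoteLoader.apply(key);
        if (fresh == null) {
            local.remove(key);
            return true;
        }
        // Keep whichever copy is newer, in case another thread refreshed first.
        local.merge(key, fresh, (old, neu) -> neu.version > old.version ? neu : old);
        return true;
    }
}
```

Comparing versions before reloading means duplicate or out-of-order messages do not trigger redundant remote reads, which matters more as the number of nodes grows.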
3.3 High‑Concurrency Service Garbage Collection Handling
Frequent Full GC (FGC) in Java services caused latency spikes. The team applied several measures:
Move rarely changed caches (hour‑level) off‑heap.
Skip updates for infrequently changing caches (minute‑level) when values are unchanged.
Switch to G1 GC with tuned parameters:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=25
-XX:MaxNewSize=3072M
-Xms4608M
-Xmx4608M
-XX:MetaspaceSize=512M
-XX:MaxMetaspaceSize=512M
3.4 Rate Limiting, Degradation, and Fallback Strategies
Components like Hystrix, Sentinel, and Resilience4j are used for circuit breaking and rate limiting, but the team also implements layered rate limiting to prioritize critical services. For personalized fallback, historical user data is stored and used to generate tailored fallback recommendation lists.
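One way to picture layered rate limiting with a fallback is sketched below. It is an illustration only: the Priority tiers, the capacity shares (100%, 80%, 50%), and the LayeredRateLimiter class are assumptions, and it does not use the Hystrix, Sentinel, or Resilience4j APIs. A scheduler is assumed to call resetWindow() once per time window.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical traffic tiers; critical callers are shed last. */
enum Priority { CRITICAL, NORMAL, LOW }

/** Fixed-window limiter in which lower tiers are cut off at lower utilization. */
class LayeredRateLimiter {
    private final int capacityPerWindow;
    private final AtomicInteger used = new AtomicInteger();

    LayeredRateLimiter(int capacityPerWindow) {
        this.capacityPerWindow = capacityPerWindow;
    }

    /** Fraction of the window's capacity that a tier may consume (assumed values). */
    private static double share(Priority p) {
        switch (p) {
            case CRITICAL: return 1.0;
            case NORMAL:   return 0.8;
            default:       return 0.5;
        }
    }

    /** Takes one permit if total usage is still below this tier's threshold. */
    boolean tryAcquire(Priority p) {
        int limit = (int) (capacityPerWindow * share(p));
        while (true) {
            int current = used.get();
            if (current >= limit) return false;
            if (used.compareAndSet(current, current + 1)) return true;
        }
    }

    /** Intended to be called by a scheduler at the start of each window. */
    void resetWindow() {
        used.set(0);
    }

    /** Runs the primary call when admitted, otherwise the fallback. */
    <T> T call(Priority p, Supplier<T> primary, Supplier<T> fallback) {
        return tryAcquire(p) ? primary.get() : fallback.get();
    }
}
```

In this sketch the fallback supplier would return the precomputed, per-user fallback list that the section mentions, so a rejected request still receives a tailored result instead of an error.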
4. Exploration of Fine‑Grained Operation Modes
After the 0‑to‑1 expansion and 1‑to‑2 rapid growth phases, the architecture stabilizes, prompting focus on efficiency and cost reduction through fine‑grained operational design, including a layered orthogonal experiment platform.
4.1 Multi‑Layer Hash Orthogonal Experiment Platform
Accurate recommendation requires iterating on strategies based on data feedback. Traditional A/B testing isolates traffic per experiment, which does not scale to complex scenarios with many concurrent experiments. A multi-layer hash approach instead assigns each request to a bucket in every layer by hashing the user identifier together with a layer-specific identifier (salt), so traffic is distributed randomly and independently across layers.
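A minimal sketch of per-layer bucketing follows. The hash function (CRC32), the bucket count of 100, the separator, and the LayerBucketer name are all assumptions chosen for illustration; the article does not state which hash or bucket count the platform uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Assigns a user to one bucket per experiment layer. */
class LayerBucketer {
    static final int BUCKETS = 100; // assumed granularity: 1% of traffic per bucket

    /**
     * Hashes the layer salt together with the user identifier. Using a
     * different salt per layer makes a user's bucket in one layer
     * independent of the same user's bucket in another layer.
     */
    static int bucket(String layerSalt, String userId) {
        CRC32 crc = new CRC32();
        crc.update((layerSalt + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % BUCKETS);
    }

    /** True if the user's bucket in this layer falls in [fromInclusive, toExclusive). */
    static boolean inExperiment(String layerSalt, String userId,
                                int fromInclusive, int toExclusive) {
        int b = bucket(layerSalt, userId);
        return b >= fromInclusive && b < toExclusive;
    }
}
```

For example, an experiment that owns buckets 0 to 9 of one layer would receive roughly 10% of traffic, and the same user can simultaneously fall into any bucket of a second layer that uses a different salt.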
The experiment lifecycle includes preparation (hypothesis, baseline/experiment traffic split, configuration), execution (traffic routing, strategy execution, data reporting), and analysis (big‑data aggregation, metric evaluation, strategy iteration).
Experiment modules consist of configuration (mapping traffic to scenarios), data reporting (SDK‑based logging of game and request dimensions), and result analysis (big‑data processing to guide strategy adjustments).
{"code":0,"data":[{"score":0.016114970572330977,"data":{"gameId":53154,"appId":1364982,"recommendReason":null},"gameps":"tracking info"}],"reqId":"20200810174423TBSIowaU52fjwjjz"}
{"reqId":"20200810142134No5UkCibMdAvopoh","scene":"appstore.idx","imei":"869868031396914","experimentInfo":[{"experimentId":"RECOMMENDATION_SCENE","salt":"RECOMMENDATION_SCENE","imei":"3995823625","sinfo":"strategy info"},{"experimentId":"AUTO_RECOMMENDATION_REASON","salt":"RECOMMENDATION_SCENE","imei":"1140225751","sid":"3,4,5"}]}
4.2 Multi‑Path Recall Optimization
Recall narrows a massive candidate pool to a manageable size for ranking. Single‑path recall suffers from limited coverage; multi‑path recall combines personalized, algorithmic, pool‑based, and tag‑based recalls, merging, filtering, and truncating results before scoring.
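The merge, filter, and truncate steps can be sketched as below. The round-robin interleaving is an assumption made for this example so that no single path dominates after truncation; the article does not specify the merge order. The class name MultiPathRecall and the use of game IDs as Long values are also hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Combines candidate lists from several recall paths into one list for ranking. */
class MultiPathRecall {

    /**
     * @param paths    candidate game IDs per recall path, each in path order
     * @param filtered game IDs that must not be recommended (e.g. already installed)
     * @param limit    maximum number of candidates to pass on to ranking
     */
    static List<Long> merge(List<List<Long>> paths, Set<Long> filtered, int limit) {
        int maxLen = 0;
        for (List<Long> path : paths) {
            maxLen = Math.max(maxLen, path.size());
        }
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<Long> merged = new LinkedHashSet<>();
        for (int i = 0; i < maxLen; i++) {
            for (List<Long> path : paths) {
                if (i < path.size()) {
                    Long id = path.get(i);
                    if (!filtered.contains(id)) {
                        merged.add(id);
                    }
                }
            }
        }
        List<Long> result = new ArrayList<>(merged);
        if (result.size() > limit) {
            return new ArrayList<>(result.subList(0, limit));
        }
        return result;
    }
}
```

The returned list is what would then be sent to the ranking step for scoring.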
4.3 Dynamic Parameter Adjustment for Exposure
Algorithm performance is evaluated by exposure, download, CTR, etc. To react quickly to business needs, a dynamic tuning mechanism adjusts exposure in real time based on collected metrics, categorizing games into tiers and time windows, computing weight factors, and modifying exposure accordingly.
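The article does not give the weight formula, so the following is only a hypothetical illustration of the general shape: compare a game's observed rate in a time window against a baseline for its tier, clamp the ratio, and scale the ranking score by it. The formula, the clamp bounds, and the ExposureTuner name are all assumptions.

```java
/** Hypothetical exposure tuning: scales a score by a clamped performance ratio. */
class ExposureTuner {

    /**
     * @param exposures        exposures of the game in the current time window
     * @param downloads        downloads of the game in the same window
     * @param tierBaselineRate expected download rate for the game's tier (must be > 0)
     * @param minFactor        lower clamp, so a weak window cannot zero out exposure
     * @param maxFactor        upper clamp, so a strong window cannot dominate
     */
    static double weightFactor(long exposures, long downloads,
                               double tierBaselineRate,
                               double minFactor, double maxFactor) {
        if (exposures == 0) {
            return 1.0; // no data in this window: leave exposure unchanged
        }
        double observedRate = (double) downloads / exposures;
        double raw = observedRate / tierBaselineRate;
        return Math.max(minFactor, Math.min(maxFactor, raw));
    }

    /** Applies the factor to a ranking score to raise or lower exposure. */
    static double adjustScore(double score, double factor) {
        return score * factor;
    }
}
```

Clamping keeps a single noisy window from swinging exposure too far in either direction, which is one reason a real-time mechanism of this kind would typically bound its adjustments.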
5. Outlook: Intelligent Construction
Future work aims to build a full‑stack support system covering search, intelligent operations, smart coupons, push, and user feedback processing, further enhancing platform value beyond simple distribution.