
vivo Machine Learning Platform: Architecture Design and Practice

vivo's machine learning platform supports the company's game distribution, app store, e-commerce, and content recommendation businesses. It streamlines data processing, model training, and model deployment through quota-based resource management, a self-developed ultra-large-scale training framework (vlps) combined with TensorFlow, OpenAPI-driven training, and Jupyter-integrated interactive development. Models on the platform reach hundreds of millions of samples and billions of features.

vivo Internet Technology

This article introduces vivo's machine learning platform and its practical applications in supporting internet business scenarios including game distribution, app store, e-commerce, and content recommendation.

Business Background: As of August 2022, vivo had 280 million networked users and more than 70 million daily active users on its app store. At that scale, the platform's main challenge is making recommendation model iteration more efficient while balancing cost, efficiency, and user experience.

Platform Architecture: The machine learning platform is organized into three main stages: Data Processing (feature and sample data support), Model Training (model training and output), and Model Deployment (online model inference). The article focuses on the challenges of model training and the approaches used to optimize it.

Key Platform Capabilities:

Resource Management: The platform enforces quota-based resource control to prevent abuse. Quotas exist at both the group and the personal level, and both support temporary expansion and sharing. Scheduling uses a multi-dimensional scoring mechanism that considers a task's maximum runtime, its queue time, the granularity of its CPU, memory, and GPU requests, and its total resource demand.
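The quota check and scheduling score described above can be sketched roughly as follows. This is a minimal illustration, not vivo's implementation: the class names, the weights, the two-hour wait cap, and the reading of "granularity" as each resource's share of the cluster are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0        # cores
    memory_gb: float = 0.0
    gpu: int = 0

    def __add__(self, other):
        return Resources(self.cpu + other.cpu,
                         self.memory_gb + other.memory_gb,
                         self.gpu + other.gpu)

    def fits_within(self, limit):
        return (self.cpu <= limit.cpu
                and self.memory_gb <= limit.memory_gb
                and self.gpu <= limit.gpu)


@dataclass
class QuotaGroup:
    """Hypothetical quota group; a personal quota could use the same shape."""
    limit: Resources
    used: Resources = field(default_factory=Resources)
    temporary_extra: Resources = field(default_factory=Resources)  # temporary expansion

    def can_admit(self, request):
        # Admit only if current usage plus the request stays under the quota.
        return (self.used + request).fits_within(self.limit + self.temporary_extra)


@dataclass
class Job:
    name: str
    request: Resources
    max_runtime_hours: float
    queued_minutes: float


def schedule_score(job, cluster_total):
    """Higher score means scheduled earlier. Weights are illustrative only."""
    wait = min(job.queued_minutes / 120.0, 1.0)      # longer wait, higher score
    short = 1.0 / (1.0 + job.max_runtime_hours)      # shorter jobs, higher score
    pairs = [(job.request.cpu, cluster_total.cpu),
             (job.request.memory_gb, cluster_total.memory_gb),
             (job.request.gpu, cluster_total.gpu)]
    shares = [req / total for req, total in pairs if total > 0]
    demand = sum(shares) / len(shares) if shares else 0.0  # total resource demand
    return 0.4 * wait + 0.3 * short + 0.3 * (1.0 - demand)


def order_queue(jobs, cluster_total):
    return sorted(jobs, key=lambda j: schedule_score(j, cluster_total), reverse=True)
```

Under these assumed weights, a small job that has waited a long time is ordered ahead of a large job that has just been submitted, which is one plausible way to trade off fairness against utilization.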

Framework Self-Development: The training stack evolved from native TensorFlow to a self-developed ultra-large-scale training framework (vlps), and now runs on a new framework that combines TensorFlow with vlps.

Training Management: Training code can be uploaded to the platform's file server or pulled from git, and distributed training tasks are set up through configurable parameters. An OpenAPI lets developers complete ML tasks programmatically, without going through the console.
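As an illustration of what a configurable distributed training task might look like when submitted through an API, here is a minimal sketch that builds and validates a request body. The article does not publish the OpenAPI schema, so every field name here (`kind`, `location`, `entrypoint`, `workers`, `parameter_servers`, `hyperparameters`) is hypothetical, and the example stops short of any network call.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CodeSource:
    kind: str                    # "git" or "file_server" (hypothetical values)
    location: str                # repository URL or uploaded file path
    entrypoint: str = "train.py"


@dataclass
class TrainingTask:
    name: str
    code: CodeSource
    workers: int = 1
    parameter_servers: int = 0
    hyperparameters: dict = field(default_factory=dict)

    def validate(self):
        if self.code.kind not in ("git", "file_server"):
            raise ValueError(f"unsupported code source: {self.code.kind!r}")
        if self.workers < 1:
            raise ValueError("a task needs at least one worker")
        if self.parameter_servers < 0:
            raise ValueError("parameter_servers cannot be negative")

    def to_request_body(self):
        # Validate first so a malformed task never reaches the platform.
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


task = TrainingTask(
    name="ctr-model-debug",
    code=CodeSource(kind="git", location="https://example.com/team/ctr-model.git"),
    workers=4,
    parameter_servers=2,
    hyperparameters={"learning_rate": 0.01, "batch_size": 1024},
)
body = task.to_request_body()
```

The resulting JSON string could then be sent to the platform with any HTTP client. Validating on the client side gives developers a faster failure than waiting for the task to be rejected after submission.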

Interactive Development: The platform integrates Jupyter notebooks for interactive experimentation. Notebook sessions can reserve resources and are subject to configurable timeouts.
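One way to picture how resource reservation and timeouts interact is the sketch below, which releases the GPUs held by notebook sessions that have been idle too long. The article does not say whether the timeout is idle-based or absolute; an idle timeout is assumed here, and the `NotebookSession` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NotebookSession:
    owner: str
    reserved_gpus: int
    idle_timeout_s: float
    last_activity_s: float

    def touch(self, now_s):
        # Record user activity, which resets the idle clock.
        self.last_activity_s = now_s

    def is_expired(self, now_s):
        return now_s - self.last_activity_s >= self.idle_timeout_s


def reclaim_expired(sessions, now_s):
    """Return (sessions still active, number of GPUs released)."""
    active, released = [], 0
    for session in sessions:
        if session.is_expired(now_s):
            released += session.reserved_gpus
        else:
            active.append(session)
    return active, released


sessions = [
    NotebookSession("alice", reserved_gpus=1, idle_timeout_s=3600, last_activity_s=0),
    NotebookSession("bob", reserved_gpus=2, idle_timeout_s=3600, last_activity_s=0),
]
sessions[0].touch(now_s=3000)  # alice is still working
active, released = reclaim_expired(sessions, now_s=4000)
```

Here only bob's session has passed its timeout, so his two reserved GPUs return to the shared pool while alice keeps hers.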

Current Results: The platform now covers the model debugging work of vivo's internal algorithm engineers, at a scale of hundreds of millions of samples and billions of features.

Future Directions: Integrating platform capabilities, strengthening advance research on training frameworks, and providing preset parameters so that algorithms, engineering, and the platform are decoupled from one another.

Tags: Feature Engineering, model deployment, MLOps, TensorFlow, vivo, Resource Scheduling, distributed training, machine learning platform