
Sunfish: An Integrated AI Platform for Model Training and Online Service Deployment at Youzan

Sunfish is Youzan’s integrated AI platform. It combines visual drag‑and‑drop model training, notebook‑based algorithm development, automated model management, and one‑click publishing with a low‑latency, high‑availability “small‑box” inference service. Together these support end‑to‑end deep‑learning workflows, from data exploration to online deployment for recommendation and risk control.

Youzan Coder

Machine learning and deep learning are increasingly used in Youzan’s business scenarios such as marketing, recommendation, and risk control. Beyond data and algorithms, engineering support—fast model building, evaluation, and stable online serving—is essential. To meet these needs, Youzan built the Sunfish intelligent platform, which this article describes in detail.

Background : In Youzan’s recommendation system, a two‑stage process (recall and online ranking) relies on deep‑learning models for inference. The workflow includes data exploration, model training/evaluation, and model service deployment. Sunfish provides a one‑stop solution from training to deployment.

Sunfish Functional Architecture : Sunfish consists of a visual model‑training platform and a “small‑box” online model‑service platform. The visual platform offers rapid drag‑and‑drop modeling, notebook‑based custom algorithm development, and model management/publishing. The small‑box platform delivers low‑latency, high‑availability inference with continuous integration, A/B testing, and extensible plugins.

Visual Model‑Training Platform :

Rapid visual modeling: users create an experiment by dragging in components and configuring their parameters, then follow its logs while it runs; execution is fault‑tolerant.

Algorithm development & sharing: supports both notebook‑style coding and component publishing.

Model management & publishing: automatically saves trained models, allows uploading existing models, and enables one‑click deployment to the small‑box service.

The platform’s system architecture includes stateless Master and Worker nodes. Masters handle experiment, component, and model metadata; Workers fetch component code from GitLab, execute tasks, and provide logs. Zookeeper ensures fault‑tolerant coordination, while the small‑box platform (AlgorithmBox) handles model publishing.

Experiment Execution : An experiment is compiled into a Plan, a directed acyclic graph (DAG) of Tasks. The PlanScheduler orchestrates task execution using Runnable and Running queues and persists its state in Zookeeper, so that another Master can take over on failover. Workers retrieve task metadata from MySQL, pull code from GitLab, and execute Python or module tasks. Users can view real‑time logs and TensorBoard visualizations.
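The queue-based scheduling described above can be illustrated with a minimal sketch. It is not Sunfish's PlanScheduler: the function name `run_plan`, its parameters, and the synchronous `execute` callback are assumptions made for illustration, and the Zookeeper persistence and Worker dispatch are left out. It only shows how a Runnable queue and a Running set can drive a DAG of tasks in dependency order.

```python
from collections import deque


def run_plan(tasks, deps, execute):
    """Run every task in dependency order and return the completion order.

    tasks:   list of task names (hypothetical; every name in deps must appear here)
    deps:    dict mapping a task to the set of tasks it depends on
    execute: callable invoked once per task (stands in for a Worker)
    """
    tasks = list(tasks)
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for task, parents in remaining.items():
        for parent in parents:
            children[parent].append(task)

    # Runnable queue: tasks whose dependencies are all satisfied.
    runnable = deque(t for t in tasks if not remaining[t])
    # Running set: tasks handed to a worker but not yet finished.
    running = set()
    finished = []

    while runnable:
        task = runnable.popleft()
        running.add(task)
        execute(task)  # synchronous here; a real scheduler would dispatch asynchronously
        running.discard(task)
        finished.append(task)
        # Finishing a task may unblock its children.
        for child in children[task]:
            remaining[child].discard(task)
            if not remaining[child]:
                runnable.append(child)

    if len(finished) != len(tasks):
        raise ValueError("plan contains a dependency cycle")
    return finished


if __name__ == "__main__":
    deps = {
        "features": {"read_data"},
        "train": {"read_data", "features"},
        "evaluate": {"train"},
    }
    print(run_plan(["read_data", "features", "train", "evaluate"], deps, print))
```

In a design like the one the article describes, the contents of the two queues are the state that would be persisted, so that a new Master could resume the Plan instead of restarting it.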

Model Management & Publishing : After training, models are stored and can be published to the small‑box platform. Published models become callable via Dubbo or HTTP. Service management allows scaling, version control, and health monitoring.

Notebook Integration : JupyterLab is embedded in Sunfish, offering both Python 2 and Python 3 kernels, PySpark access to offline data, and workspace isolation.

Small‑Box Model Service Platform :

Architecture: Manager, Master, and Worker roles. Managers handle model registration, routing, and cluster management; Masters route inference requests; Workers load models (for example, in a TensorFlow Serving Docker container) and serve predictions.
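For a model hosted in a TensorFlow Serving container, a prediction call can be made through TensorFlow Serving's REST API. The article does not say whether Sunfish Workers use REST or gRPC, so the sketch below is an assumption. The helper name `build_predict_request`, the host, and the model name are hypothetical. The URL shape (`/v1/models/<name>:predict`), the `instances` body, and the default REST port 8501 follow TensorFlow Serving's documented interface. The request is only built here, not sent.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8501, version=None):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    # Row format: one entry in "instances" per example to score.
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_predict_request("localhost", "ctr_model", [[0.1, 0.2, 0.3]])
    print(req.get_method(), req.full_url)
    # To actually call a running server: urllib.request.urlopen(req)
```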

Routing: Static routes (desired worker set) and dynamic routes (actual worker availability) are merged via Zookeeper to form the final routing table.
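As a rough illustration of the merge, the sketch below treats the final routing table as the intersection of the desired workers (static) and the workers currently available (dynamic). The article does not spell out the merge rule, so the intersection is an assumption. The function name `merge_routes` and the plain dictionaries standing in for Zookeeper data are also hypothetical.

```python
def merge_routes(static_routes, dynamic_routes):
    """Return {model: sorted list of workers that are both desired and available}.

    static_routes:  {model: set of workers that should serve it}
    dynamic_routes: {model: set of workers that currently report it as loaded}
    """
    table = {}
    for model, desired in static_routes.items():
        alive = dynamic_routes.get(model, set())
        # Assumed rule: route only to workers present in both views.
        table[model] = sorted(set(desired) & set(alive))
    return table


if __name__ == "__main__":
    static = {"ctr": {"w1", "w2", "w3"}, "risk": {"w4"}}
    dynamic = {"ctr": {"w2", "w3", "w5"}}
    print(merge_routes(static, dynamic))
```

Under this rule a model with no available worker keeps an empty entry rather than disappearing, which makes the gap between desired and actual state easy to see.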

Request Processing: Requests specify a Plan composed of Stages (each invoking a model service). Stages may contain parallel SubStages. The Master creates a Session, fetches the Plan from Apollo, and executes stages via Workers, optionally loading custom plugins for pre/post‑processing.
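The Plan, Stage, and SubStage structure can be sketched as follows. This is an illustrative assumption, not the small‑box implementation: `execute_plan`, `call_model`, and the `pre`/`post` hooks are hypothetical names, the Session and the Apollo lookup are omitted, and the way one stage's results are passed to the next is a guess. Stages run in order, and the SubStages inside a stage run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_plan(plan, request, call_model, pre=None, post=None):
    """Execute stages in order; sub-stages inside a stage run in parallel.

    plan:       list of stages, each stage a list of model names (sub-stages)
    call_model: callable(model_name, payload) -> result; stands in for a Worker call
    pre, post:  optional plugin hooks for pre- and post-processing
    """
    payload = pre(request) if pre else request
    with ThreadPoolExecutor(max_workers=4) as pool:
        for stage in plan:
            futures = [pool.submit(call_model, name, payload) for name in stage]
            # Wait for every sub-stage; the next stage receives all results
            # keyed by model name (an assumed data flow).
            payload = {name: f.result() for name, f in zip(stage, futures)}
    return post(payload) if post else payload


if __name__ == "__main__":
    def fake_model(name, payload):
        # Placeholder for a real inference call.
        return f"{name} saw {payload!r}"

    plan = [["recall_a", "recall_b"], ["rank"]]
    print(execute_plan(plan, "user-1", fake_model))
```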

Outlook : Sunfish is still early‑stage. Future work includes feature expansion (scheduled tasks, dynamic plan selection), usability improvements (auto‑tuning, template experiments), and deeper business enablement (feature/model libraries, lifecycle management).

Contact: For collaboration, reach out to [email protected].
