Model Deployment Challenges and a Seldon‑Based Cloud‑Native Solution
The team replaced the cumbersome ABox deployment stack with Seldon-based, cloud-native model serving on Kubernetes. The new platform unifies TensorFlow and other framework models under one workflow, and adds GPU sharing, automated CRUD, per-model ingress, monitoring, and log collection, yielding scalable, fault-tolerant, zero-downtime model deployment.
Model deployment is the "last mile" of algorithm engineering, bringing high complexity for algorithm teams. It requires efficient deployment, version management, load balancing, fault tolerance, scalability, resource isolation, rate limiting, and metric monitoring, which are typically the strengths of engineering teams rather than algorithm teams.
The existing architecture, called ABox, consists of three modules: master, worker, and manager. The master handles request routing via ZooKeeper and executes user-defined JARs for pre- and post-processing. The worker registers itself, reports heartbeats, and pulls models for TensorFlow-Serving. The manager registers servers, creates models, provides UDL update interfaces, and manages clusters, business groups, deployments, and third-party algorithms.
Several pain points were identified:
High operational effort: manual scaling, an instance count capped by the number of worker nodes, manual URL registration for non-TensorFlow services, and OOM issues when running TensorFlow-Serving in containers.
Load imbalance across workers.
Resource contention: hot models monopolize CPU/memory.
Lack of universality: TensorFlow and other-framework models are managed through separate, fragmented paths, and GPU models cannot be deployed at all.
To address these issues, the team introduced Seldon, an open‑source cloud‑native model serving platform built on Kubernetes. Seldon provides a unified way to serve models from various frameworks (TensorFlow, Scikit‑Learn, MLflow, Triton, etc.) and supports custom inference servers.
The core of Seldon is the Seldon Core Controller (an operator) that manages the SeldonDeployment custom resource. It handles CRUD operations for deployments, services, and virtual services, and can integrate with Prometheus and Jaeger for monitoring and tracing.
Seldon distinguishes between Reusable Model Servers (which download models from external storage) and Non‑Reusable Model Servers (which embed the model in a custom image). An example TensorFlow model deployment (mnist.yaml) is shown below:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
spec:
  name: mnist
  predictors:
  - graph:
      children: []
      implementation: TENSORFLOW_SERVER
      modelUri: gs://seldon-models/tfserving/mnist-model
      name: mnist-model
      parameters:
        - name: signature_name
          type: STRING
          value: predict_images
        - name: model_name
          type: STRING
          value: mnist-model
    name: default
    replicas: 1
```

Key design changes when integrating Seldon into the internal K8s environment include:
Retaining the ABox master as the Dubbo entry point, converting requests to HTTP for the K8s ingress controller.
Replacing the previous ingress controller with Nginx Ingress and generating per‑model ingress rules.
Adding an HDFS‑Initializer to support hdfs:// model URIs for reusable model servers.
Adopting Tencent Cloud's GpuManager for GPU virtualization and sharing (instead of vGPU, due to OS compatibility and feature limitations).
Embedding a K8s client in the algorithm platform to perform CRUD on Seldon Deployments and Ingress resources.
Providing log collection for custom images via Filebeat → Kafka → custom log‑server, with HTTP access to recent logs.
Implementing resource usage monitoring (CPU, memory) per pod and exposing real‑time metrics in the UI.
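The per-model ingress generation mentioned above can be sketched as one Nginx Ingress rule per model. The sketch below is illustrative, not the platform's actual output: the hostname, path, and Ingress name are hypothetical placeholders, and the backend Service name assumes the Service that the Seldon operator creates for the deployment's predictor.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mnist-model-ingress        # hypothetical: one Ingress generated per model
spec:
  ingressClassName: nginx          # routed through the Nginx Ingress controller
  rules:
  - host: serving.example.internal # placeholder internal domain
    http:
      paths:
      - path: /seldon/default/mnist  # illustrative per-model routing path
        pathType: Prefix
        backend:
          service:
            name: tfserving-default  # Service created by the Seldon operator
            port:
              number: 8000
```

Generating one rule per model keeps routing declarative: the platform's K8s client only has to create or delete this resource alongside the SeldonDeployment itself.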
Additional operational notes:
TensorFlow-Serving binaries require the AVX/AVX2 instruction sets; on some KVM VMs the process crashes with "Illegal instruction". To reproduce the crash, the server can be started manually inside the container with:

```shell
/usr/bin/tensorflow_model_server --port=9000 --model_name=xxx --model_base_path=/path/to/model
```
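A quick way to see why a given VM triggers this crash is to inspect the CPU feature flags the guest exposes. This is an illustrative sketch, not part of the platform: the sample flags string below stands in for what the flags line of /proc/cpuinfo would show on a KVM guest without AVX passthrough.

```shell
# Illustrative check for the CPU flags TensorFlow-Serving is compiled against.
# On a real Linux node, read the flags line from /proc/cpuinfo instead:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo)
flags="fpu vme de pse tsc msr sse sse2 ssse3 sse4_1 sse4_2 ht syscall"

missing=""
for f in avx avx2; do
  case " $flags " in
    *" $f "*) ;;                    # flag present, nothing to do
    *) missing="$missing $f" ;;     # flag absent on this (simulated) guest
  esac
done

if [ -n "$missing" ]; then
  echo "missing:$missing"           # the guest likely needs host-passthrough CPU mode
fi
```

With the sample string above this prints `missing: avx avx2`, matching the "Illegal instruction" symptom: the KVM guest's virtual CPU model does not advertise AVX, so the host CPU must be passed through to the guest.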
The migration to the new architecture was performed gradually across the QA, pre-release, and production environments, using traffic-splitting switches to ensure zero downtime.
Future work includes supporting inference graphs, advanced rollout and traffic‑splitting strategies, automatic rebuilding of custom model server images, and additional model initializers (e.g., object storage).
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.