Model Deployment Challenges and a Seldon‑Based Cloud‑Native Solution
The team replaced the cumbersome ABox deployment stack with Seldon-based, cloud-native model serving on Kubernetes. The new platform unifies TensorFlow and other framework models under one workflow, and adds GPU sharing, automated CRUD, per-model ingress, monitoring, and log collection, yielding scalable, fault-tolerant, zero-downtime model deployment.
Model deployment is the "last mile" of algorithm engineering, bringing high complexity for algorithm teams. It requires efficient deployment, version management, load balancing, fault tolerance, scalability, resource isolation, rate limiting, and metric monitoring, which are typically the strengths of engineering teams rather than algorithm teams.
The existing architecture, called ABox, consists of three modules: master, worker, and manager. The master handles request routing via ZooKeeper and executes user-defined JARs for pre- and post-processing. The worker registers itself, reports heartbeats, and pulls models for TensorFlow-Serving. The manager registers servers, creates models, provides UDL update interfaces, and manages clusters, business groups, deployments, and third-party algorithms.
Several pain points were identified:
High operational effort: manual scaling, an instance count capped by the number of worker nodes, manual URL registration for non-TensorFlow services, and OOM issues when running TensorFlow-Serving in containers.
Load imbalance across workers.
Resource contention: hot models monopolize CPU/memory.
Lack of universality: TensorFlow and other-framework models are managed through separate, fragmented paths, and GPU models cannot be deployed at all.
To address these issues, the team introduced Seldon, an open‑source cloud‑native model serving platform built on Kubernetes. Seldon provides a unified way to serve models from various frameworks (TensorFlow, Scikit‑Learn, MLflow, Triton, etc.) and supports custom inference servers.
The core of Seldon is the Seldon Core Controller (an operator) that manages the SeldonDeployment custom resource. It handles CRUD operations for deployments, services, and virtual services, and can integrate with Prometheus and Jaeger for monitoring and tracing.
Seldon distinguishes between Reusable Model Servers (which download models from external storage) and Non‑Reusable Model Servers (which embed the model in a custom image). An example TensorFlow model deployment (mnist.yaml) is shown below:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
spec:
  name: mnist
  predictors:
  - graph:
      children: []
      implementation: TENSORFLOW_SERVER
      modelUri: gs://seldon-models/tfserving/mnist-model
      name: mnist-model
      parameters:
        - name: signature_name
          type: STRING
          value: predict_images
        - name: model_name
          type: STRING
          value: mnist-model
    name: default
    replicas: 1
```

Key design changes when integrating Seldon into the internal K8s environment include:
Retaining the ABox master as the Dubbo entry point, converting requests to HTTP for the K8s ingress controller.
Replacing the previous ingress controller with Nginx Ingress and generating per‑model ingress rules.
Adding an HDFS‑Initializer to support hdfs:// model URIs for reusable model servers.
Adopting Tencent Cloud's GpuManager for GPU virtualization and sharing (instead of vGPU, due to OS compatibility and feature limitations).
Embedding a K8s client in the algorithm platform to perform CRUD on Seldon Deployments and Ingress resources.
Providing log collection for custom images via Filebeat → Kafka → custom log‑server, with HTTP access to recent logs.
Implementing resource usage monitoring (CPU, memory) per pod and exposing real‑time metrics in the UI.
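The per-model ingress generation mentioned above can be sketched as one Nginx Ingress rule per model. The sketch below is illustrative, not the platform's actual output: the hostname, path, and Ingress name are hypothetical placeholders, and the backend Service name assumes the Service that the Seldon operator creates for the deployment's predictor.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mnist-model-ingress        # hypothetical: one Ingress generated per model
spec:
  ingressClassName: nginx          # routed through the Nginx Ingress controller
  rules:
  - host: serving.example.internal # placeholder internal domain
    http:
      paths:
      - path: /seldon/default/mnist  # illustrative per-model routing path
        pathType: Prefix
        backend:
          service:
            name: tfserving-default  # Service created by the Seldon operator
            port:
              number: 8000
```

Generating one rule per model keeps routing declarative: the platform's K8s client only has to create or delete this resource alongside the SeldonDeployment itself.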
Additional operational notes:
TensorFlow-Serving binaries require the AVX/AVX2 instruction sets; on some KVM VMs the process crashes with "Illegal instruction". To reproduce the crash, the server can be started manually inside the container with:

```shell
/usr/bin/tensorflow_model_server --port=9000 --model_name=xxx --model_base_path=/path/to/model
```
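A quick way to see why a given VM triggers this crash is to inspect the CPU feature flags the guest exposes. This is an illustrative sketch, not part of the platform: the sample flags string below stands in for what the flags line of /proc/cpuinfo would show on a KVM guest without AVX passthrough.

```shell
# Illustrative check for the CPU flags TensorFlow-Serving is compiled against.
# On a real Linux node, read the flags line from /proc/cpuinfo instead:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo)
flags="fpu vme de pse tsc msr sse sse2 ssse3 sse4_1 sse4_2 ht syscall"

missing=""
for f in avx avx2; do
  case " $flags " in
    *" $f "*) ;;                    # flag present, nothing to do
    *) missing="$missing $f" ;;     # flag absent on this (simulated) guest
  esac
done

if [ -n "$missing" ]; then
  echo "missing:$missing"           # the guest likely needs host-passthrough CPU mode
fi
```

With the sample string above this prints `missing: avx avx2`, matching the "Illegal instruction" symptom: the KVM guest's virtual CPU model does not advertise AVX, so the host CPU must be passed through to the guest.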
The migration to the new architecture was performed gradually across the QA, pre-release, and production environments, using traffic-splitting switches to ensure zero downtime.
Future work includes supporting inference graphs, advanced rollout and traffic‑splitting strategies, automatic rebuilding of custom model server images, and additional model initializers (e.g., object storage).
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.