
Integrating Distributed TensorFlow with Kubernetes: Architecture and Deployment

The article explains how to combine Distributed TensorFlow with Kubernetes—using GlusterFS storage, Deployments for parameter servers, Jobs for workers, service discovery, monitoring, and a Jinja2‑generated YAML template—to create isolated, scalable training clusters with Jupyter and TensorBoard access.

vivo Internet Technology

TensorFlow (70K+ GitHub stars) and Kubernetes (27K+ stars) are leaders in deep learning and container orchestration. This article reviews the integration of TensorFlow running on Kubernetes, discussing motivations, architecture, and practical deployment details.

1. Distributed TensorFlow

In April 2016, TensorFlow 0.8 introduced Distributed TensorFlow, enabling training across multiple servers. Large models, such as the 68‑billion‑parameter MoE layer, are only feasible to train when the work is spread across many machines, and Distributed TensorFlow lets a TensorFlow cluster do exactly that.
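Concretely, a distributed job is described by a cluster spec that maps job names ("ps", "worker") to task endpoints; in TF 1.x this dict is passed to tf.train.ClusterSpec and each process starts a tf.train.Server for its own task. A minimal sketch (the host names are illustrative placeholders, not from the article):

```python
# A TF 1.x cluster spec maps job names to the endpoints of their tasks.
cluster_spec = {
    "ps": ["ps-0:2222", "ps-1:2222"],
    "worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"],
}

# Each process then starts a server for its own task, e.g.:
#   cluster = tf.train.ClusterSpec(cluster_spec)
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Every task must know the full membership of the cluster up front,
# which is why service discovery (section 3) matters so much:
num_tasks = sum(len(hosts) for hosts in cluster_spec.values())
print(num_tasks)  # 5
```

Because every task needs this full membership list before training starts, stable, resolvable task addresses are a prerequisite for running the cluster anywhere.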

2. Why TensorFlow on Kubernetes

Although Distributed TensorFlow provides scalability, it lacks resource isolation and suffers from parameter‑server (PS) lifecycle issues. Kubernetes excels at isolation, scheduling, and service discovery, making it a natural platform for running TensorFlow clusters.

The authors chose GlusterFS as the distributed storage backend because HDFS read performance was insufficient for their workloads.

3. Integrated Architecture

The architecture supports both Between‑Graph and In‑Graph replication scenarios. PS tasks are deployed as Kubernetes Deployments, while worker tasks run as Jobs. Service discovery is handled by Kubernetes Service and KubeDNS. Each TensorFlow cluster creates two PersistentVolumes (PV) via a StorageClass that integrates with GlusterFS through Heketi: one for training data (/data) and one for logs (/log). Users receive isolated namespaces, Jupyter Notebook services (exposed via NodePort), and optional TensorBoard services.
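Because each task's Service is named {{ name }}-{{ job }}-{{ i }} and resolvable through KubeDNS, the host lists handed to the training script can be derived from nothing more than the algorithm name and the replica counts. A sketch of that naming convention (the function name is ours, not from the article):

```python
def task_hosts(name, job, replicas, port=2222):
    """Build the comma-separated host list for one job type, following
    the Service naming convention <name>-<job>-<index>:<port>."""
    return ",".join("{}-{}-{}:{}".format(name, job, i, port)
                    for i in range(replicas))

# Example: the lists a worker would receive as --ps_hosts / --worker_hosts
ps_hosts = task_hosts("imagenet", "ps", 2)
worker_hosts = task_hosts("imagenet", "worker", 3)
print(ps_hosts)  # imagenet-ps-0:2222,imagenet-ps-1:2222
```

The Jinja2 template in section 7 implements the same convention with macros.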

4. Core Components

TensorFlow 1.3.0, Kubernetes 1.7.4, Docker 1.12.6, GlusterFS 3.10.5, Harbor 1.1.2, Contiv netplugin, Keepalived, HAProxy, Etcd2/3, fluentd + Kafka + Elasticsearch + Kibana for logging, and cAdvisor + Prometheus + Grafana for monitoring.

5. Demo

A demo based on Kyle Bai’s GitHub repository walks through a simple TensorFlow‑on‑Kubernetes setup with a NodePort‑exposed Jupyter Notebook. It includes an In‑Graph cluster and a sample master_client.ipynb notebook.

6. Thinking (Q&A)

Q: How to recycle PS pods after training? A: A DevOps TaaS (TensorFlow as a Service) module watches Job completions; once all workers have finished, it waits 30 seconds and then deletes the PS Deployment and Service via the Kubernetes API.
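The watch-and-delete logic might look like the sketch below. The status shape and function names are assumptions, and the actual Kubernetes API calls are left as comments (in a real module they would go through the official kubernetes Python client); only the pure decision logic runs here.

```python
import time

def all_workers_succeeded(job_statuses):
    """Return True once every worker Job reports success.
    job_statuses: list of dicts like {"succeeded": 1, "failed": 0},
    an assumed simplification of the Kubernetes Job status object."""
    return all(s.get("succeeded", 0) >= 1 and s.get("failed", 0) == 0
               for s in job_statuses)

def recycle_ps(name, job_statuses, grace_seconds=30, sleep=time.sleep):
    """If all workers finished, wait a grace period, then delete the
    PS resources. The deletions (sketched as comments) would use the
    Kubernetes API, e.g.:
      apps_api.delete_namespaced_deployment(name + "-ps-<i>", namespace)
      core_api.delete_namespaced_service(name + "-ps-<i>", namespace)
    """
    if not all_workers_succeeded(job_statuses):
        return False
    sleep(grace_seconds)
    # ... issue the Deployment/Service deletions here ...
    return True

# Example: three workers, all done (sleep stubbed out for the demo)
statuses = [{"succeeded": 1, "failed": 0}] * 3
print(recycle_ps("imagenet", statuses, sleep=lambda s: None))  # True
```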

Q: How to checkpoint when PS is stateful? A: Workers use tf.train.Saver to fetch parameters from PS tasks and persist checkpoints.

Q: How to generate Kubernetes YAML from a few user parameters? A: A Jinja2 template (see code block) is used to render the necessary Service, Job, Deployment, and PersistentVolumeClaim resources.

7. Jinja2 Template Example

{% raw %}
{% set name = "imagenet" %}  # algorithm name
{% set worker_replicas = 3 %}  # number of workers
{% set ps_replicas = 2 %}      # number of PS
{% set script = "http://xxx.xx.xx.xxx:80/imagenet/imagenet.py" %}  # script URL

{% set image = "tensorflow/tensorflow:1.3.0" %}
{% set data_dir = "/data" %}
{% set log_dir = "/log" %}
{% set port = 2222 %}
{% set replicas = {"worker": worker_replicas, "ps": ps_replicas} %}

{% macro worker_hosts() -%}
{% for i in range(worker_replicas) %}{{ name }}-worker-{{ i }}:{{ port }}{% if not loop.last %},{% endif %}{% endfor %}
{%- endmacro %}

{% macro ps_hosts() -%}
{% for i in range(ps_replicas) %}{{ name }}-ps-{{ i }}:{{ port }}{% if not loop.last %},{% endif %}{% endfor %}
{%- endmacro %}

{% for job in ["worker", "ps"] %}
{% for i in range(replicas[job]) %}
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  selector:
    name: {{ name }}
    job: {{ job }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
{% if job == "worker" %}
---
kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py; python /opt/{{ name }}.py \
          --ps_hosts={{ ps_hosts() }} \
          --worker_hosts={{ worker_hosts() }} \
          --job_name={{ job }} \
          --task_index={{ i }} \
          --log_path={{ log_dir }} \
          --data_dir={{ data_dir }} ;"]
        volumeMounts:
        - name: data
          mountPath: {{ data_dir }}
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Never
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: {{ name }}-data-pvc
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
{% if job == "ps" %}
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py; python /opt/{{ name }}.py \
          --ps_hosts={{ ps_hosts() }} \
          --worker_hosts={{ worker_hosts() }} \
          --job_name={{ job }} \
          --task_index={{ i }} \
          --log_path={{ log_dir }} ;"]
        volumeMounts:
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Never
      volumes:
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
---
{% endfor %}
{% endfor %}

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-log-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-data-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
{% endraw %}

Running python render_template.py tfcluster_template.yaml.jinja | kubectl apply -f - creates the Between‑Graph TensorFlow cluster.
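The render_template.py helper is essentially a thin Jinja2 wrapper that reads the template file named on the command line and prints the rendered YAML to stdout. A minimal sketch, assuming Jinja2 is installed (the exact script in the repository may differ):

```python
import sys
from jinja2 import Template  # pip install jinja2

def render(template_text, **params):
    """Render a Jinja2 template string to its final YAML text.
    Keyword params can supply values the template expects."""
    return Template(template_text).render(**params)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python render_template.py tfcluster_template.yaml.jinja
    with open(sys.argv[1]) as f:
        sys.stdout.write(render(f.read()))
```

Piping the output into kubectl apply -f - then creates all the Services, Jobs, Deployments, and PVCs in one shot.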

8. Summary

Combining TensorFlow and Kubernetes unlocks the full power of Distributed TensorFlow. The article provides a practical overview, architecture diagrams, component lists, a demo, and a Jinja2 template for automated deployment. Future work includes custom scheduling, network I/O tuning, TaaS development, and rapid TensorFlow Serving deployment.

Tags: machine learning, Kubernetes, DevOps, TensorFlow, Distributed Computing, GlusterFS