
Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

360 Tech Engineering

360 AI Compute Center integrates artificial intelligence, heterogeneous computing, big data, high‑performance networking, and an AI development platform to provide efficient, intelligent compute resources for complex AI workloads.

Infrastructure Construction

Server selection: Each node uses dual CPUs, dual storage NICs, four PCIe Gen4 switches, six NVSwitch chips, eight GPUs (A100/A800), and four InfiniBand NICs. Storage NICs (25 Gb/s) are bonded (bond4) to achieve 50 Gb/s aggregate bandwidth, and checkpoint I/O is optimized via distributed storage and multi‑stage asynchronous saving, reducing a 7B model checkpoint time from 383 s to 5 s (≈70× speed‑up).
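The multi‑stage asynchronous saving idea can be sketched as two stages: a fast, blocking snapshot of the model state into host memory, then a background write to distributed storage so the training loop resumes immediately. The sketch below is a minimal stdlib‑only illustration; the state dictionary, file path, and class name are hypothetical stand‑ins for the framework's state‑dict copy and the distributed‑storage mount, not 360's actual implementation.

```python
import copy
import pickle
import threading

class AsyncCheckpointer:
    """Two-stage checkpoint: blocking in-memory snapshot, async persist."""

    def __init__(self):
        self._worker = None

    def save(self, state: dict, path: str) -> None:
        # Stage 1 (blocking, fast): snapshot state into host memory so
        # training can mutate the live parameters immediately afterwards.
        snapshot = copy.deepcopy(state)
        # Stage 2 (async, slow): persist the snapshot to (distributed)
        # storage in a background thread without stalling training.
        self.wait()  # allow only one in-flight checkpoint at a time
        self._worker = threading.Thread(
            target=self._persist, args=(snapshot, path), daemon=True
        )
        self._worker.start()

    @staticmethod
    def _persist(snapshot: dict, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    def wait(self) -> None:
        if self._worker is not None:
            self._worker.join()

ckpt = AsyncCheckpointer()
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
ckpt.save(state, "/tmp/ckpt.pkl")
state["step"] = 101   # training continues while the write runs
ckpt.wait()           # drain before exit
```

Because stage 1 copies the state before returning, later updates to the live parameters (step 101 above) cannot corrupt the checkpoint being written.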

GPU interconnect: Eight GPUs are fully connected via six NVSwitch chips, providing 600 GB/s (A100) or 400 GB/s (A800) NVLink bandwidth, which is not a bottleneck for thousand‑GPU training.

Network cards: Four 200 Gb/s Mellanox ConnectX‑6 NICs are used; GPUDirect RDMA can improve training speed by up to 50 % (topology verified with nvidia-smi topo -m).

Network Construction

The center adopts a predominantly east‑west traffic model, using a SuperPod‑style architecture with 200 A800 nodes per Scalable Unit (SU) and a leaf‑spine full‑mesh topology. Scaling beyond 200 nodes requires adding a Core Compute Switch layer.

Kubernetes Cluster Construction

Based on Volcano, the cluster implements advanced scheduling strategies:

Gang scheduling for "all‑or‑nothing" job placement.

Bin‑Pack to minimize resource fragmentation.

Priority and pre‑emption (P0‑P5 levels).

Network‑topology‑aware scheduling to place communicating tasks on low‑latency links.

Delay scheduling to avoid starvation of large‑resource jobs.

Heterogeneous compute scheduling for NVIDIA GPUs, Ascend chips, X86 and ARM architectures.
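The first two strategies above combine naturally: gang scheduling admits a job only if every one of its tasks can be placed at once, and bin‑packing picks, for each task, the node that leaves the least slack. A toy sketch of that combined logic (node names and GPU counts are illustrative, not the cluster's real inventory or Volcano's actual code):

```python
def gang_schedule(job_tasks, free_gpus):
    """Place all tasks of a job or none (all-or-nothing).

    job_tasks: list of GPU counts, one per task.
    free_gpus: dict node -> free GPU count (mutated only on success).
    Returns a placement plan, or None if the job must wait.
    """
    plan = []
    remaining = dict(free_gpus)   # tentative view; commit at the end
    for need in job_tasks:
        candidates = [n for n, g in remaining.items() if g >= need]
        if not candidates:
            return None  # one task cannot fit -> schedule nothing
        # bin-pack flavour: prefer the node that leaves the least slack
        node = min(candidates, key=lambda n: remaining[n] - need)
        remaining[node] -= need
        plan.append((node, need))
    free_gpus.update(remaining)   # commit only when every task fits
    return plan

free = {"node-a": 8, "node-b": 4}
print(gang_schedule([8, 4], free))  # both tasks fit -> placed
print(gang_schedule([8], free))     # no GPUs left -> None, job waits
```

The all‑or‑nothing commit is the key property: a partially placed distributed job would hold GPUs while blocked on its missing peers, which is exactly the deadlock gang scheduling prevents.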

Network plugins include a VPC for the management plane and Multus‑CNI with macvlan as a secondary CNI for the data plane. The NVIDIA network‑operator provides MOFED, rdma‑shared‑device‑plugin, and secondary CNI components (Multus‑CNI, container‑networking‑plugins, whereabouts) to enable RoCE v2 and InfiniBand.
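For illustration, a secondary macvlan interface attached via Multus is typically declared as a NetworkAttachmentDefinition whose spec embeds a CNI config string. The sketch below builds one in Python; the attachment name, master interface, and CIDR are hypothetical placeholders, not 360's actual configuration.

```python
import json

# Hypothetical macvlan attachment using the whereabouts IPAM plugin.
# Names, interface, and address range are illustrative only.
nad = {
    "apiVersion": "k8s.cni.cncf.io/v1",
    "kind": "NetworkAttachmentDefinition",
    "metadata": {"name": "rdma-macvlan"},
    "spec": {
        # CNI config is stored as a JSON string inside the spec
        "config": json.dumps({
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "eth1",
            "mode": "bridge",
            "ipam": {
                "type": "whereabouts",
                "range": "192.168.10.0/24",
            },
        })
    },
}
print(json.dumps(nad, indent=2))
```

A pod then requests the secondary interface with the `k8s.v1.cni.cncf.io/networks` annotation referencing the attachment name.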

Training Acceleration (QLM)

Qihoo Large Language Model (QLM) is a Megatron‑LM‑based framework optimized for the center’s thousand‑GPU cluster, achieving >47 % MFU on MoE models and >56 % on dense models, with dense model training speed of 175 TFLOPS (≈8× improvement). It supports thousand‑GPU training, HF model compatibility, visual monitoring, profiling, evaluation, and flexible fine‑tuning.
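MFU (Model FLOPs Utilization) is the ratio of achieved model FLOPs throughput to the hardware's peak. Assuming the A100/A800 dense BF16 peak of 312 TFLOPS per GPU (NVIDIA's published figure; the source does not state which peak it uses), the quoted 175 TFLOPS works out to roughly the stated >56 %:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs Utilization: achieved / peak throughput per GPU."""
    return achieved_tflops / peak_tflops

A100_BF16_PEAK = 312.0  # TFLOPS, dense BF16 peak for A100/A800
print(f"{mfu(175.0, A100_BF16_PEAK):.1%}")  # ~56.1%
```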

Inference Acceleration (GLLM)

Gaia Large Language Model (GLLM) is a multi‑platform inference engine (NVIDIA, Ascend, etc.) that outperforms vLLM by >10 %, using continuous batching, PagedAttention, and PrefixCache to improve latency and memory efficiency for long‑context workloads.
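Continuous batching refills the running batch at every decode step as individual sequences finish, instead of waiting for a whole static batch to drain. A simplified scheduler loop (the queue contents, batch size, and per‑request lengths are illustrative; real engines also manage the KV cache, which is where PagedAttention comes in):

```python
from collections import deque

def continuous_batching(requests, max_batch: int) -> int:
    """requests: list of (request_id, tokens_to_generate).
    Returns total decode steps; the batch is refilled every step."""
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = 0
    while waiting or running:
        # admit new requests as soon as slots free up (continuous batching)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # one decode step produces one token for every running sequence
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot is reused on the very next step
                # (PagedAttention would reclaim this sequence's KV pages)
    return steps

# batch of two: the short requests free their slots for the queued one,
# so all three finish in 3 steps instead of the 4 a static batch needs
print(continuous_batching([("a", 3), ("b", 1), ("c", 1)], 2))
```

With static batching the same workload takes 4 steps (3 for the [a, b] batch while b's slot idles, then 1 for c), which is the latency and utilization gain the technique targets.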

AI Platform Construction

The platform offers interactive modeling (Jupyter, VSCode), distributed training (3D parallelism, auto‑scaling, fault‑self‑healing), online deployment with auto‑scaling, resource pool management, and optimization features (task prioritization, queue limits, pre‑emption) that raise overall resource utilization by >25 %.

Visualization capabilities include cluster‑resource dashboards, training‑resource monitors (GPU/CPU usage, temperature, power), task‑level metrics, and training‑process visualizations (loss curves, hyper‑parameter comparison, gradient monitoring).

Fault Tolerance

QihooSMI provides automated detection and self‑healing for runtime environment issues (e.g., nvidia-fabricmanager service), hardware failures (GPU ECC, NIC faults), network anomalies (RoCE/IB via UFM, Prometheus/Grafana), and slow nodes (lightweight NCCL‑Tests detection within 5 minutes).
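Slow‑node detection of this kind typically compares each node's measured all‑reduce bus bandwidth against the fleet median and flags outliers. A minimal sketch of that comparison (the tolerance threshold and the per‑node sample numbers are illustrative, not QihooSMI's actual values):

```python
from statistics import median

def find_slow_nodes(bandwidth_gbps: dict, tolerance: float = 0.8) -> list:
    """Flag nodes whose measured NCCL bus bandwidth falls below
    tolerance * median across all nodes (illustrative threshold)."""
    baseline = median(bandwidth_gbps.values())
    return sorted(
        node for node, bw in bandwidth_gbps.items()
        if bw < tolerance * baseline
    )

# Hypothetical per-node busbw readings from a lightweight NCCL-Tests sweep
samples = {"node-01": 185.0, "node-02": 183.5,
           "node-03": 92.0, "node-04": 186.2}
print(find_slow_nodes(samples))  # ['node-03']
```

Using the median rather than the mean keeps a single badly degraded node from dragging the baseline down and masking itself.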

Conclusion and Outlook

The 360 AI Compute Center demonstrates a comprehensive, high‑performance AI infrastructure that integrates cutting‑edge hardware, networking, scheduling, and platform capabilities, and will continue to scale, enhance heterogeneous support, and improve visualization and automation to meet future AI workload demands.

Kubernetes, Inference Acceleration, distributed computing, AI infrastructure, Training Acceleration, GPU Cluster
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
