Tag

GPU Cluster

1 views collected around this technical thread.

ByteDance Cloud Native
ByteDance Cloud Native
Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrixDeepSeek-R1Distributed Inference
0 likes · 14 min read
How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours
DevOps
DevOps
Nov 27, 2024 · Artificial Intelligence

Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure

The article analyzes Elon Musk’s Colossus AI supercomputer—its 100,000 NVIDIA H100 GPUs, record‑fast 122‑day construction, vertical‑integration strategy, and the broader implications for U.S. AI infrastructure dominance and China’s competing challenges in funding and chip supply.

AI StrategyAI infrastructureElon Musk
0 likes · 13 min read
Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure
360 Tech Engineering
360 Tech Engineering
Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI infrastructureGPU ClusterInference Acceleration
0 likes · 21 min read
Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration
360 Zhihui Cloud Developer
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI infrastructureGPU ClusterKubernetes
0 likes · 22 min read
How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling
Baidu Geek Talk
Baidu Geek Talk
May 10, 2023 · Artificial Intelligence

Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization

Baidu’s AI infrastructure combines a massive InfiniBand‑linked GPU cluster, Kunlun chips, the PaddlePaddle framework, and the Wenxin model suite with 4D hybrid parallelism, elastic fault tolerance, and a two‑stage training pipeline to overcome computation, memory, and communication walls, delivering world‑leading MLPerf performance for large‑scale LLMs.

AI infrastructureGPU ClusterInfiniBand
0 likes · 15 min read
Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization
Tencent Cloud Developer
Tencent Cloud Developer
Apr 14, 2023 · Artificial Intelligence

Tencent Cloud's Next-Generation HCC High-Performance Computing Cluster for Large Model Training

Tencent Cloud's new HCC high‑performance computing cluster triples previous generation performance with 3.2 TB/s server bandwidth, Xingsha servers and NVIDIA H800 GPUs delivering up to 1979 TFlops, while its Xingmai 3.2 T ETH RDMA network, TB‑level storage via COS + GooseFS, and multi‑form access (bare metal, cloud servers, containers, functions) enable efficient large‑model training.

AI computingGPU ClusterHigh Performance Computing
0 likes · 9 min read
Tencent Cloud's Next-Generation HCC High-Performance Computing Cluster for Large Model Training
DataFunSummit
DataFunSummit
Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

GPU ClusterWeibodistributed training
0 likes · 18 min read
Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions