Tag

AI infrastructure

19 articles collected under this tag.

DataFunTalk
Jun 15, 2025 · Artificial Intelligence

Sam Altman Reveals the ‘Stargate’ AI Infrastructure Blueprint and Its $500B Future

In a Bloomberg Originals interview, OpenAI CEO Sam Altman discusses the massive “Stargate” infrastructure project, exploding demand for AI compute, multi‑partner collaborations, a projected $500 billion investment, GPU bottlenecks, and his vision for AI’s role in science, employment and humanity’s future.

AI funding · AI future · AI infrastructure
25 min read
Baidu Geek Talk
Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI infrastructure · Distributed Training · Large Language Models
23 min read
AntData
Mar 5, 2025 · Cloud Native

DeepSeek 3FS Network Communication Module: Design, Implementation, and Impact on AI Infrastructure

This article provides an in‑depth analysis of DeepSeek's open‑source 3FS distributed storage system, focusing on its network communication module, RDMA‑based design, core classes such as IBSocket, Listener, and IOWorker, and how these innovations advance high‑performance AI infrastructure.

AI infrastructure · Distributed Storage · Folly Coroutines
15 min read
Alibaba Cloud Infrastructure
Jan 20, 2025 · Cloud Computing

2024 Alibaba Cloud Infrastructure Network Team: AI‑Scale Network Innovations, Academic Achievements, Open‑Source Contributions and Industry Outreach

The 2024 report of Alibaba Cloud's Infrastructure Network team details AI‑driven network breakthroughs, high‑performance protocol stacks, large‑scale monitoring systems, numerous top‑conference paper acceptances, open‑source ecosystem initiatives, and extensive industry outreach, highlighting the evolving AI infra landscape.

AI infrastructure · Conference Papers · High Performance Computing
19 min read
DataFunSummit
Dec 30, 2024 · Artificial Intelligence

Colossal-AI: A Scalable Framework for Distributed Training of Large Models

This presentation introduces the challenges of the large‑model era, describes the Colossal‑AI architecture—including N‑dimensional parallelism, heterogeneous storage, and zero‑code experience—shows benchmark results and real‑world use cases, and answers audience questions about its integration with PyTorch and advanced parallel strategies.

AI infrastructure · Benchmark · Colossal-AI
11 min read
DataFunSummit
Dec 24, 2024 · Artificial Intelligence

Considerations and Practices for Adapting Large‑Model Inference Engines to Domestic Chips

This article examines the importance of domestic large‑model inference engines, compares Chinese and international chips, evaluates four architectural approaches, discusses practical challenges such as performance loss and model support, and outlines future expectations for high‑performance, heterogeneous‑chip inference solutions.

AI infrastructure · Domestic Chip · Inference Engine
9 min read
Alibaba Cloud Infrastructure
Nov 29, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open‑source Mooncake architecture, a KVCache‑centered large‑model inference framework that boosts throughput, lowers cost, and standardizes resource‑pooling techniques for high‑performance AI inference across industry and academia.

AI infrastructure · Alibaba Cloud · KVCache
4 min read
DevOps
Nov 27, 2024 · Artificial Intelligence

Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure

The article analyzes Elon Musk’s Colossus AI supercomputer—its 100,000 NVIDIA H100 GPUs, record‑fast 122‑day construction, vertical‑integration strategy, and the broader implications for U.S. AI infrastructure dominance and China’s competing challenges in funding and chip supply.

AI Strategy · AI infrastructure · Elon Musk
13 min read
Baidu Geek Talk
Oct 30, 2024 · Cloud Computing

Baidu Cloud Infrastructure for AI-Native Era

Baidu Intelligent Cloud outlines how its evolving, high-performance infrastructure—featuring rapid 3-minute instance provisioning, over 200 Gbps of network bandwidth, elastic computing, specialized storage, and AI-driven MLOps tools—enables AI-native model training and deployment across booming sectors such as automotive and finance, supporting the industry's shift to AI-centric cloud services.

AI infrastructure · Case Studies · Distributed Systems
9 min read
360 Tech Engineering
Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI infrastructure · Distributed Computing · GPU cluster
21 min read
Alibaba Cloud Infrastructure
Oct 12, 2024 · Fundamentals

Alibaba Cloud Server R&D Team Publishes Three Papers on High‑Density PCIe 6.0, 100G‑PAM4 Ethernet, and Immersion‑Cooling PCB Materials at IEEE EPEPS 2024 and PCB West 2024

Alibaba Cloud's server R&D team presented three research papers at IEEE EPEPS 2024 and PCB West 2024 covering high‑density PCIe 6.0 crosstalk optimization, 100G‑PAM4 Ethernet performance under air and immersion cooling, and sustainable low‑cost PCB materials for immersion‑cooled computer systems, highlighting their relevance to AI infrastructure and data‑center design.

AI infrastructure · PCB Materials · PCIe 6.0
10 min read
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI infrastructure · Distributed Training · GPU cluster
22 min read
DataFunSummit
Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI infrastructure · Large Models · data pipelines
20 min read
360 Zhihui Cloud Developer
Sep 19, 2024 · Operations

How TAI Platform Optimizes Large‑Model Scheduling and Fault Recovery on Kubernetes

This article explains how the TAI platform leverages Kubernetes and Volcano to tackle fault, efficiency, and usability challenges in large‑model training and inference, detailing custom resources, automated fault detection, and advanced scheduling strategies that boost resource utilization and performance.

AI infrastructure · Kubernetes · Large Models
9 min read
AntTech
Sep 15, 2024 · Artificial Intelligence

Dr. Wang Jian’s Keynote on AI, AI+, and AI Infrastructure at the 2024 Inclusion·Bund Conference

In his 2024 Inclusion·Bund Conference keynote, Dr. Wang Jian traces the short yet intense history of artificial intelligence, explains the emergence of AI+, discusses the pivotal role of transformer‑based models and AI infrastructure, and reflects on how cloud computing and innovative business models are reshaping the AI ecosystem.

AI · AI infrastructure · AI+
16 min read
Architects' Tech Alliance
Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Ten‑Thousand‑GPU‑Scale Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters—such as ByteDance's MegaScale, Baidu HPN, Alibaba HPN7, and Tencent Xingmai 2.0—highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

AI infrastructure · Data Center · GPU clusters
11 min read
DataFunSummit
Aug 24, 2024 · Databases

Cloud‑Native Storage Solutions for Large‑Scale Vector Data with Milvus and Zilliz

This article presents a comprehensive overview of Zilliz’s cloud‑native vector database ecosystem, detailing Milvus’s distributed architecture, indexing and query capabilities, related tools such as Towhee and GPTCache, storage challenges, tiered storage designs, performance metrics, and real‑world AI use cases like code‑assist and RAG‑based Q&A systems.

AI infrastructure · ANN Search · Large Scale Storage
21 min read
DataFunTalk
Jul 8, 2024 · Artificial Intelligence

Challenges and Techniques for Distributed Training of Large Language Models

This article reviews the historical background of large language models, the major challenges of distributed training—chiefly massive compute and memory demands—and the technical ecosystem that addresses them, including data parallelism, pipeline parallelism with 1F1B scheduling, and optimization frameworks such as DeepSpeed.

AI infrastructure · DeepSpeed · Distributed Training
22 min read
Baidu Tech Salon
May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPS Utilization through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving over 60% speedup and seamless Hugging Face integration.

AI infrastructure · AIAK-LLM · Baidu Baige
26 min read