Tag

AI training

20 articles collected under this tag.

Architect
May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Data Parallelism · Large Language Models
0 likes · 14 min read
DataFunTalk
Mar 24, 2025 · Artificial Intelligence

DeepSeek R1: Open‑Source Reasoning Model and Multi‑Stage Training Insights

The interview explores DeepSeek R1's open‑source weights, its multi‑stage training pipeline—including pre‑training, supervised fine‑tuning, and RLHF—alongside innovations such as self‑consistency, chain‑of‑thought prompting, distillation, MoE architectures, and cost considerations, highlighting its impact on the future of large language models.

AI training · Chain-of-Thought · DeepSeek
0 likes · 20 min read
DataFunSummit
Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructure for clusters of thousands to tens of thousands of GPUs, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce wasted training time and accelerate fault recovery.

AI training · Large-Scale Training · checkpointing
0 likes · 22 min read
DataFunTalk
Feb 18, 2025 · Artificial Intelligence

CODEI/O: Leveraging Code to Train Large Language Models for Enhanced Reasoning

The DeepSeek team introduced CODEI/O, a massive dataset that converts code into natural‑language reasoning chains, and demonstrated that training large language models on this data through a two‑stage training strategy markedly improves their performance on diverse reasoning tasks, including non‑code domains.

AI training · CODEI/O · Large Language Models
0 likes · 8 min read
Architects' Tech Alliance
Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies such as InfiniBand, RoCE, and iWARP provide high‑throughput, low‑latency data transfer that bypasses the CPU for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies for optimizing AI‑focused data‑center networks.

AI training · High Performance Computing · InfiniBand
0 likes · 11 min read
DataFunSummit
Jul 23, 2024 · Big Data

Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.

AI training · Alluxio · Big Data
0 likes · 22 min read
Architects' Tech Alliance
Jul 7, 2024 · Operations

Overview of Popular GPU/TPU Cluster Networking Technologies: NVLink, InfiniBand, RoCE, and DDC

This article reviews the main GPU/TPU cluster networking solutions, including NVLink, InfiniBand, RoCE Ethernet, and DDC fully scheduled fabrics, examining their latency, lossless transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

AI training · DDC · GPU networking
0 likes · 16 min read
DataFunSummit
Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data
0 likes · 22 min read
DataFunTalk
Jun 14, 2024 · Artificial Intelligence

Midjourney’s Diverse Data Sources: Public Datasets, Academic Research, Partner and Proprietary Data

Midjourney enhances its AI models by integrating a wide range of data sources—including public datasets like ImageNet and COCO, academic research from top conferences, partner collaborations, and its own proprietary data—while continuously updating and managing these datasets for quality, privacy, and security.

AI training · Bright Data · COCO
0 likes · 9 min read
Architects' Tech Alliance
May 15, 2024 · Artificial Intelligence

Detailed Overview of GPU Server Architectures: A100/A800 and H100/H800 Nodes

This article provides a comprehensive technical overview of large‑scale GPU server architectures, detailing the component topology of 8‑GPU A100/A800 and H100/H800 nodes, explaining storage network cards, NVSwitch interconnects, bandwidth calculations, and the trade‑offs between RoCEv2 and InfiniBand for AI workloads.

AI training · GPU · High Performance Computing
0 likes · 13 min read
IT Services Circle
May 13, 2024 · Information Security

The Hidden Costs and Ineffectiveness of CAPTCHAs

CAPTCHAs, originally designed as human‑based computation tools to block bots, have become costly, discriminatory, and largely ineffective security measures that waste billions of dollars annually while providing profit to service providers, prompting a 2024 debate on their continued use.

AI training · Accessibility · Human Computation
0 likes · 8 min read
360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training · Cloud Native · InfiniBand
0 likes · 12 min read
Model Perspective
Mar 16, 2024 · Artificial Intelligence

What Watching a TV Drama Reveals About AI Model Training and Learning Strategies

The article draws parallels between expert viewers dissecting the drama "The Legend of Zhen Huan," efficient paper‑reading techniques, and the active‑prediction and contrastive‑learning approach that underpins modern AI model training, highlighting how proactive thinking improves both human and machine learning outcomes.

AI training · Large Language Models · active learning
0 likes · 8 min read
Alibaba Cloud Infrastructure
Jun 16, 2023 · Cloud Computing

Predictable Network and High‑Performance Network Architecture for Large‑Scale AI Training

The article examines how Alibaba Cloud’s Predictable Network, InfiniBand versus Ethernet trade‑offs, and the HPN high‑performance network design together address the extreme bandwidth, latency, scalability, and reliability requirements of modern large‑model AI training workloads in cloud data centers.

AI training · Ethernet · High Performance Computing
0 likes · 24 min read
DataFunTalk
May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training · Alluxio · Cache Eviction
0 likes · 16 min read
Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence

Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network delivers a 1.6 Tbps Ethernet‑RDMA fabric with a fat‑tree topology supporting up to 4,000 GPUs, multi‑rail traffic aggregation, adaptive heterogeneous links, and a custom TCCL communication library, cutting AllReduce overhead from 35% to 3.7%, speeding AI training iterations by 32%, automating deployment, and providing sub‑second self‑healing.

AI training · GPU clusters · RDMA
0 likes · 19 min read
AntTech
Oct 9, 2022 · Cloud Computing

Sky Computing: A Multi‑Cloud Computing Platform for Transparent Resource Utilization

Sky Computing, introduced by Ant Technology Research Institute, proposes a cloud‑agnostic platform that abstracts heterogeneous public and private clouds into a unified service layer, enabling applications to seamlessly migrate workloads across clouds, reduce costs, avoid vendor lock‑in, and support AI training via the SkyML prototype.

AI training · Cost Optimization · Multi-Cloud
0 likes · 54 min read
Tencent Architect
Feb 23, 2021 · Artificial Intelligence

Analysis and Optimization of CephFS I/O Performance for AI Training on the Xingchen Compute Platform

This article investigates why AI training tasks on Tencent's Xingchen compute platform experience severe I/O slowdown when using CephFS, analyzes the underlying Ceph‑FUSE and MDS mechanisms, and proposes metadata‑caching and file‑caching optimizations that can accelerate training speed by three to four times.

AI training · Ceph-FUSE · CephFS
0 likes · 21 min read
Architects' Tech Alliance
Dec 24, 2019 · Fundamentals

Design Considerations and Benefits of Storage Class Memory (SCM) for Data‑Intensive Applications

The article examines the emerging Storage Class Memory (SCM) market, outlines its various technologies, discusses performance and cost trade‑offs, highlights how SCM can accelerate AI training, enable fast data recovery, and reduce data‑center power consumption, and presents the challenges of latency and system integration.

AI training · Performance · SCM
0 likes · 15 min read
iQIYI Technical Product Team
Jan 4, 2019 · Artificial Intelligence

Building a Deep Learning Training Platform on Cloud: Challenges, Runonce Service, and Storage Optimization

iQIYI built Jarvis, a cloud‑based deep‑learning training platform that replaced the initial Runonce service, by containerizing GPU tasks, adopting Ceph S3 storage with FUSE, optimizing data pipelines, and addressing compute, storage, and networking challenges to improve scalability and reduce GPU idle time.

AI training · Cloud Platform · Containerization
0 likes · 9 min read