Tagged articles

38 articles

Page 1 of 1

May 7, 2026 · Artificial Intelligence

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Unsloth and NVIDIA identified three low‑level bottlenecks in LLM fine‑tuning on consumer GPUs—repeated packed‑sequence metadata construction, serialized copy‑and‑compute during gradient checkpointing, and per‑expert routing overhead in MoE—and applied targeted patches that together deliver roughly a 25% speedup without changing hardware, code, or frameworks.

GPU optimizationLLM trainingMixture of Experts

0 likes · 12 min read

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

DeepSeek V4LLM trainingMulti-Expert Models

0 likes · 11 min read

What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

Old Zhang's AI Learning

Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantizationGPU optimizationLLM training

0 likes · 9 min read

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

Wu Shixiong's Large Model Academy

Apr 20, 2026 · Artificial Intelligence

How to Build Multi‑Step Reasoning Training Data for Deep Research Agents

Standard QA datasets fall short for deep research tasks because they lack the multi‑step, dynamic reasoning required; this article explains why, outlines four data‑construction techniques—SailorFog‑QA, WebFrontier, WebShaper, E2HQA—details trajectory sampling, filtering, scale considerations, and interview‑ready explanations.

AI agentsLLM trainingMulti-step Reasoning

0 likes · 16 min read

How to Build Multi‑Step Reasoning Training Data for Deep Research Agents

Machine Learning Algorithms & Natural Language Processing

Mar 9, 2026 · Artificial Intelligence

Can Self‑Iterating AI Agents Run on a Single GPU? Karpathy’s Autoresearch Demo

Karpathy’s open‑source “autoresearch” project demonstrates how a compact LLM training environment on a single GPU can let an AI agent autonomously modify code, run five‑minute training experiments, evaluate improvements, and iteratively produce better models, illustrating a new research paradigm where AI conducts experiments while humans design the system.

AI research automationAutoResearchKarpathy

0 likes · 6 min read

Can Self‑Iterating AI Agents Run on a Single GPU? Karpathy’s Autoresearch Demo

Data Party THU

Jan 7, 2026 · Artificial Intelligence

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study reveals that the widely used KL regularization in LLM reinforcement learning (RLVR) is mathematically biased, leading to unstable training and poorer generalization, and shows that moving the KL term back to the reward with a simple K1 estimator can boost out‑of‑domain performance by up to 20%.

AI researchKL regularizationLLM training

0 likes · 10 min read

Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

AI Insight Log

Jan 1, 2026 · Artificial Intelligence

Can DeepSeek’s mHC Architecture Break ResNet’s Decade-Long Dominance?

DeepSeek’s new paper “mHC: Manifold‑Constrained Hyper‑Connections” proposes a novel architecture that replaces traditional residual connections with mathematically constrained hyper‑connections, showing on a 27B model a modest 6.7 % training‑time increase but significant stability gains and superior performance on BBH, DROP and GSM8K benchmarks.

DeepSeekLLM trainingResNet

0 likes · 8 min read

Can DeepSeek’s mHC Architecture Break ResNet’s Decade-Long Dominance?

Wu Shixiong's Large Model Academy

Nov 15, 2025 · Artificial Intelligence

How to Build Robust Function Call Training Data for LLM Agents

This article explains why function call capabilities in large language model agents require dedicated training, outlines the four core abilities to teach, describes the structure and sources of effective training data, and compares lightweight LoRA fine‑tuning with full supervised fine‑tuning approaches.

Agent SystemsData GenerationLLM training

0 likes · 11 min read

How to Build Robust Function Call Training Data for LLM Agents

AI2ML AI to Machine Learning

Nov 3, 2025 · Artificial Intelligence

Smol Training Playbook: Secrets to Building World-Class LLMs

The article details the SmolLM3 3B‑parameter model, its architecture, dual‑mode inference, a three‑stage data‑curation strategy, rigorous ablation methods, preference optimisation (APO/DPO), model merging, and practical training‑stability tricks, offering a comprehensive guide for building high‑performing large language models.

APOLLM trainingcontext scaling

0 likes · 13 min read

Smol Training Playbook: Secrets to Building World-Class LLMs

Sohu Tech Products

Sep 10, 2025 · Artificial Intelligence

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

GRPOLLM trainingPPO

0 likes · 16 min read

How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

Baobao Algorithm Notes

Aug 1, 2025 · Artificial Intelligence

Why Training Large Language Models Feels Like Alchemy—and How to Master It

This article breaks down the hardware bottlenecks of large‑scale LLM training, explains the Roofline performance model, arithmetic intensity, and how computation and communication costs interact on GPUs and TPUs, offering concrete formulas and examples for efficient scaling.

Arithmetic intensityDistributed computingGPU

0 likes · 12 min read

Why Training Large Language Models Feels Like Alchemy—and How to Master It

Ops Development & AI Practice

Jul 29, 2025 · Artificial Intelligence

How Ray Transforms Distributed Training for Large Language Models

In the era of data‑driven AI, Ray offers an open‑source unified compute framework that abstracts distributed system complexity, enabling developers to seamlessly scale Python code from a laptop to large GPU clusters, and provides the Ray AI Runtime (AIR) with libraries such as Ray Data, Train, Tune, and Serve to accelerate LLM training, hyper‑parameter tuning, and model serving.

AI RuntimeDistributed computingLLM training

0 likes · 10 min read

How Ray Transforms Distributed Training for Large Language Models

Alibaba Cloud Big Data AI Platform

Jul 17, 2025 · Artificial Intelligence

How ChunkFlow Boosts Long-Context Model Training Up to 4.5× Faster

The paper "Efficient Long Context Fine-tuning with Chunk Flow" introduces ChunkFlow, a training framework that reorganizes variable‑length sequences into fixed‑size chunks, achieving up to 4.53× speedup and more stable GPU memory usage for large language models.

Artificial IntelligenceChunkFlowGPU optimization

0 likes · 7 min read

How ChunkFlow Boosts Long-Context Model Training Up to 4.5× Faster

Baobao Algorithm Notes

Jul 17, 2025 · Artificial Intelligence

How QK-Clip Tames MaxLogit Explosions in Trillion‑Parameter LLMs

The article introduces QK-Clip, a lightweight per‑head weight‑clipping technique that uses the MaxLogit signal to prevent uncontrolled logit growth in massive LLMs, explains its design, compares it with prior methods, and shows that it stabilizes training without harming model performance.

Attention stabilityLLM trainingMaxLogit

0 likes · 15 min read

How QK-Clip Tames MaxLogit Explosions in Trillion‑Parameter LLMs

Tencent Technical Engineering

Mar 31, 2025 · Artificial Intelligence

Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems

This step‑by‑step tutorial shows how to set up CUDA 12.4, install required packages, prepare a JSON dataset and custom reward, troubleshoot out‑of‑memory errors, and launch DeepSeek R1 training on an 8‑GPU A100 cluster using Accelerate, Deepspeed zero‑3 and vLLM configurations.

A100CUDADeepSeek

0 likes · 9 min read

Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems

Architect

Mar 16, 2025 · Artificial Intelligence

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small large‑language model, covering token‑level inference, pre‑training, post‑training steps such as supervised fine‑tuning, reward‑model creation, and reinforcement‑learning methods like DPO, PPO and GRPO, culminating in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

GRPOLLM trainingReward Modeling

0 likes · 22 min read

Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

Architect

Feb 25, 2025 · Artificial Intelligence

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model using a multi‑stage reinforcement learning pipeline—including GRPO optimization, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation—to produce a high‑performing, aligned LLM capable of accurate reasoning.

DeepSeekLLM trainingReward Modeling

0 likes · 24 min read

DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

Architect

Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI ArchitectureContext ParallelLLM training

0 likes · 13 min read

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

AI Algorithm Path

Feb 18, 2025 · Artificial Intelligence

Build DeepSeek‑R1 from Scratch: Complete Training Process with Code Walkthrough

This article provides a step‑by‑step, code‑first guide to reproducing DeepSeek‑R1 from the ground up, covering model selection, dataset preparation, custom reward functions, GRPO reinforcement‑learning training, supervised fine‑tuning, reasoning‑oriented RL, rejection sampling, and model distillation.

DeepSeek-R1LLM trainingPython

0 likes · 48 min read

Build DeepSeek‑R1 from Scratch: Complete Training Process with Code Walkthrough

Alibaba Cloud Big Data AI Platform

Jan 17, 2025 · Artificial Intelligence

How BladeDISC++ Cuts Memory Peaks for Dynamic‑Shape Deep Learning Models

This article explains the challenges of dynamic‑shape deep learning workloads and introduces BladeDISC++, an AI compiler that uses symbolic shape graphs, operation scheduling, and just‑in‑time auto‑rematerialization to dramatically reduce GPU memory peaks while maintaining training throughput.

AI compilerBladeDISC++LLM training

0 likes · 16 min read

How BladeDISC++ Cuts Memory Peaks for Dynamic‑Shape Deep Learning Models

Linux Kernel Journey

Dec 22, 2024 · Artificial Intelligence

Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.

DCGMGPU failuresGPU monitoring

0 likes · 26 min read

Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

Baobao Algorithm Notes

Oct 20, 2024 · Artificial Intelligence

Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

A recently discovered bug in popular LLM libraries shows that gradient accumulation can introduce significant accuracy loss compared to true large‑batch training, especially when sequence lengths vary, and the issue can be fixed by correcting the loss denominator scaling.

LLM trainingdeep learninggradient accumulation

0 likes · 6 min read

Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPOKV CacheLLM training

0 likes · 32 min read

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

Alibaba Cloud Big Data AI Platform

Sep 12, 2024 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts LLM Training with Offloading, FlashAttention‑3, and Communication Overlap

This article introduces Pai‑Megatron‑Patch, a suite of tools built on Nvidia Megatron‑LM that accelerates large language model training through dense and MoE model support, high‑precision HuggingFace↔MCore weight conversion, CPU offloading for optimizers and activations, FlashAttention‑3, and communication‑compute overlapping, and provides detailed experimental results and command‑line usage examples.

CPU offloadingCommunication OverlapDistributed Optimizer

0 likes · 22 min read

How Pai‑Megatron‑Patch Boosts LLM Training with Offloading, FlashAttention‑3, and Communication Overlap

NewBeeNLP

Jul 31, 2024 · Artificial Intelligence

Training 7B–13B LLMs: Practical Tips, Hyperparameters, and Scaling Challenges

The article shares hands‑on experience training 7‑ and 13‑billion‑parameter language models, covering essential hyper‑parameters, hardware requirements, data quality considerations, open dataset resources, and the systemic difficulties that arise when scaling to trillion‑parameter models.

LLM traininghyperparameterslarge language models

0 likes · 8 min read

Training 7B–13B LLMs: Practical Tips, Hyperparameters, and Scaling Challenges

Architects' Tech Alliance

Jul 30, 2024 · Artificial Intelligence

Unlocking 10K‑GPU LLM Training: Inside MegaScale’s 55% MFU Breakthrough

This article translates and analyzes the MegaScale system—co‑developed by ByteDance and Peking University—that enables efficient, stable training of massive language models on clusters of more than 10,000 GPUs, achieving 55.2% MFU and a 1.34× speedup over Megatron‑LM.

Distributed SystemsGPU scalingLLM training

0 likes · 15 min read

Unlocking 10K‑GPU LLM Training: Inside MegaScale’s 55% MFU Breakthrough

NewBeeNLP

Jun 12, 2024 · Artificial Intelligence

Beyond Cosine Decay: Fixed LR + Quick Decay Beats Traditional Schedules in LLM Training

The article analyzes why the traditional cosine decay learning‑rate schedule hinders continued training of large language models and shows that fixed‑learning‑rate strategies such as Warmup‑Stable‑Decay, Cooldown, SWA, and Schedule‑Free Optimizer can match or surpass cosine performance while being more friendly to fine‑tuning.

LLM trainingSFOSWA

0 likes · 7 min read

Beyond Cosine Decay: Fixed LR + Quick Decay Beats Traditional Schedules in LLM Training

Architects' Tech Alliance

Apr 23, 2024 · Industry Insights

Which GPU Cluster Network Wins for LLM Training? NVLink, InfiniBand, RoCE & DDC Compared

This article analyzes the main GPU/TPU cluster networking options—NVLink, InfiniBand, RoCE Ethernet, and DDC full‑schedule fabrics—examining latency, lossless transmission, congestion control, cost, power, and scalability to determine their suitability for large‑scale LLM training.

DDCData center fabricsGPU networking

0 likes · 18 min read

Which GPU Cluster Network Wins for LLM Training? NVLink, InfiniBand, RoCE & DDC Compared

NewBeeNLP

Apr 11, 2024 · Artificial Intelligence

How Karpathy Built a 1,000‑Line C LLM Trainer Without Any Deep‑Learning Framework

Andrej Karpathy released LLM.C, a pure C/CUDA implementation that trains GPT‑2‑style models in about 1,000 lines of code, detailing manual forward/backward passes, memory allocation tricks, SIMD CPU acceleration, CUDA porting, and migration tutorials, while comparing it to PyTorch and discussing broader LLM OS implications.

C ProgrammingCUDAGPT

0 likes · 6 min read

How Karpathy Built a 1,000‑Line C LLM Trainer Without Any Deep‑Learning Framework

Architects' Tech Alliance

Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Distributed SystemsGPU clustersLLM training

0 likes · 15 min read

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

OPPO Kernel Craftsman

Mar 22, 2024 · Artificial Intelligence

InternLM Model Fine-Tuning Tutorial with XTuner: Chat Format and Practical Implementation Guide

This tutorial walks through fine‑tuning Shanghai AI Lab’s open‑source InternLM models with XTuner, explaining chat‑format conventions, loading and inference (including multimodal InternLM‑XComposer), dataset preparation, configuration sections, DeepSpeed acceleration, and memory‑efficient QLoRA details for 7‑B‑parameter chat models.

Chat FormatDeepSpeedInternLM

0 likes · 22 min read

InternLM Model Fine-Tuning Tutorial with XTuner: Chat Format and Practical Implementation Guide

Baobao Algorithm Notes

Mar 21, 2024 · Artificial Intelligence

Can the CaR Method Achieve Better LLM Performance with Only 1.4% of Training Data?

This article explains how the CaR (Clustering and Ranking) approach evaluates data quality with a scoring model and selects diverse samples via PCA‑reduced sentence embeddings and K‑Means clustering, achieving comparable or superior large‑model performance while using just 1.96% of the original dataset.

CaR methodLLM trainingclustering

0 likes · 8 min read

Can the CaR Method Achieve Better LLM Performance with Only 1.4% of Training Data?

Alibaba Cloud Big Data AI Platform

Feb 28, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges OLMo LLM Training with Up to 1.64× Speedup

PAI‑TorchAcc, Alibaba Cloud’s PyTorch accelerator, integrates the open‑source OLMo large language model and delivers up to 1.64× faster training on OLMo‑1B and 1.52× on OLMo‑7B by leveraging graph capture, distributed, compute, communication, and memory optimizations, with detailed usage steps and performance analysis.

LLM trainingOLMoPAI‑TorchAcc

0 likes · 7 min read

How PAI‑TorchAcc Supercharges OLMo LLM Training with Up to 1.64× Speedup

Architects' Tech Alliance

Dec 24, 2023 · Artificial Intelligence

Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options—including NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks—explaining their latency, loss‑less transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

GPU networkingInfiniBandLLM training

0 likes · 18 min read

Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

Alibaba Cloud Big Data AI Platform

Sep 13, 2023 · Artificial Intelligence

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

This article introduces Pai‑Megatron‑Patch, an open‑source tool from Alibaba Cloud that streamlines large language model (LLM) training, weight conversion, FP8 mixed‑precision acceleration, and reinforcement‑learning workflows, providing detailed architecture, key features, code examples, and step‑by‑step usage instructions.

FP8LLM trainingMegatron

0 likes · 19 min read

How Pai‑Megatron‑Patch Accelerates Large Language Model Training on Alibaba Cloud

Baobao Algorithm Notes

Aug 21, 2023 · Artificial Intelligence

Mastering LLM Training: From Tokenizer Design to Instruction Tuning

This article provides a comprehensive, step‑by‑step guide to building large language models, covering tokenizer creation, vocabulary expansion, pre‑training strategies, dataset cleaning, instruction‑tuning techniques, and evaluation metrics such as C‑Eval and GPT‑4 based scoring.

LLM training

0 likes · 20 min read

Baobao Algorithm Notes

Jul 16, 2023 · Artificial Intelligence

Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training

The article examines why rising RM scores in large‑model training don't ensure superior LLM performance and presents seven practical RLHF tricks—ranging from KL‑penalty to global gradient clipping—that improve PPO stability and reduce resource overhead.

Artificial IntelligenceLLM trainingPPO

0 likes · 7 min read

Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training

21CTO

Apr 13, 2023 · Artificial Intelligence

How Microsoft’s Open‑Source DeepSpeed‑Chat Accelerates LLM Training by 15×

Microsoft has open‑sourced DeepSpeed‑Chat, a DeepSpeed‑based framework that simplifies end‑to‑end training and inference of ChatGPT‑style large language models, offering RL‑HF support, up to 15× speed‑up, massive cost reductions, and scalable performance on Azure for models ranging from billions to hundreds of billions of parameters.

AIDeepSpeedLLM training

0 likes · 7 min read

How Microsoft’s Open‑Source DeepSpeed‑Chat Accelerates LLM Training by 15×