How LiteScale Cuts Wait Times in Large‑Model Post‑Training with Gradient Accumulation

The article examines the bottleneck of synchronous rollout in large‑model post‑training, proposes an asynchronous design using gradient accumulation and a global micro‑batch count to preserve loss equivalence, and introduces LogitsExpress for efficient top‑K knowledge‑distillation communication, all implemented in the lightweight LiteScale framework.

Knowledge DistillationLarge Language Modelsasynchronous rollout

0 likes · 16 min read

How LiteScale Cuts Wait Times in Large‑Model Post‑Training with Gradient Accumulation

Baobao Algorithm Notes

Oct 20, 2024 · Artificial Intelligence

Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

A recently discovered bug in popular LLM libraries shows that gradient accumulation can introduce significant accuracy loss compared to true large‑batch training, especially when sequence lengths vary, and the issue can be fixed by correcting the loss denominator scaling.

LLM trainingdeep learninggradient accumulation

0 likes · 6 min read

Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

Alimama Tech

Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeedGPU optimizationLLaMA

0 likes · 10 min read

Megatron-LLaMA: High-Performance Large Language Model Training Framework