How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details its full and block variants and the engineering tricks used for distributed training, and reports extensive scaling‑law experiments in which the new design consistently improves validation loss and training efficiency across model sizes.

Attention Residuals · Model Scaling · Transformer

0 likes · 13 min read

AIWalker

Mar 15, 2025 · Artificial Intelligence

How SANA 1.5 Lets Small Models Reach New Text‑to‑Image SOTA

SANA 1.5 introduces an efficient model‑growth pipeline, depth‑pruning, and inference‑time scaling that reuse a 1.6 B‑parameter foundation to train a 4.8 B model with 8× lower memory, 60 % less training time, and GenEval scores that rival or surpass much larger diffusion models.

Inference Scaling · Model Scaling · diffusion

0 likes · 17 min read

Programmer DD

Apr 14, 2023 · Artificial Intelligence

How DeepSpeed-Chat Accelerates ChatGPT‑Style Model Training by 15×

Microsoft open‑sourced DeepSpeed‑Chat, a toolkit that streamlines the end‑to‑end training and inference of ChatGPT‑like large language models using RLHF, delivering up to fifteen‑fold speedups and dramatically lower costs, even on a single GPU.

ChatGPT · DeepSpeed · RLHF

0 likes · 8 min read

DataFunTalk

Jul 1, 2021 · Artificial Intelligence

Pre‑Trained Models: Past, Present, and Future – A Comprehensive Survey

This article surveys the evolution of pre‑trained models, covering the origins of transfer and self‑supervised learning, the rise of transformer‑based PTMs such as BERT and GPT, efficient architecture designs, multimodal and multilingual extensions, theoretical analyses, and future research directions for scalable and robust AI systems.

AI research · Multimodal · efficient training

0 likes · 27 min read
