AMD Paper Finds FP4 Training Instability Is Not Due to Randomness, 9‑10% Speedup

The authors demonstrate that FP4 training instability stems from structural micro‑scaling errors in the weight‑gradient path rather than insufficient randomness, and show that a deterministic Hadamard rotation restores convergence, delivering a 9‑10% end‑to‑end speedup on native FP4 hardware (AMD MI355X) while incurring only 8‑9% token overhead.

Deterministic Hadamard · FP4 · MXFP4

0 likes · 10 min read

Architects' Tech Alliance

May 26, 2026 · Artificial Intelligence

Huawei Ascend 950 NPU Architecture Deep Dive – Full Whitepaper Inside

The article provides a detailed technical analysis of Huawei's Ascend 950 NPU series, covering its one‑chip dual‑structure for training and inference, SIMD/SIMT dual‑mode compute, ultra‑fine memory granularity, prefill/decode (PD) separation, native FP4 support, a high‑bandwidth 2.0 interconnect, and a fully self‑developed yet CUDA‑compatible ecosystem.

AI accelerator · Ascend 950 · FP4

0 likes · 10 min read

CodeTrend

Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek · FP4 · Hybrid Attention

0 likes · 5 min read

Old Zhang's AI Learning

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Surge: Technical Specs, Quantization Details, Deployment Costs, and Market Impact

The article compiles key information on DeepSeek V4, covering Ollama's one‑click launch, the model's FP4/FP8 mixed‑precision quantization, size reductions, high local deployment costs, recent benchmark rankings, and the accompanying stock price movements in both China and the US.

AI benchmarks · DeepSeek V4 · FP4

0 likes · 5 min read

Machine Heart

Apr 16, 2026 · Artificial Intelligence

Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)

Sol‑RL, a framework from NVIDIA, Hong Kong University and MIT, integrates NVFP4 inference for large‑scale rollout exploration and BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1 and SD3.5‑L models.

BF16 · FP4 · GPU optimization

0 likes · 9 min read

Data Party THU

Sep 4, 2025 · Artificial Intelligence

How MXFP4 Quantization Lets a 120‑Billion‑Parameter LLM Run on a Single 80 GB GPU

This article analyzes the memory bottleneck of massive language models, explains the mathematical modeling of memory requirements, evaluates the limits of traditional sharding, and details how GPT‑OSS’s MXFP4 quantization combined with Mixture‑of‑Experts reduces memory, bandwidth, and compute demands enough to fit a 120‑billion‑parameter model onto an 80 GB GPU with minimal accuracy loss.

FP4 · LLM · MXFP4

0 likes · 11 min read
