Tagged articles
6 articles
Machine Heart
May 27, 2026 · Artificial Intelligence

AMD Paper Finds FP4 Training Instability Is Not Due to Randomness, 9‑10% Speedup

The authors demonstrate that FP4 training instability stems from structural micro‑scaling errors in the weight‑gradient path rather than insufficient randomness, and show that a deterministic Hadamard rotation restores convergence, delivering a 9‑10% end‑to‑end speedup on native FP4 hardware (AMD MI355X) while incurring only 8‑9% token overhead.

Deterministic Hadamard, FP4, MXFP4
0 likes · 10 min read
Architects' Tech Alliance
May 26, 2026 · Artificial Intelligence

Huawei Ascend 950 NPU Architecture Deep Dive – Full Whitepaper Inside

The article provides a detailed technical analysis of Huawei's Ascend 950 NPU series, covering its one‑chip dual‑structure for training and inference, SIMD/SIMT dual‑mode compute, ultra‑fine memory granularity, prefill/decode (PD) separation, native FP4 support, a high‑bandwidth 2.0 interconnect, and a fully self‑developed yet CUDA‑compatible ecosystem.

AI accelerator, Ascend 950, FP4
0 likes · 10 min read
CodeTrend
Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek, FP4, Hybrid Attention
0 likes · 5 min read
Machine Heart
Apr 16, 2026 · Artificial Intelligence

Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)

Sol‑RL, a framework from NVIDIA, Hong Kong University and MIT, integrates NVFP4 inference for large‑scale rollout exploration and BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1 and SD3.5‑L models.

BF16, FP4, GPU optimization
0 likes · 9 min read
Data Party THU
Sep 4, 2025 · Artificial Intelligence

How MXFP4 Quantization Lets a 120‑Billion‑Parameter LLM Run on a Single 80 GB GPU

This article analyzes the memory bottleneck of massive language models, models their memory requirements mathematically, examines the limits of traditional sharding, and details how GPT‑OSS's MXFP4 quantization combined with Mixture‑of‑Experts reduces memory, bandwidth, and compute demands enough to fit a 120‑billion‑parameter model onto a single 80 GB GPU with minimal accuracy loss.

FP4, LLM, MXFP4
0 likes · 11 min read