Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)

Sol‑RL, a framework from NVIDIA, Hong Kong University and MIT, integrates NVFP4 inference for large‑scale rollout exploration and BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1 and SD3.5‑L models.

BF16 · FP4 · GPU optimization

9 min read

Code DAO

Jan 15, 2022 · Artificial Intelligence

How Intel BF16 with IPEX and oneDNN Boosts PyTorch Performance

This article explains how the BF16 support developed by Intel and Facebook, combined with the Intel Extension for PyTorch (IPEX) and oneDNN, automates data-type and memory-layout conversions and adds graph-fusion optimizations, delivering 1.4× to 4.3× inference speedups and up to 2.4× training speedups on Xeon CPUs for models such as DLRM, BERT-Large, and ResNeXt-101-32x4d.

BF16 · CPU acceleration · IPEX

13 min read
