LoongForge Boosts Multimodal Training Speed by 45% on GPU and Kunlun XPU

LoongForge, Baidu Baige’s open‑source full‑modal training framework, unifies LLM, VLM and VLA workloads, runs unchanged on NVIDIA GPUs and Kunlun XPU, and delivers 15‑45% end‑to‑end speedups with up to 90% linear scaling on 5,000‑plus card clusters, while simplifying model integration via YAML.

Baidu Geek Talk
Baidu Geek Talk
Baidu Geek Talk
LoongForge Boosts Multimodal Training Speed by 45% on GPU and Kunlun XPU

When large‑scale models begin to understand images, video and the physical world, the question arises whether the LLM‑era infrastructure can still train the next generation of multimodal models efficiently. The answer is no: the training system and model architecture have become structurally misaligned.

Industry background

In the past three years, multimodal capability has shifted from a plug‑in visual encoder (e.g., InternVL, Qwen3‑VL) to a core component that shares the same learning mechanism with the language backbone (e.g., Ernie 4.5, Qwen3.6, Kimi K2.6). At the same time, heterogeneous compute platforms have emerged: domestic Kunlun P800 chips have moved from pilot projects to large‑scale clusters, making cross‑platform execution a basic requirement.

Core challenges in the multimodal era

Iteration speed is throttled by engineering complexity: high‑performance frameworks such as Megatron tightly couple model definition with distributed strategies, leading to weeks‑long adaptation cycles; FSDP is easier to adopt but suffers communication and memory bottlenecks at extreme scale.

Heterogeneous components cause hidden performance loss: the parameter gap between vision (ViT ≈ 300 M) and language (LLM up to hundreds of billions) makes a single parallel strategy sub‑optimal, and uneven multimodal data creates load imbalance that forces some GPUs to wait for the slowest rank.

Cross‑platform migration incurs sunk cost: community frameworks are often bound to a specific hardware ecosystem, requiring separate code branches for domestic chips and delivering a noticeable gap between “can run” and “run efficiently”.

LoongForge positioning and core value

To address these issues, Baidu Baige released LoongForge, an open‑source full‑modal training framework built on the Megatron engine. It provides a unified, high‑performance, and easy‑to‑use solution that runs unchanged on both NVIDIA GPUs and Kunlun XPU, covering LLM, VLM, VLA and diffusion scenarios.

Architecture overview

LoongForge consists of three layers: a model layer that abstracts multimodal networks, a system layer that optimizes end‑to‑end efficiency, and a hardware layer that bridges GPU and XPU.

Model layer: unified abstraction

All multimodal models share a common backbone (LLM) with peripheral encoders. LoongForge introduces a three‑part abstraction—Encoder, Foundation, and OmniCombinationModel—allowing developers to describe any multimodal architecture with a single YAML file. The framework automatically generates the network topology and parallel strategy, making the complexity invisible to model developers.

System layer: layered performance gains

CCT (Computation‑Communication‑Transfer) parallelism : hides the All‑to‑All cost of MoE long‑context training by offloading memory and overlapping compute, communication and data transfer. Qwen3‑30B‑A3B on a 32K sequence gains 16% over the A800 baseline, while competing solutions OOM on the same hardware.

ChunkPipe pipeline parallelism : converts the linear memory growth of ultra‑long sequences into a fixed overhead, enabling 1M‑level context training without extra sequence parallelism.

DSA operator fusion : deep fusion of sparse‑attention kernels for DeepSeek V3.2 yields an ≈5× end‑to‑end speedup compared with non‑CUDA‑fused versions.

DP load balancing : dynamically re‑orders multimodal samples each iteration to eliminate the load imbalance caused by heterogeneous data lengths, a key factor behind the >90% linear scaling on 5,000+ Kunlun P800 cards.

Heterogeneous model parallelism : assigns independent parallel configurations to vision and language components, achieving up to 50% higher throughput for Qwen3‑VL‑30B‑A3B compared with community baselines.

Adaptive FP8 (Selective FP8) : selects FP8 or BF16 per layer and per component based on offline benchmarks, avoiding the performance regression of a one‑size‑fits‑all FP8 setting; on Qwen3‑VL 235B (16K) it adds another ~10% gain.

Hardware layer: single code, multiple platforms

On the GPU side LoongForge uses the native PyTorch/CUDA interface of Megatron, preserving peak performance. On the XPU side a plug‑in layer abstracts the differences between Kunlun and NVIDIA, enabling zero‑invasion adaptation of the Megatron engine. Switching hardware only requires changing an environment variable.

Performance numbers

Across a range of representative models, LoongForge consistently outperforms community frameworks on identical hardware:

Qwen3‑30B‑A3B (MoE, 32K sequence): +16%

DeepSeek V3.2 (MoE, 8K sequence): +480%

Qwen3‑Next (MoE, 32K sequence): +15%

Qwen3‑VL‑30B‑A3B (VLM, 32K sequence): +45%

PI0.5 (VLA, BF16): +49%

Overall end‑to‑end training acceleration ranges from 15% to 45% on mainstream models, with up to 4.8× speedup on cutting‑edge architectures and >90% linear scaling on a 5,000‑card Kunlun cluster.

Typical production cases

LLaVA‑OneVision‑2.0 : a full‑frame‑rate multimodal vision‑language model that reduces video token consumption while matching Qwen3‑VL accuracy, benefitting from LoongForge’s heterogeneous parallelism and load‑balancing.

LLaVA‑OneVision‑1.5 : integrates the new RICE‑ViT encoder; the team adapted the encoder in a few days and completed 8B VLM Stage‑1.5 pre‑training on 128 A800 GPUs using LoongForge’s out‑of‑the‑box support.

Qianfan‑VL series (3B/8B/70B) : enterprise‑grade multimodal models trained on >5,000 Kunlun P800 cards, processing 3 T tokens with >90% cluster efficiency thanks to LoongForge’s 3D parallelism and communication‑compute fusion.

Operation demo: YAML‑driven workflow

Model definition, training strategy and data handling are all expressed in declarative YAML files. Changing the backbone from Qwen3 to DeepSeek only requires editing a single line:

defaults:
  - ../../models/[email protected]_encoder: qwen3_vit
  - ../../models/[email protected]_projector: qwen_mlp_adapter
- ../../models/[email protected]: qwen3_30b_a3b
+ ../../models/[email protected]: deepseek_v3
  - _self_

Training arguments retain the Megatron style while allowing component‑wise parallelism configuration via Hydra:

TRAINING_ARGS=(
    --training-phase sft
    --seq-length 32768
    --micro-batch-size 1
    ...
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 2
    --expert-model-parallel-size 8
    ...
)

# Component‑level parallelism
+model.image_encoder.tensor-model-parallel-size=1
+model.foundation.tensor-model-parallel-size=4
+model.image_encoder.freeze=True
+model.foundation.freeze=True

Weight handling supports both offline conversion to Megatron format and direct loading of HuggingFace checkpoints, with a one‑click export back to HF after training:

TRAINING_ARGS=(
    --load $CHECKPOINT_PATH          # HF directory
    --save $CHECKPOINT_PATH          # high‑performance checkpoint
    --save-interval 40
    --save-hf true
    --save-hf-path /path/to/output
    ...
)

Data preprocessing is a single command that converts raw multimodal data into the framework’s WebDataset format:

python tools/data_preprocess/vlm/convert_to_webdataset.py \
  --output_dir /workspace/wds_data/ \
  --json_file tests/datasets/vlm/mllm_demo.json \
  --image_dir tests/datasets/vlm/ \
  --video_dir tests/datasets/vlm/ \
  --media mix \
  --columns_messages messages \
  --maxcount 10000 \
  --maxsize 3000000000 \
  --sample_type multi_mix_qa

Training is launched with a single command; configuration examples for all supported models reside in configs/models/ and scripts in examples/.

Roadmap

Expand model support to Kimi K2.6, DeepSeek V4 and more embodied models.

Enable million‑token sequence training with lower memory overhead.

Continue performance gains in parallel strategies, operator fusion and communication scheduling.

Integrate training‑to‑inference co‑optimization (MTP) for faster decoding.

Improve tooling to further lower the barrier for model adaptation and tuning.

Conclusion

Tools that collapse complexity accelerate progress; LoongForge, released under Apache 2.0, aims to become the public AI infrastructure for the native multimodal era, letting valuable ideas be validated quickly and at scale.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Performance OptimizationGPUYAMLAI infrastructuremultimodal trainingKunlun XPULoongForge
Baidu Geek Talk
Written by

Baidu Geek Talk

Follow us to discover more Baidu tech insights.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.