Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework
This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details the optimization techniques implemented in the QLM framework (fine-grained and shared experts, top-k gating, token-distribution balancing, expert parallelism, and grouped GEMM) to improve training efficiency and performance.
Background
In recent years, the scale and complexity of deep learning models have grown dramatically, delivering significant performance gains but also increasing training and inference costs. To improve efficiency, researchers have introduced Mixture-of-Experts (MoE) models, which have become a focus in large language model development such as Qwen, Mixtral, and GPT‑4.
QLM Introduction
The QLM framework was developed by 360 Smart Engineering on top of Nvidia's Megatron‑LM distributed training system to address usability and acceleration gaps in the original framework.
MoE Introduction
MoE, originally proposed in the 1991 paper "Adaptive Mixtures of Local Experts" and later combined with Transformers in the 2020 GShard paper, uses a gating mechanism to dynamically select one or more expert sub-models for each input token, allowing specialized processing without a proportional increase in per-token computation.
Compared with dense models, MoE replaces each Transformer MLP layer with multiple expert MLPs and activates only the experts selected by the gate, reducing per-token compute relative to a dense model of the same total parameter count while improving model expressiveness.
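The mechanism above can be sketched in a few lines. This is a minimal, illustrative numpy version (top-1 routing for simplicity); the shapes, expert count, and variable names are our own assumptions, not QLM's actual implementation:

```python
import numpy as np

# Toy MoE layer replacing one dense MLP: a gate scores every token over the
# experts, and each token is processed only by its best-scoring expert.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 4, 8, 16, 4

# One (W_in, W_out) pair per expert MLP; a dense model would have just one.
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]
W_gate = rng.standard_normal((d_model, n_experts))

x = rng.standard_normal((n_tokens, d_model))

# Gating: softmax over experts, then top-1 dispatch.
logits = x @ W_gate
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)

y = np.zeros_like(x)
for i, e in enumerate(chosen):
    W_in, W_out = experts[e]
    h = np.maximum(x[i] @ W_in, 0.0)    # expert MLP with ReLU
    y[i] = probs[i, e] * (h @ W_out)    # scale by the gate probability
```

Only one of the four expert MLPs runs per token, which is where the compute saving over an equally large dense model comes from.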
MoE Training Optimizations in QLM
1. Fine‑grained Experts – Inspired by the DeepSeekMoE paper, splitting large experts into many smaller ones (e.g., 64 fine‑grained experts for a 2B model) improves specialization without changing total compute.
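The "same compute, more specialization" claim is easy to verify with back-of-the-envelope arithmetic. The configuration below is purely illustrative (it is not the 360zhinao 2B setup); the point is that splitting each expert by a factor m, while scaling top-k by the same factor, keeps both total and activated parameters fixed but vastly increases the number of expert combinations a token can activate:

```python
import math

# Coarse config: 8 experts with FFN width 4096, top-2 routing (illustrative).
n_experts, d_ff, top_k = 8, 4096, 2
m = 8                                      # split factor
fine_experts, fine_d_ff, fine_top_k = n_experts * m, d_ff // m, top_k * m

# Total and activated expert parameters (per weight matrix) are unchanged.
assert n_experts * d_ff == fine_experts * fine_d_ff        # total
assert top_k * d_ff == fine_top_k * fine_d_ff              # activated

# ...but the number of distinct expert subsets grows enormously.
combos_coarse = math.comb(n_experts, top_k)        # C(8, 2) = 28
combos_fine = math.comb(fine_experts, fine_top_k)  # C(64, 16), astronomically larger
```

This combinatorial flexibility is the DeepSeekMoE argument for why fine-grained experts specialize better at equal cost.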
2. Shared Experts – A permanently active shared expert captures generic knowledge, reducing redundancy among experts and improving overall efficiency.
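In forward-pass terms, a shared expert simply runs on every token and its output is added to the gated sum of the routed experts. A hedged numpy sketch (the helper `expert` and all dimensions are ours, chosen only for illustration):

```python
import numpy as np

# Shared-expert combine step: shared expert always runs; routed experts'
# gated outputs are added on top.
d = 8
x = np.random.default_rng(1).standard_normal(d)

def expert(seed):
    """Build a tiny two-layer MLP expert with its own random weights."""
    r = np.random.default_rng(seed)
    W1, W2 = r.standard_normal((d, 2 * d)), r.standard_normal((2 * d, d))
    return lambda v: np.maximum(v @ W1, 0.0) @ W2

shared = expert(100)             # always active, captures generic knowledge
routed = [expert(s) for s in range(4)]
gates = np.array([0.7, 0.3])     # example gate weights for two chosen experts
chosen = [1, 3]

y = shared(x) + sum(g * routed[e](x) for g, e in zip(gates, chosen))
```

Because generic patterns flow through the shared path, the routed experts are free to specialize, which is the redundancy reduction the text describes.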
3. Top‑k Gating – Increasing the number of experts each token is routed to (e.g., top‑k = 2 in GShard, top‑k = 6 in DeepSeekMoE) improves quality at the cost of higher communication, which QLM mitigates via configurable all‑gather or all‑to‑all dispatch strategies.
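Top-k gate weights are typically the k largest softmax scores per token, renormalized so the combine weights sum to 1. A minimal numpy sketch (shapes are illustrative); note that each unit of k means every token's activations travel to one more expert, which is the communication cost the text mentions:

```python
import numpy as np

# Top-k gating: keep the k highest-probability experts per token and
# renormalize their scores into combine weights.
rng = np.random.default_rng(2)
n_tokens, n_experts, k = 5, 8, 2

logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

topk_idx = np.argpartition(probs, -k, axis=-1)[:, -k:]   # ids of k best experts
topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
weights = topk_p / topk_p.sum(axis=-1, keepdims=True)    # renormalize to sum 1
```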
4. Token Distribution Optimization – To avoid token‑distribution imbalance, QLM employs expert capacity limits and router load‑balancing techniques such as an auxiliary load‑balancing loss (aux‑loss) and the Sinkhorn algorithm, giving experts more uniform training opportunities.
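One common form of the auxiliary loss (used in GShard and Switch Transformer; we assume a similar formulation here) is aux = N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean gate probability for expert i; it is minimized when routing is uniform. A numpy sketch:

```python
import numpy as np

# Switch/GShard-style auxiliary load-balancing loss over one batch.
rng = np.random.default_rng(3)
n_tokens, n_experts = 64, 4

logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                            # top-1 routing

f = np.bincount(chosen, minlength=n_experts) / n_tokens   # routed fraction f_i
P = probs.mean(axis=0)                                    # mean gate prob P_i
aux_loss = n_experts * np.sum(f * P)
# Perfectly uniform routing gives aux_loss == 1.0; imbalance pushes it higher.
```

Added to the main loss with a small coefficient, the gradient through P nudges the router away from overloading any single expert.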
5. Expert Parallelism – Experts are placed on different GPUs, extending the parallelism beyond tensor and pipeline parallelism and enabling training of extremely large MoE models.
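The bookkeeping behind expert parallelism can be shown without multiple GPUs: each rank owns a contiguous block of experts, and given the routing decisions, every rank computes how many tokens it must send to every other rank before the all-to-all dispatch. A single-process sketch with made-up numbers (the contiguous-block placement is an assumption, not necessarily QLM's scheme):

```python
import numpy as np

# Expert-parallel placement: 8 experts sharded over 4 expert-parallel ranks,
# each rank holding a contiguous block of experts.
n_experts, ep_size = 8, 4
experts_per_rank = n_experts // ep_size
owner = np.arange(n_experts) // experts_per_rank   # expert id -> owning rank

# Given this rank's routing decisions, build the all-to-all send plan:
# how many of its tokens go to each rank.
chosen = np.array([0, 3, 3, 5, 6, 1, 7, 2])        # example expert choices
send_counts = np.bincount(owner[chosen], minlength=ep_size)
```

In a real run, `send_counts` (and its all-to-all-exchanged counterpart, the receive counts) sizes the communication buffers for dispatching tokens to remote experts.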
6. Grouped‑GEMM – By grouping multiple expert weight matrices and corresponding input matrices into larger matrices, QLM leverages CUTLASS’s grouped‑gemm to accelerate matrix multiplication in a multi‑expert setting.
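The payoff of grouping is that many tiny per-token multiplications become a few large per-expert GEMMs (which CUTLASS's grouped GEMM then fuses into one kernel launch on GPU). A CPU-side numpy sketch of the permute/compute/un-permute pattern, with illustrative shapes, verifying it matches the naive per-token loop:

```python
import numpy as np

# Grouped-GEMM idea: sort tokens so same-expert tokens are contiguous,
# run one GEMM per expert, then restore the original token order.
rng = np.random.default_rng(4)
n_tokens, d_model, d_ff, n_experts = 16, 8, 12, 4

x = rng.standard_normal((n_tokens, d_model))
chosen = rng.integers(0, n_experts, n_tokens)
W = rng.standard_normal((n_experts, d_model, d_ff))   # one weight per expert

# Naive baseline: one tiny matmul per token.
naive = np.stack([x[i] @ W[chosen[i]] for i in range(n_tokens)])

# Grouped: permute tokens by expert, one large GEMM per expert, un-permute.
order = np.argsort(chosen, kind="stable")
xs, out = x[order], np.empty((n_tokens, d_ff))
counts = np.bincount(chosen, minlength=n_experts)
offset = 0
for e in range(n_experts):
    out[offset:offset + counts[e]] = xs[offset:offset + counts[e]] @ W[e]
    offset += counts[e]
grouped = np.empty_like(out)
grouped[order] = out                                  # restore token order

assert np.allclose(naive, grouped)
```

The two results are identical; the grouped form simply exposes larger, GPU-friendly matrix shapes.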
Empirical results on the 360zhinao 2B MoE model show up to a 50% performance boost.
Conclusion
MoE represents a key trend in Transformer evolution, offering faster training and inference than dense models of comparable parameter count while introducing challenges such as increased communication and memory demand. The QLM platform now supports a suite of optimizations (fine‑grained and shared experts, top‑k gating, balanced token routing, expert parallelism, and grouped‑GEMM) to address these challenges, and it will continue to evolve for 360's internal large‑model workloads.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.