Optimizing Mixture-of-Experts (MoE) Training with the QLM Framework
This article introduces the background and challenges of large language model training, explains the Mixture-of-Experts (MoE) architecture, and details the optimization techniques implemented in the QLM framework (fine-grained and shared experts, top-k gating, token-distribution balancing, expert parallelism, and grouped GEMM) to improve training efficiency and performance.
Background
In recent years, the scale and complexity of deep learning models have grown dramatically, delivering significant performance gains but also increasing training and inference costs. To improve efficiency, researchers have introduced Mixture-of-Experts (MoE) models, which have become a focus in large language model development such as Qwen, Mixtral, and GPT‑4.
QLM Introduction
The QLM framework was developed by 360 Smart Engineering on top of Nvidia's Megatron‑LM distributed training system to address usability and acceleration gaps in the original framework.
MoE Introduction
MoE, originally proposed in the 1991 paper "Adaptive Mixtures of Local Experts" and later combined with Transformers in the 2020 GShard paper, uses a gating mechanism to dynamically select one or more expert sub-models for each input token, allowing specialized processing without a proportional increase in per-token computation.
Compared with dense models, MoE replaces each Transformer MLP layer with multiple expert MLPs and activates only the experts selected by the gate, reducing per-token compute relative to a dense model of the same total parameter count while improving model expressiveness.
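The mechanism above can be sketched in a few lines. This is a minimal, illustrative numpy version (top-1 routing for simplicity); the shapes, expert count, and variable names are our own assumptions, not QLM's actual implementation:

```python
import numpy as np

# Toy MoE layer replacing one dense MLP: a gate scores every token over the
# experts, and each token is processed only by its best-scoring expert.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 4, 8, 16, 4

# One (W_in, W_out) pair per expert MLP; a dense model would have just one.
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]
W_gate = rng.standard_normal((d_model, n_experts))

x = rng.standard_normal((n_tokens, d_model))

# Gating: softmax over experts, then top-1 dispatch.
logits = x @ W_gate
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)

y = np.zeros_like(x)
for i, e in enumerate(chosen):
    W_in, W_out = experts[e]
    h = np.maximum(x[i] @ W_in, 0.0)    # expert MLP with ReLU
    y[i] = probs[i, e] * (h @ W_out)    # scale by the gate probability
```

Only one of the four expert MLPs runs per token, which is where the compute saving over an equally large dense model comes from.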
MoE Training Optimizations in QLM
1. Fine‑grained Experts – Inspired by the DeepSeekMoE paper, splitting large experts into many smaller ones (e.g., 64 fine‑grained experts for a 2B model) improves specialization without changing total compute.
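The "same compute, more specialization" claim is easy to verify with back-of-the-envelope arithmetic. The configuration below is purely illustrative (it is not the 360zhinao 2B setup); the point is that splitting each expert by a factor m, while scaling top-k by the same factor, keeps both total and activated parameters fixed but vastly increases the number of expert combinations a token can activate:

```python
import math

# Coarse config: 8 experts with FFN width 4096, top-2 routing (illustrative).
n_experts, d_ff, top_k = 8, 4096, 2
m = 8                                      # split factor
fine_experts, fine_d_ff, fine_top_k = n_experts * m, d_ff // m, top_k * m

# Total and activated expert parameters (per weight matrix) are unchanged.
assert n_experts * d_ff == fine_experts * fine_d_ff        # total
assert top_k * d_ff == fine_top_k * fine_d_ff              # activated

# ...but the number of distinct expert subsets grows enormously.
combos_coarse = math.comb(n_experts, top_k)        # C(8, 2) = 28
combos_fine = math.comb(fine_experts, fine_top_k)  # C(64, 16), astronomically larger
```

This combinatorial flexibility is the DeepSeekMoE argument for why fine-grained experts specialize better at equal cost.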
2. Shared Experts – A permanently active shared expert captures generic knowledge, reducing redundancy among experts and improving overall efficiency.
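In forward-pass terms, a shared expert simply runs on every token and its output is added to the gated sum of the routed experts. A hedged numpy sketch (the helper `expert` and all dimensions are ours, chosen only for illustration):

```python
import numpy as np

# Shared-expert combine step: shared expert always runs; routed experts'
# gated outputs are added on top.
d = 8
x = np.random.default_rng(1).standard_normal(d)

def expert(seed):
    """Build a tiny two-layer MLP expert with its own random weights."""
    r = np.random.default_rng(seed)
    W1, W2 = r.standard_normal((d, 2 * d)), r.standard_normal((2 * d, d))
    return lambda v: np.maximum(v @ W1, 0.0) @ W2

shared = expert(100)             # always active, captures generic knowledge
routed = [expert(s) for s in range(4)]
gates = np.array([0.7, 0.3])     # example gate weights for two chosen experts
chosen = [1, 3]

y = shared(x) + sum(g * routed[e](x) for g, e in zip(gates, chosen))
```

Because generic patterns flow through the shared path, the routed experts are free to specialize, which is the redundancy reduction the text describes.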
3. Top‑k Gating – Increasing the number of experts each token is routed to (e.g., top‑k = 2 in GShard, top‑k = 6 in DeepSeekMoE) improves quality at the cost of higher communication, which QLM mitigates via configurable all‑gather or all‑to‑all dispatch strategies.
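Top-k gate weights are typically the k largest softmax scores per token, renormalized so the combine weights sum to 1. A minimal numpy sketch (shapes are illustrative); note that each unit of k means every token's activations travel to one more expert, which is the communication cost the text mentions:

```python
import numpy as np

# Top-k gating: keep the k highest-probability experts per token and
# renormalize their scores into combine weights.
rng = np.random.default_rng(2)
n_tokens, n_experts, k = 5, 8, 2

logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

topk_idx = np.argpartition(probs, -k, axis=-1)[:, -k:]   # ids of k best experts
topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
weights = topk_p / topk_p.sum(axis=-1, keepdims=True)    # renormalize to sum 1
```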
4. Token Distribution Optimization – To avoid token‑distribution imbalance, QLM employs expert capacity limits and router load‑balancing techniques such as an auxiliary load‑balancing loss (aux‑loss) and the Sinkhorn algorithm, giving experts more uniform training opportunities.
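One common form of the auxiliary loss (used in GShard and Switch Transformer; we assume a similar formulation here) is aux = N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean gate probability for expert i; it is minimized when routing is uniform. A numpy sketch:

```python
import numpy as np

# Switch/GShard-style auxiliary load-balancing loss over one batch.
rng = np.random.default_rng(3)
n_tokens, n_experts = 64, 4

logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                            # top-1 routing

f = np.bincount(chosen, minlength=n_experts) / n_tokens   # routed fraction f_i
P = probs.mean(axis=0)                                    # mean gate prob P_i
aux_loss = n_experts * np.sum(f * P)
# Perfectly uniform routing gives aux_loss == 1.0; imbalance pushes it higher.
```

Added to the main loss with a small coefficient, the gradient through P nudges the router away from overloading any single expert.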
5. Expert Parallelism – Experts are placed on different GPUs, extending the parallelism beyond tensor and pipeline parallelism and enabling training of extremely large MoE models.
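The bookkeeping behind expert parallelism can be shown without multiple GPUs: each rank owns a contiguous block of experts, and given the routing decisions, every rank computes how many tokens it must send to every other rank before the all-to-all dispatch. A single-process sketch with made-up numbers (the contiguous-block placement is an assumption, not necessarily QLM's scheme):

```python
import numpy as np

# Expert-parallel placement: 8 experts sharded over 4 expert-parallel ranks,
# each rank holding a contiguous block of experts.
n_experts, ep_size = 8, 4
experts_per_rank = n_experts // ep_size
owner = np.arange(n_experts) // experts_per_rank   # expert id -> owning rank

# Given this rank's routing decisions, build the all-to-all send plan:
# how many of its tokens go to each rank.
chosen = np.array([0, 3, 3, 5, 6, 1, 7, 2])        # example expert choices
send_counts = np.bincount(owner[chosen], minlength=ep_size)
```

In a real run, `send_counts` (and its all-to-all-exchanged counterpart, the receive counts) sizes the communication buffers for dispatching tokens to remote experts.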
6. Grouped‑GEMM – By grouping multiple expert weight matrices and corresponding input matrices into larger matrices, QLM leverages CUTLASS’s grouped‑gemm to accelerate matrix multiplication in a multi‑expert setting.
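The payoff of grouping is that many tiny per-token multiplications become a few large per-expert GEMMs (which CUTLASS's grouped GEMM then fuses into one kernel launch on GPU). A CPU-side numpy sketch of the permute/compute/un-permute pattern, with illustrative shapes, verifying it matches the naive per-token loop:

```python
import numpy as np

# Grouped-GEMM idea: sort tokens so same-expert tokens are contiguous,
# run one GEMM per expert, then restore the original token order.
rng = np.random.default_rng(4)
n_tokens, d_model, d_ff, n_experts = 16, 8, 12, 4

x = rng.standard_normal((n_tokens, d_model))
chosen = rng.integers(0, n_experts, n_tokens)
W = rng.standard_normal((n_experts, d_model, d_ff))   # one weight per expert

# Naive baseline: one tiny matmul per token.
naive = np.stack([x[i] @ W[chosen[i]] for i in range(n_tokens)])

# Grouped: permute tokens by expert, one large GEMM per expert, un-permute.
order = np.argsort(chosen, kind="stable")
xs, out = x[order], np.empty((n_tokens, d_ff))
counts = np.bincount(chosen, minlength=n_experts)
offset = 0
for e in range(n_experts):
    out[offset:offset + counts[e]] = xs[offset:offset + counts[e]] @ W[e]
    offset += counts[e]
grouped = np.empty_like(out)
grouped[order] = out                                  # restore token order

assert np.allclose(naive, grouped)
```

The two results are identical; the grouped form simply exposes larger, GPU-friendly matrix shapes.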
Empirical results on the 360zhinao 2B MoE model show up to a 50% performance boost.
Conclusion
MoE represents a key trend in Transformer evolution, offering faster training and inference than dense models of comparable parameter count while introducing challenges such as increased communication and memory demand. The QLM platform now supports a suite of optimizations (fine‑grained and shared experts, top‑k gating, balanced token routing, expert parallelism, and grouped‑GEMM) to address these challenges, and it will continue to evolve for 360's internal large‑model workloads.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.