KwaiCoder-23BA4-v1: An Efficient Large Code Generation Model via Pruning, Knowledge Distillation, and Granular Upcycling
KwaiCoder-23BA4-v1 is a 23B-parameter wide MoE code model that reaches state-of-the-art results on HumanEval, BigCodeBench, and Fill-in-Middle benchmarks. Its recipe combines high-quality data, a cost-effective training pipeline of model pruning, knowledge distillation, and fine-grained merging, and extensive ablation studies.
The KwaiCoder-23BA4-v1 model, released by the KwaiPilot team, is a 23B‑parameter wide mixture‑of‑experts (MoE) code‑generation model trained with a highly efficient pipeline that reduces training cost to roughly 1/30 of traditional methods while attaining new state‑of‑the‑art (SOTA) results on multiple code‑related benchmarks.
High‑Quality Data : The team curated a 3‑trillion‑token high‑quality pre‑training dataset comprising code, mathematical, and knowledge‑type texts. Data selection emphasized relevance and consistency over sheer volume, employing model‑based filtering, synthetic data augmentation, and multi‑stage cleaning to ensure dense knowledge content.
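The multi-stage cleaning described above can be sketched as a small pipeline. This is a minimal illustration, not the team's actual tooling: `quality_score` stands in for a real model-based scorer, and the threshold is hypothetical.

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates by content hash (one stage of multi-stage cleaning)."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def filter_corpus(docs, quality_score, threshold=0.5):
    """Keep only deduplicated documents whose model-assigned quality
    score clears the threshold (model-based filtering)."""
    return [d for d in dedup(docs) if quality_score(d) >= threshold]
```

In a real pipeline the scorer would be a trained quality classifier and deduplication would also cover near-duplicates (e.g. MinHash), but the control flow is the same.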
Efficient Training Route : The training process integrates three key techniques:
Pruning : Importance estimation over MLP channels, LayerNorm, attention heads, and whole layers guides parameter reduction; this analysis shows that many open-source models carry low knowledge density (redundant parameters), especially in their deeper layers.
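As a concrete illustration, one common activation-based importance criterion for MLP channels is sketched below (the report does not specify the exact criteria used for each component, so this is an assumption, not the KwaiPilot method):

```python
import numpy as np

def mlp_neuron_importance(acts):
    """Score each hidden channel by its mean absolute activation
    over a calibration batch. acts: (n_tokens, hidden)."""
    return np.abs(acts).mean(axis=0)

def prune_mlp(W_in, W_out, acts, keep_ratio=0.5):
    """Keep only the highest-importance channels: drop rows of W_in
    and the matching columns of W_out.
    W_in: (hidden, d_model), W_out: (d_model, hidden)."""
    scores = mlp_neuron_importance(acts)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return W_in[keep, :], W_out[:, keep]
```

Head and layer pruning follow the same pattern at a coarser granularity: estimate importance on calibration data, then remove the lowest-scoring units.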
Knowledge Distillation : The unpruned model serves as a teacher, while the pruned model is the student; intermediate representations and logits are distilled to close the performance gap.
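A minimal sketch of a combined distillation objective of this kind, assuming forward KL on temperature-scaled logits plus MSE on intermediate hidden states (the report's exact loss weighting and temperature are not given here):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, student_h, teacher_h,
                 T=2.0, alpha=0.5):
    """Mix of (a) KL between temperature-softened teacher and student
    distributions and (b) MSE between intermediate representations."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(-1).mean() * T * T
    mse = ((student_h - teacher_h) ** 2).mean()
    return alpha * kl + (1 - alpha) * mse
```

The loss is zero when the student exactly matches the teacher, which is why distillation from the unpruned parent closes the gap opened by pruning.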
Granular Upcycling (Fine‑grained Merging) : After pruning and distillation, the student model is split by width to create expert sub‑models, which are merged into a MoE architecture with parameter scaling to maintain lossless integration.
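For a linear FFN (activation omitted for simplicity), width-splitting is exactly lossless: the expert outputs sum back to the dense output. The sketch below shows that identity; with a top-k router activating k of n experts, outputs would additionally need rescaling (hence the "parameter scaling" mentioned above), and none of this is the team's verbatim recipe.

```python
import numpy as np

def split_ffn_into_experts(W_in, W_out, n_experts):
    """Slice a dense FFN along its hidden width into n equal-width experts.
    W_in: (hidden, d_model), W_out: (d_model, hidden)."""
    hidden = W_in.shape[0]
    assert hidden % n_experts == 0
    chunk = hidden // n_experts
    return [(W_in[e * chunk:(e + 1) * chunk], W_out[:, e * chunk:(e + 1) * chunk])
            for e in range(n_experts)]

def dense_ffn(x, W_in, W_out):
    return (x @ W_in.T) @ W_out.T

def moe_all_experts(x, experts):
    """With every expert active, the width-split MoE reproduces the
    dense FFN output exactly (in this linear sketch)."""
    return sum((x @ wi.T) @ wo.T for wi, wo in experts)
```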
Model Performance : On HumanEval and HumanEval+ the model reaches Pass@1 scores of 82.9% and 76.2%, surpassing the previous best base model (OpenCoder‑8B). On the BigCodeBench‑Complete and ‑Hard subsets, KwaiCoder‑23BA4‑v1, with only 4B activated parameters, trails only Qwen2.5‑Coder‑32B. It also reaches SOTA levels on multilingual (MultiPL‑E) and Fill‑in‑Middle tasks.
Ablation Studies : Over 350 experiments showed that excessive width pruning harms final performance, while the combined pruning‑distillation‑upcycling pipeline stabilizes early‑stage loss and raises the model's capability ceiling compared with traditional upcycling or direct continued pre‑training (CPT).
Training Details : The final CPT stage employs a three‑phase training schedule with publicly disclosed data ratios and learning‑rate schedules, facilitating reproducibility and further community development.
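As an illustration of what such a schedule can look like, here is a common warmup-plus-cosine learning-rate form; the actual KwaiCoder phase boundaries, peak rates, and decay shape are in the original report and are not reproduced here.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr over the
    remaining steps (a typical CPT schedule shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

A multi-phase schedule would chain several such segments, typically with a lower peak rate in each later phase.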
Conclusion & Future Work : KwaiCoder‑23BA4‑v1 marks a milestone in cost‑effective code‑generation model development. Future directions include deeper exploration of pruning and knowledge compression, designing more efficient training and inference algorithms leveraging customizable model structures, and expanding open‑source collaborations with academia and industry.