KwaiCoder-23BA4-v1: An Efficient Large Code Generation Model via Pruning, Knowledge Distillation, and Granular Upcycling
KwaiCoder-23BA4-v1 is a 23B-parameter wide MoE code model that reaches state-of-the-art results on HumanEval, BigCodeBench, and Fill-in-Middle benchmarks. Its recipe combines high-quality data, a cost-effective training pipeline of model pruning, knowledge distillation, and fine-grained merging, and extensive ablation studies.
The KwaiCoder-23BA4-v1 model, released by the KwaiPilot team, is a 23B‑parameter wide mixture‑of‑experts (MoE) code‑generation model trained with a highly efficient pipeline that reduces training cost to roughly 1/30 of traditional methods while attaining new state‑of‑the‑art (SOTA) results on multiple code‑related benchmarks.
High‑Quality Data : The team curated a 3‑trillion‑token high‑quality pre‑training dataset comprising code, mathematical, and knowledge‑type texts. Data selection emphasized relevance and consistency over sheer volume, employing model‑based filtering, synthetic data augmentation, and multi‑stage cleaning to ensure dense knowledge content.
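The multi-stage cleaning described above can be sketched as a small pipeline. This is a minimal illustration, not the team's actual tooling: `quality_score` stands in for a real model-based scorer, and the threshold is hypothetical.

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates by content hash (one stage of multi-stage cleaning)."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def filter_corpus(docs, quality_score, threshold=0.5):
    """Keep only deduplicated documents whose model-assigned quality
    score clears the threshold (model-based filtering)."""
    return [d for d in dedup(docs) if quality_score(d) >= threshold]
```

In a real pipeline the scorer would be a trained quality classifier and deduplication would also cover near-duplicates (e.g. MinHash), but the control flow is the same.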
Efficient Training Route : The training process integrates three key techniques:
Pruning : Importance estimation over MLP channels, LayerNorm, attention heads, and whole layers guides parameter reduction; this analysis shows that many open-source models carry low knowledge density (redundant parameters), especially in their deeper layers.
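As a concrete illustration, one common activation-based importance criterion for MLP channels is sketched below (the report does not specify the exact criteria used for each component, so this is an assumption, not the KwaiPilot method):

```python
import numpy as np

def mlp_neuron_importance(acts):
    """Score each hidden channel by its mean absolute activation
    over a calibration batch. acts: (n_tokens, hidden)."""
    return np.abs(acts).mean(axis=0)

def prune_mlp(W_in, W_out, acts, keep_ratio=0.5):
    """Keep only the highest-importance channels: drop rows of W_in
    and the matching columns of W_out.
    W_in: (hidden, d_model), W_out: (d_model, hidden)."""
    scores = mlp_neuron_importance(acts)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return W_in[keep, :], W_out[:, keep]
```

Head and layer pruning follow the same pattern at a coarser granularity: estimate importance on calibration data, then remove the lowest-scoring units.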
Knowledge Distillation : The unpruned model serves as a teacher, while the pruned model is the student; intermediate representations and logits are distilled to close the performance gap.
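A minimal sketch of a combined distillation objective of this kind, assuming forward KL on temperature-scaled logits plus MSE on intermediate hidden states (the report's exact loss weighting and temperature are not given here):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, student_h, teacher_h,
                 T=2.0, alpha=0.5):
    """Mix of (a) KL between temperature-softened teacher and student
    distributions and (b) MSE between intermediate representations."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(-1).mean() * T * T
    mse = ((student_h - teacher_h) ** 2).mean()
    return alpha * kl + (1 - alpha) * mse
```

The loss is zero when the student exactly matches the teacher, which is why distillation from the unpruned parent closes the gap opened by pruning.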
Granular Upcycling (Fine‑grained Merging) : After pruning and distillation, the student model is split by width to create expert sub‑models, which are merged into a MoE architecture with parameter scaling to maintain lossless integration.
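For a linear FFN (activation omitted for simplicity), width-splitting is exactly lossless: the expert outputs sum back to the dense output. The sketch below shows that identity; with a top-k router activating k of n experts, outputs would additionally need rescaling (hence the "parameter scaling" mentioned above), and none of this is the team's verbatim recipe.

```python
import numpy as np

def split_ffn_into_experts(W_in, W_out, n_experts):
    """Slice a dense FFN along its hidden width into n equal-width experts.
    W_in: (hidden, d_model), W_out: (d_model, hidden)."""
    hidden = W_in.shape[0]
    assert hidden % n_experts == 0
    chunk = hidden // n_experts
    return [(W_in[e * chunk:(e + 1) * chunk], W_out[:, e * chunk:(e + 1) * chunk])
            for e in range(n_experts)]

def dense_ffn(x, W_in, W_out):
    return (x @ W_in.T) @ W_out.T

def moe_all_experts(x, experts):
    """With every expert active, the width-split MoE reproduces the
    dense FFN output exactly (in this linear sketch)."""
    return sum((x @ wi.T) @ wo.T for wi, wo in experts)
```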
Model Performance : On HumanEval and HumanEval+ the model reaches Pass@1 scores of 82.9% and 76.2%, surpassing the previous best base model (OpenCoder‑8B). On the BigCodeBench‑Complete and ‑Hard subsets, KwaiCoder‑23BA4‑v1, with only 4B activated parameters, trails only Qwen2.5‑Coder‑32B. It also reaches SOTA levels on multilingual (MultiPL‑E) and Fill‑in‑Middle tasks.
Ablation Studies : Over 350 experiments showed that excessive width pruning harms final performance, while the combined pruning‑distillation‑upcycling pipeline stabilizes early‑stage loss and raises the model's capability ceiling compared with traditional upcycling or direct continued pre‑training (CPT).
Training Details : The final CPT stage employs a three‑phase training schedule with publicly disclosed data ratios and learning‑rate schedules, facilitating reproducibility and further community development.
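As an illustration of what such a schedule can look like, here is a common warmup-plus-cosine learning-rate form; the actual KwaiCoder phase boundaries, peak rates, and decay shape are in the original report and are not reproduced here.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr over the
    remaining steps (a typical CPT schedule shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```

A multi-phase schedule would chain several such segments, typically with a lower peak rate in each later phase.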
Conclusion & Future Work : KwaiCoder‑23BA4‑v1 marks a milestone in cost‑effective code‑generation model development. Future directions include deeper exploration of pruning and knowledge compression, designing more efficient training and inference algorithms leveraging customizable model structures, and expanding open‑source collaborations with academia and industry.