Instruction Fine-Tuning Practices for Huawei's Pangu Large Language Model
This presentation details the concepts, methodologies, and experimental results of instruction fine‑tuning for Huawei's Pangu large language model, covering model scale, architecture, training strategies, data quality, parallelism techniques, and case studies on Chinese‑English translation and Thai language adaptation.
Introduction: The talk focuses on instruction fine‑tuning of the Pangu large language model, a natural‑language model developed by Huawei's Text Machine Translation Lab, which conducts both academic research and product deployment.
What is a large model? Large models have parameter counts ranging from hundreds of millions to trillions, e.g., GPT‑3 (175 B) and Pangu‑Sigma (1 T). They typically use a decoder‑only Transformer architecture, consist of stacked masked‑multi‑head attention and feed‑forward layers, and have evolved from statistical to neural to pre‑trained to large‑scale language models, shifting from task‑specific assistants to general problem solvers.
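The decoder‑only stack described above can be sketched in miniature. This is a toy single‑head version; real implementations add layer normalisation, multiple heads, and learned positional information, all omitted here for brevity, and the weights and shapes are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention: each position may attend only
    to itself and earlier positions (the causal mask)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -1e9            # block attention to future tokens
    return softmax(scores) @ v

def decoder_block(x, attn_w, ffn_w):
    """One decoder layer: masked attention plus a position-wise
    feed-forward network, each wrapped in a residual connection."""
    Wq, Wk, Wv = attn_w
    W1, W2 = ffn_w
    x = x + causal_self_attention(x, Wq, Wk, Wv)
    x = x + np.maximum(x @ W1, 0) @ W2   # ReLU feed-forward
    return x
```

Because of the mask, the output at each position depends only on that position and the tokens before it, which is what makes next‑token pre‑training possible.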
Instruction fine‑tuning (also called supervised fine‑tuning, SFT) trains a model on supervised instruction data so that it learns to follow human commands and generalises to downstream tasks. Examples show a model learning a sentiment‑classification task from an instruction.
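A minimal sketch of what one supervised instruction example might look like; the field names and prompt template below are illustrative, not the actual Pangu training schema:

```python
# A toy SFT example for the sentiment task mentioned above: the model is
# trained to emit `output` given the formatted prompt. Field names are
# an assumption, not the real Pangu data format.
example = {
    "instruction": "Classify the sentiment of the sentence as positive or negative.",
    "input": "The service at this restaurant was wonderful.",
    "output": "positive",
}

def format_prompt(ex):
    """Concatenate instruction and input into the text the model sees;
    during SFT the loss is typically computed only on the answer tokens."""
    return f"{ex['instruction']}\nInput: {ex['input']}\nAnswer: "
```

The training target is then `format_prompt(example) + example["output"]`, with the loss masked over the prompt portion.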
Why fine‑tune? Pre‑training predicts the next token, which does not directly enable conversational QA. Instruction fine‑tuning lets the model understand and obey human instructions and generalise to tasks such as summarisation, sentiment analysis, QA, and natural‑language inference.
1. Compute foundation and training strategy
The fine‑tuning runs on the MindSpore framework with Ascend 910 chips. MindSpore is Huawei's deep‑learning framework, comparable to PyTorch, and the Ascend 910 delivers roughly 256 TFLOPS at half precision (FP16), about double a V100 and slightly below an A100.
Training strategies depend on model and data size: data parallelism for large data, pipeline or tensor parallelism for large models, and hybrid parallelism (combining pipeline, data, and tensor) for both.
Definitions: pipeline parallelism splits layers across devices; data parallelism replicates the whole model on each device and aggregates gradients; tensor parallelism splits weight matrices across devices.
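The three definitions above can be illustrated with toy helpers. This is a simulation only; real systems use collective‑communication primitives such as all‑reduce across devices rather than an in‑memory mean:

```python
import numpy as np

def all_reduce_mean(grads_per_device):
    """Data parallelism in miniature: each replica computes gradients on
    its own data shard, then an all-reduce averages them so every replica
    applies the same update (simulated here with a plain mean)."""
    return np.mean(grads_per_device, axis=0)

def split_layers(layers, n_stages):
    """Pipeline parallelism: assign consecutive layers to pipeline stages."""
    per = (len(layers) + n_stages - 1) // n_stages
    return [layers[i:i + per] for i in range(0, len(layers), per)]

def split_tensor(weight, n_devices):
    """Tensor parallelism: shard a single weight matrix column-wise
    across devices, so no device holds the full matrix."""
    return np.array_split(weight, n_devices, axis=1)
```

Hybrid parallelism combines all three: the layer stack is cut into pipeline stages, large matrices within a stage are sharded, and the whole arrangement is replicated over data shards.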
2. Chinese‑English translation full‑parameter fine‑tuning project
Using Pangu‑Sigma as the base, bilingual data were used to fine‑tune the model for Chinese‑English translation, improving performance in medical translation tests. Four aspects were explored:
Data format – adding a single translation example to the instruction improves fine‑tuning effectiveness.
Data quality – high‑quality data (selected by similarity scoring) yields better results; only 25 % of the full dataset was needed for optimal performance.
Data size – larger high‑quality datasets lead to better translation quality.
Model size – larger models (38 B vs 2.6 B) achieve higher BLEU scores but require more training resources (e.g., 11 days on 128 Ascend 910 chips for the 38 B model).
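The data‑format finding above, prepending a single translation example (a one‑shot prompt), might look like the following sketch; the template wording is an assumption, not the exact Pangu prompt:

```python
def build_prompt(src_sentence, demo_src, demo_tgt):
    """One-shot translation prompt: a single demonstration pair is
    prepended to the instruction before the sentence to translate.
    The wording of this template is illustrative only."""
    return (
        "Translate the following Chinese sentence into English.\n"
        f"Example:\nChinese: {demo_src}\nEnglish: {demo_tgt}\n"
        f"Chinese: {src_sentence}\nEnglish: "
    )
```

The demonstration pair anchors the expected output format, which is one plausible reason a single in‑context example improves fine‑tuning effectiveness.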
3. Thai language efficient‑parameter fine‑tuning project
Challenges: vocabulary mismatch (Thai characters absent from the original mixed Chinese‑English vocab), method selection among efficient fine‑tuning techniques, and limited Thai instruction data.
Vocabulary solution: expand the tokenizer with a Thai vocab (~10 k tokens) and augment the input‑output layers, followed by incremental pre‑training on mixed Chinese‑English and Thai data.
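Augmenting the input/output layers for the expanded Thai vocabulary amounts to growing the embedding matrix while preserving the trained rows. A sketch, with mean‑initialisation of the new rows as an assumed heuristic (the talk does not specify the init):

```python
import numpy as np

def expand_embeddings(emb, n_new, rng):
    """Grow the input embedding (and, symmetrically, the output
    projection) to cover newly added Thai tokens. Existing rows are kept
    unchanged so learned Chinese/English representations survive; new
    rows start near the mean of the old ones, one common heuristic."""
    mean = emb.mean(axis=0)
    new_rows = mean + 0.02 * rng.standard_normal((n_new, emb.shape[1]))
    return np.vstack([emb, new_rows])
```

The new rows then get trained during the incremental pre‑training pass on mixed Chinese‑English and Thai data.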
Efficient fine‑tuning methods compared: Adapter, Prefix, Prompt, and LoRA. LoRA was chosen for its comparable performance to full‑parameter fine‑tuning without added inference latency.
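A minimal sketch of why LoRA adds no inference latency: the learned low‑rank update can be folded back into the frozen weight before serving. Shapes and init scales below are illustrative:

```python
import numpy as np

class LoRALinear:
    """LoRA sketch: freeze W and learn a low-rank update B @ A, so only
    r * (d_in + d_out) parameters are trained. At deployment the product
    merges into W, which is why LoRA adds no serving latency."""
    def __init__(self, W, r, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01
        self.B = np.zeros((d_out, r))                # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the adapter into W for deployment."""
        return self.W + self.scale * (self.B @ self.A)
```

Because `B` is zero‑initialised, the layer is exactly the pretrained one at the start of training, and `merge()` shows the update disappearing into `W` at serving time.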
Data selection: a self‑guided approach first fine‑tunes on a random subset, evaluates learning difficulty, and then re‑fine‑tunes on the hardest 10 % of data. Data augmentation via instruction back‑translation (TAGBT) further diversifies the instruction set.
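The selection step of the self‑guided approach can be sketched as follows; `loss_fn` stands in for the difficulty score produced by the first‑round model and is an assumption of this sketch:

```python
def select_hardest(examples, loss_fn, fraction=0.10):
    """Self-guided selection sketch: score each instruction example by
    the per-example loss of a model fine-tuned on a random subset, then
    keep the hardest `fraction` for a second round of fine-tuning."""
    scored = sorted(examples, key=loss_fn, reverse=True)
    k = max(1, int(len(examples) * fraction))
    return scored[:k]
```

The intuition is that high‑loss examples are the ones the model has not yet learned, so concentrating the second pass on them is more sample‑efficient than another pass over random data.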
Evaluation: a baseline was built by translating Thai inputs to English, feeding them to the existing model (fine‑tuned before any Thai adaptation), and translating the outputs back to Thai. The LoRA‑fine‑tuned models outperformed this pivot‑translation baseline on both NLG and NLU tasks, though the incremental pre‑training step adds compute cost.
Takeaways
Data quality and quantity are crucial; strict filtering and large high‑quality datasets improve fine‑tuning.
Larger models have greater potential but higher training costs; balance is needed for production.
When compute is abundant, full‑parameter fine‑tuning yields the best performance; otherwise, efficient methods like LoRA achieve strong results with fewer resources. Parallelism strategy should match data and model size.
End of the presentation.
DataFunSummit