TransLLM: A Framework for Cross‑Language Transfer of Conversational Large Language Models
This article presents TransLLM, a cross‑language migration framework that transfers high‑quality conversational LLMs to low‑resource languages. It preserves advanced capabilities through Recovery KD, LoRA‑based continual pre‑training, and a translation‑thinking‑chain, with extensive experiments showing performance and safety that surpass ChatGPT and approach GPT‑4.
Background : Most research on large language models focuses on high‑resource languages such as English and Chinese, leaving low‑resource languages without high‑quality dialogue LLMs or strong contextual understanding.
Problem : Large language models exhibit strong abilities only in the languages they were trained on; in languages like Thai they often misunderstand prompts and fail to refuse harmful queries, exposing safety risks.
TransLLM Framework :
Stage 1 – Vocabulary expansion and LoRA‑based parameter freezing to improve basic language ability without catastrophic forgetting.
Stage 2 – Target‑language monolingual pre‑training (e.g., Thai) to boost foundational competence.
Stage 3 – Translation pre‑training (TCOT) that teaches the model to translate between the target language and English while preserving English knowledge.
Stage 4 – Migration training using three data sources:
- Recovery KD: self‑generated data from the original chat LLM to retain high‑level capabilities.
- TCOT (translation‑thinking‑chain): decomposes complex tasks into translate‑then‑answer steps.
- Instruction‑translation data: teaches the model when it should translate versus answer directly.
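The parameter freezing in Stage 1 can be sketched as a minimal LoRA layer in PyTorch. This is an illustrative sketch, not the paper's implementation: the base weights are frozen, and only a low‑rank update (the `lora_A`/`lora_B` names here are assumptions) is trained. Initializing B to zero makes the adapted layer start out identical to the original model, which is what separates old knowledge from new and guards against catastrophic forgetting.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear layer augmented with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the original weights
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)    # B = 0 -> adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update B(A(x))
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

layer = LoRALinear(nn.Linear(32, 32))
x = torch.randn(4, 32)
out = layer(x)
# Only the adapter weights remain trainable
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
```

Because `lora_B` starts at zero, the first forward pass reproduces the base layer exactly; training then moves only the small adapter matrices while the original model stays intact.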
Key Techniques :
Recovery KD – uses the chat LLM itself to generate distillation data, avoiding external high‑quality datasets.
LoRA – freezes the original parameters and learns new language‑specific knowledge in low‑rank adapter weights, keeping old and new knowledge in separate parameters so they do not interfere.
Translation‑thinking‑chain – translates the target‑language query to English, leverages the strong English knowledge, then translates the answer back.
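The three-step chain above can be sketched end to end. In this minimal sketch, `translate` and `answer_in_english` are hypothetical stubs backed by toy lookup tables; in TransLLM these capabilities are learned by the model itself during the translation pre‑training and migration stages.

```python
# Minimal sketch of translation-thinking-chain (TCOT) inference.
# The helper functions are hypothetical stand-ins, not the paper's code.

def translate(text: str, src: str, dst: str) -> str:
    # Placeholder: a toy phrase table standing in for the model's
    # learned translation ability between Thai and English.
    table = {
        ("th", "en"): {"เมืองหลวงของไทยคืออะไร": "What is the capital of Thailand?"},
        ("en", "th"): {"The capital of Thailand is Bangkok.": "เมืองหลวงของไทยคือกรุงเทพฯ"},
    }
    return table[(src, dst)][text]

def answer_in_english(query: str) -> str:
    # Placeholder for the chat model's strong English knowledge.
    return {"What is the capital of Thailand?": "The capital of Thailand is Bangkok."}[query]

def tcot_answer(thai_query: str) -> str:
    english_query = translate(thai_query, "th", "en")   # step 1: translate to English
    english_answer = answer_in_english(english_query)   # step 2: answer in English
    return translate(english_answer, "en", "th")        # step 3: translate back

print(tcot_answer("เมืองหลวงของไทยคืออะไร"))
```

The design choice is that each step stays in the regime where the model is strongest: reasoning happens entirely in English, and the target language is touched only at the translation boundaries.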
Experiments :
Model: LLaMA‑2‑chat‑7B transferred to Thai using ~11 B Thai tokens and 1 M parallel sentences.
Benchmarks: MT‑Bench (Thai), Alpaca‑Eval (Thai), AdvBenchmark (Thai) evaluated by both human annotators and GPT‑4.
Results: TransLLM achieves a higher win rate than ChatGPT on MT‑Bench (Thai) and approaches GPT‑4, reaches a 94.61 % refusal rate on harmful queries in AdvBenchmark, and surpasses a baseline that bridges through the external NLLB‑3.3B translator.
Ablation studies confirm the importance of migrating a chat model, target‑language pre‑training, translation training, Recovery KD, and the synergy between LoRA and KD.
Conclusions : TransLLM successfully transfers advanced conversational abilities to low‑resource languages while mitigating catastrophic forgetting and improving safety; the framework is open‑source and the paper is available on arXiv.
Paper: https://arxiv.org/abs/2405.13923 Code & Data: https://github.com/hy5468/TransLLM
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.