
How to Build and Train Sub‑1B Language Models from Scratch: Resources & Tips

This guide compiles open‑source repositories, research papers, and practical tricks for training small language models with fewer than 1 billion parameters, so that readers can learn by reproducing models such as nanoGPT, tinyLlama, and Phi‑1.5.

NewBeeNLP

Jul 15, 2024

The author believes the best way to learn is to build a model from the ground up. With limited hardware in mind, this curated list covers resources for training sub‑1B language models.

nanoGPT – Karpathy's minimal yet complete reimplementation of GPT‑2 training. GPT‑2 itself comes in four sizes, from 0.1B to 1.5B parameters.

https://www.kaggle.com/code/pritishmishra/gpt-training-on-wikipedia-dataset-from-scratch

https://zhuanlan.zhihu.com/p/79714797

https://zhuanlan.zhihu.com/p/606339093

https://finisky.github.io/2020/05/01/pretrainchinesegpt/

https://zhuanlan.zhihu.com/p/656758138

https://github.com/minimalist-nlp/gpt2-text-generation

tinyLlama – a 1.1B Llama replica trained over 90 days on 16 × A100‑40G GPUs. It uses the same architecture as Llama, so it can serve as a drop‑in replacement in Llama‑based projects.

pythia – EleutherAI’s repository offering models from 14 M to 12 B parameters for academic research.

OLMo – AllenAI’s open‑source LLM with 1B and 7B variants, providing full training data, code, and checkpoints.

Qwen1.5 – Alibaba's Chinese‑focused LLM family, whose smallest version is 0.5B; regarded as a top performer on Chinese tasks.

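Phi‑1.5 – Microsoft's 1.3 B model trained on high‑quality textbook‑style data. Its predecessor Phi‑1 (1.3 B, with a 350 M variant) was trained for four days on eight A100 GPUs, using about 6 B tokens of filtered web data plus synthetic textbook data. Phi‑2 (2.7 B) followed without a formal paper.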

OpenELM – Apple’s suite of models ranging from 0.27 B to 3 B, targeting mobile deployment.
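The hardware figures quoted in the list above can be turned into GPU-hours, which makes it easier to judge whether a run fits a limited budget. The following is a minimal sketch: the helper names are hypothetical, and the single-GPU estimate assumes perfect linear scaling, which real training runs rarely achieve.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps `num_gpus` busy for `days` days."""
    return num_gpus * days * 24


def days_on_cluster(total_gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days needed to spend `total_gpu_hours` on `num_gpus` GPUs.

    Assumes perfect linear scaling, so treat the result as a lower bound.
    """
    return total_gpu_hours / (num_gpus * 24)


# Figures as quoted in the list above.
tinyllama_hours = gpu_hours(num_gpus=16, days=90)
phi_hours = gpu_hours(num_gpus=8, days=4)

print(f"tinyLlama-style run: {tinyllama_hours:,.0f} A100-hours")
print(f"Phi-style run:       {phi_hours:,.0f} A100-hours")
print(f"Phi-style run on a single GPU: {days_on_cluster(phi_hours, 1):.0f} days")
```

By this arithmetic the Phi-style budget is about 45 times smaller than the tinyLlama-style one, which is why small, carefully filtered datasets are attractive when hardware is scarce.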

Community projects and smaller‑scale experiments include:

https://github.com/charent/ChatLM-mini-Chinese – 0.2 B Chinese model based on T5.

https://github.com/jiahe7ay/MINI_LLM – 1.4 B Chinese model built on Qwen.

https://github.com/DLLXW/baby-llama2-chinese – Llama‑2‑based Chinese model, originally planned at 0.5 B but scaled down to 0.2 B.

https://github.com/OpenBMB/MiniCPM – 2.7 B model claimed to rival Mistral‑7B.

https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM – 2 B Chinese model still in training.

https://github.com/keeeeenw/MicroLlama – 0.3 B Llama variant, a further miniaturization of TinyLlama.

https://github.com/zhanshijinwat/Steel-LLM – Planned pre‑training project, not yet started.

Additional practical tips and resources for training small models:

Book "Build a Large Language Model (From Scratch)" by Sebastian Raschka (13k ★ on GitHub, still in progress at the time of writing).

Awesome Chinese LLM list – curated datasets.

Paper "MobileLLM" – training tricks for compact models.

Article "Llama from Scratch" – analysis of key Llama components.

"Rethinking Optimization and Architecture for Tiny Language Models" – detailed review (https://zhuanlan.zhihu.com/p/681614203).

MNBVC – massive Chinese corpus for training.

RedPajama – an open reproduction of Llama's training dataset.
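When choosing a size for your own model, it helps to estimate the parameter count from the architecture before training anything. The sketch below counts parameters for a GPT-2-style decoder with tied input and output embeddings. The layer counts and widths are the publicly released GPT-2 configurations rather than values taken from this article, and the `GPTConfig` and `count_params` names are illustrative, not nanoGPT's API.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    n_layer: int
    n_embd: int
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    block_size: int = 1024   # maximum context length


def count_params(cfg: GPTConfig) -> int:
    """Parameter count of a GPT-2-style decoder with tied embeddings."""
    d = cfg.n_embd
    token_emb = cfg.vocab_size * d  # shared with the output projection
    pos_emb = cfg.block_size * d    # learned positional embeddings
    # Per block: attention (4d^2 + 4d), MLP with 4x expansion (8d^2 + 5d),
    # and two LayerNorms (4d), all including biases.
    per_block = 12 * d * d + 13 * d
    final_norm = 2 * d
    return token_emb + pos_emb + cfg.n_layer * per_block + final_norm


# Released GPT-2 configurations (assumed here, not stated in the article).
sizes = {
    "gpt2": GPTConfig(n_layer=12, n_embd=768),
    "gpt2-medium": GPTConfig(n_layer=24, n_embd=1024),
    "gpt2-large": GPTConfig(n_layer=36, n_embd=1280),
    "gpt2-xl": GPTConfig(n_layer=48, n_embd=1600),
}

for name, cfg in sizes.items():
    total = count_params(cfg)
    emb_share = cfg.vocab_size * cfg.n_embd / total
    print(f"{name:12s} {total / 1e6:8.1f}M params, "
          f"token embeddings = {emb_share:.0%} of total")
```

The printed embedding share grows as the model shrinks, which is one reason vocabulary size and embedding sharing matter more for compact models than for large ones.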
