
Understanding Pretraining and Fine‑Tuning of Large Language Models: Methods, Resources, and Practical Applications

This article explains the concepts of pretraining and fine‑tuning for large language models, compares full‑parameter, LoRA and QLoRA approaches, discusses resource consumption, introduces the ModelScope SWIFT framework with code examples, and shows how fine‑tuning can improve data‑visualisation tasks while reducing token usage.

Rare Earth Juejin Tech Community

Background – Large language models (LLMs) are initially trained as pure text‑generation models; without fine‑tuning or reinforcement learning from human feedback (RLHF) they cannot follow user instructions. Examples from the open‑source Yi‑34B model illustrate that pre‑training alone yields grammatically correct but non‑dialogue outputs.

After chat‑style data fine‑tuning, the model learns to engage in normal conversations using a question‑answer format.

What is Pre‑training? Pre‑training is a self‑supervised learning phase in which the model consumes massive amounts of unlabeled text (and code), learning to predict the next token and thereby acquiring general linguistic, semantic, and reasoning abilities. High‑quality pre‑training data improve generalisation and reduce the cost of downstream task training.

When to Use Pre‑training

Training a new general‑purpose LLM from scratch.

Supplementing Chinese data for base models such as LLaMA‑2.

Increasing code‑related data to boost code‑generation capabilities.

Adapting to a newly released programming language.

What is Fine‑tuning? Fine‑tuning uses supervised question‑answer pairs to adjust the pretrained parameters so the model knows which output to produce for a given input, building on the knowledge already learned during pre‑training.

Customisation via Fine‑tuning

Control style, tone, format (e.g., concise responses under 50 characters).

Improve reliability of generated outputs such as JSON or pure code blocks.

Enhance specific downstream tasks like Pandas data‑analysis code generation.

Reduce token consumption by embedding frequent patterns in the model instead of lengthy prompts.

Can Fine‑tuning Add New Knowledge? Fine‑tuning is generally not recommended for injecting factual knowledge; Retrieval‑Augmented Generation (RAG) with vector search is the preferred approach for trustworthy knowledge injection.

Fine‑tuning Methods

Full‑parameter fine‑tuning – adjusts every weight in the model.

LoRA – adds low‑rank adapters (two small matrices) to approximate the weight updates, drastically reducing memory usage.

QLoRA – quantises the base model to 4‑bit, freezes it, and trains low‑rank adapters, further cutting GPU memory (e.g., a 33B model drops from ~80 GB to ~20 GB).
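The low‑rank idea behind LoRA can be sketched in a few lines of NumPy (a minimal illustration, not how SWIFT implements it; the dimensions and rank here are purely illustrative):

```python
import numpy as np

d_in, d_out, r = 4096, 4096, 8  # rank r is tiny compared with the hidden size

W = np.random.randn(d_out, d_in).astype(np.float32)       # frozen pretrained weight
A = (np.random.randn(r, d_in) * 0.01).astype(np.float32)  # trainable down-projection
B = np.zeros((d_out, r), dtype=np.float32)                # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W + B @ A, but the product is never materialised:
    # only the small factors A and B receive gradient updates during training.
    return W @ x + B @ (A @ x)

# Trainable parameters shrink from d_in * d_out to r * (d_in + d_out)
print(d_in * d_out)        # 16777216 full weights
print(r * (d_in + d_out))  # 65536 adapter weights, a 256x reduction
```

Because B starts at zero, the adapted model is initially identical to the base model, and training only nudges the low‑rank update away from zero.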

Empirical results on SQL generation show LoRA can match full‑parameter fine‑tuning, with LoRA‑13B slightly outperforming full‑parameter‑7B.

Resource consumption comparison (6B model):

Training Method    GPU Memory (MB)
Full‑parameter     68450
LoRA               15226
4‑bit QLoRA         8422

Thus LoRA/QLoRA provide a cost‑effective balance of performance and hardware requirements.
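These figures are consistent with a back‑of‑envelope estimate of raw weight storage (a rough sketch; optimiser state, gradients, activations, and quantisation constants account for the remainder of the measured footprints):

```python
def weight_gb(n_params, bytes_per_param):
    # Raw weight storage only, ignoring optimiser state, gradients,
    # activations, and quantisation metadata.
    return n_params * bytes_per_param / 1e9

print(weight_gb(33e9, 2.0))  # fp16 33B model: 66.0 GB of weights alone
print(weight_gb(33e9, 0.5))  # 4-bit 33B model: 16.5 GB of weights alone
print(weight_gb(6e9, 2.0))   # fp16 6B model: 12.0 GB of weights alone
```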

Choosing a Fine‑tuning Framework – ModelScope SWIFT (Scalable lightWeight Infrastructure for Fine‑Tuning) offers a CLI that launches training from just a few configuration lines. Example command:

PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_id_or_path 01ai/Yi-6B \
    --sft_type lora \
    --tuner_backend swift \
    --dtype fp16 \
    --output_dir output \
    --num_train_epochs 5 \
    --max_length 2048 \
    --lora_rank 8 \
    --gradient_checkpointing true \
    --batch_size 1 \
    --learning_rate 1e-4 \
    --custom_train_dataset_path /root/train.jsonl \
    --custom_val_dataset_path /root/train_eval.jsonl

Training data format (JSONL):

{"query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
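A small helper can produce this format from (query, response) pairs (a sketch; `write_jsonl` is a hypothetical name, and `ensure_ascii=False` keeps Chinese text human‑readable in the file):

```python
import json
import os
import tempfile

def write_jsonl(pairs, path):
    # One JSON object per line, matching the query/response format
    # expected by --custom_train_dataset_path above.
    with open(path, "w", encoding="utf-8") as f:
        for query, response in pairs:
            f.write(json.dumps({"query": query, "response": response},
                               ensure_ascii=False) + "\n")

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl([("11111", "22222"), ("aaaaa", "bbbbb")], path)
```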

Fine‑tuning for Data‑Visualization Scenarios – By converting natural‑language analytics requests into JSON chart configurations, LLMs can lower the barrier for non‑technical users. Prompt engineering with examples improves accuracy, but examples increase token cost. Fine‑tuning with annotated query/response pairs offers a scalable alternative.

Sample annotated data (excerpt):

{"query": "...用户需求: 今年第二季度各销售员在北方地区的销售额与退货率...", "response": "```json\n{\n  \"chartType\": \"TABLE_DETAIL\",\n  \"chartFields\": {\n    \"dimensions\": [\"销售员\"],\n    \"metrics\": [\"销售额\", \"退货率\"]\n  },\n  \"dimensionFilters\": [{\n    \"dimension\": \"地区\",\n    \"filter\": {\n      \"condition\": \"in\",\n      \"value\": [\"北方\"]\n    }\n  }],\n  \"chartTimeFilter\": {\n    \"granularity\": \"quarter\",\n    \"dayjsScript\": [\"dayjs().quarter(2).startOf('quarter')\", \"dayjs().quarter(2).endOf('quarter')\"]\n  }\n}\n```"}
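Before a response like this reaches the charting layer, it helps to check that the fenced JSON is well formed. A minimal parser, assuming only the `chartType` and `chartFields` keys visible in the sample above (`parse_chart_config` is a hypothetical helper, not part of SWIFT):

```python
import json
import re

FENCE = "`" * 3  # triple backtick, built indirectly to keep this block self-contained

def parse_chart_config(response: str) -> dict:
    # Extract the fenced JSON payload from the model response and
    # sanity-check the top-level keys used in the annotated data.
    pattern = FENCE + r"json\s*(.*?)\s*" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON block in response")
    config = json.loads(match.group(1))
    for key in ("chartType", "chartFields"):
        if key not in config:
            raise ValueError("missing required key: " + key)
    return config
```

Rejecting malformed responses early makes the 90% correctness figure below measurable: a response either parses into a valid config or counts as an error.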

Fine‑tuning Yi‑6B on 600+ such examples yields 90% correctness, approaching GPT‑4 few‑shot prompting (96%).

Summary

Pre‑trained LLMs are pure text‑continuation models; fine‑tuning enables them to follow instructions and perform specific tasks.

Adding examples in prompts improves performance but increases token usage.

LoRA‑based fine‑tuning offers a hardware‑efficient way to adapt models without large memory footprints.

Generating synthetic training data via seed tasks and GPT‑4 can rapidly produce high‑quality fine‑tuning corpora.

References

https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2

https://arxiv.org/pdf/2305.14314v1.pdf

http://www.yichen.ink/post/2023/07/13/LLM4IE/

Tags: LLM, Fine-tuning, LoRA, QLoRA, Data Visualization, Pretraining, ModelScope
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
