Mastering Fine‑Tuning Datasets: From Basics to Advanced LLM Techniques
This comprehensive guide explains the importance of fine‑tuning datasets for large language models, covering task classification, dataset formats, supervised and instruction tuning, domain adaptation, multimodal data, and practical code examples to help practitioners build effective training, validation, and test sets.
Hello everyone, I'm ConardLi.
After the previous tutorial on large model fine‑tuning, many asked about fine‑tuning datasets.
If you haven't read the last article, please review the previous sections first for better understanding.
How to fine‑tune your DeepSeek‑R1 into a domain expert?
How to get an unrestricted, network‑connected private DeepSeek with a local knowledge base?
Common questions include why fine‑tuning performance is poor, dataset format requirements, differences between training, validation, and test sets, where to find public domain datasets, fast annotation methods, and AI‑generated datasets.
We will address these questions in upcoming articles, following a learning path that covers prerequisite knowledge, dataset formats, data acquisition, semi‑automatic annotation tools, converting domain literature into datasets, and AI‑generated distilled datasets.
Proper dataset preparation is crucial; careless data can severely degrade fine‑tuning results. In my experience, over 80% of fine‑tuning issues stem from dataset problems.
Using a tutoring analogy, a poor textbook leads to poor learning outcomes regardless of teacher quality.
1. Common Categories of Fine‑Tuning Datasets
Understanding the required dataset format depends on the fine‑tuning task type. Different tasks (e.g., multimodal vs. pure text) demand different data structures.
In text fine‑tuning, supervised fine‑tuning (SFT) is the most widely used technique.
1.1 Pre‑training
Pre‑training teaches a model general language patterns using massive unstructured text.
Most mainstream LLMs (e.g., ChatGPT, DeepSeek) are autoregressive models that predict the next token based on previous tokens.
Tokens are the smallest units a model processes — typically words or subword pieces rather than whole sentences.
During inference, the model predicts tokens sequentially, and the quality of token prediction depends on the richness of the pre‑training dataset.
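To make the autoregressive idea concrete, here is a toy next‑token predictor — a bigram frequency model. This is a drastic simplification (real LLMs learn deep neural representations, not counts), but the decoding loop has the same shape: each token is predicted from what came before.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens most often follow it."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def generate(follows, start, max_tokens=5):
    """Autoregressive decoding: predict one token at a time from the last one."""
    out = [start]
    for _ in range(max_tokens):
        counter = follows.get(out[-1])
        if not counter:
            break  # nothing seen after this token in training data
        out.append(counter.most_common(1)[0][0])
    return " ".join(out)

corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
generated = generate(model, "the")
```

The richer the corpus, the better the counts — the same intuition behind "prediction quality depends on the richness of the pre‑training dataset."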
Pre‑training data can be raw books, webpages, or dialogues without a fixed format.
For domain‑specific fine‑tuning, structured datasets are required.
Pre‑training stage: like a baby hearing random sounds.
Instruction fine‑tuning stage: like teaching a child to answer specific questions.
1.2 Supervised Fine‑Tuning
Supervised Fine‑Tuning (SFT) uses labeled data to teach the model specific tasks.
Example: translation dataset
<code>{"input": "Hello", "output": "你好"}</code>
1.2.1 Instruction Fine‑Tuning
When a task requires more context, an additional <code>instruction</code> field is added.
<code>[{"instruction": "Translate this English sentence to French", "input": "Hello, how are you?", "output": "Bonjour, comment ça va ?"}, ...]</code>
Typical instruction‑tuning scenarios include intelligent education, office automation, translation, and data analysis.
Open‑source instruction datasets: Alpaca (≈52k samples) and others.
Alpaca dataset: 52k instruction‑follow samples generated by OpenAI's text‑davinci‑003, used for LLaMA instruction tuning.
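In practice, instruction records are rendered into a single prompt string before training. The sketch below uses a template in the Alpaca style; the exact header wording is illustrative, and the field names follow the common instruction/input/output convention shown above.

```python
def build_prompt(sample):
    """Render one instruction-tuning record into a training prompt.

    Records with an `input` field get an extra context section;
    the model is trained to continue after "### Response:".
    """
    if sample.get("input"):
        return (
            "Below is an instruction with additional context.\n\n"
            f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n"

sample = {
    "instruction": "Translate this English sentence to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment ça va ?",
}
prompt = build_prompt(sample)
```

During supervised fine‑tuning, the target text appended after the prompt is the record's `output` field.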
1.2.2 Dialogue Fine‑Tuning
Dialogue fine‑tuning trains models to generate coherent multi‑turn responses.
<code>[{"dialogue": [{"role": "user", "content": "今天天气怎么样?"}, {"role": "assistant", "content": "北京今日多云转晴,气温22℃,适合户外活动。"}, ...]}]</code>
Typical use cases: intelligent customer service, chatbots, voice assistants.
Open‑source dialogue dataset: Guanaco‑ShareGPT‑style (multilingual multi‑turn conversations).
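Before training, role‑tagged turns are usually flattened into one string with special markers. The `<|user|>`/`<|assistant|>` markers below are illustrative — real chat templates differ between model families, so check your base model's tokenizer documentation.

```python
def render_dialogue(turns):
    """Join role-tagged turns into a single training string.

    Each turn is prefixed with a role marker so the model learns
    where one speaker ends and the next begins.
    """
    parts = []
    for turn in turns:
        tag = "<|user|>" if turn["role"] == "user" else "<|assistant|>"
        parts.append(f"{tag}\n{turn['content']}")
    return "\n".join(parts)

dialogue = [
    {"role": "user", "content": "今天天气怎么样?"},
    {"role": "assistant", "content": "北京今日多云转晴,气温22℃,适合户外活动。"},
]
text = render_dialogue(dialogue)
```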
1.2.3 Domain Adaptation
Domain adaptation fine‑tunes a model on domain‑specific data (e.g., medical, legal, finance) to improve performance in specialized tasks.
<code>[{"instruction": "分析患者的症状描述", "input": "55岁男性,持续性胸骨后疼痛3小时,含服硝酸甘油无效", "output": "可能诊断:急性心肌梗死(STEMI)", "domain": "医疗"}, ...]</code>
Typical datasets: PubMedQA (medical), legal QA datasets, etc.
1.2.4 Text Classification
Text classification fine‑tuning maps texts to predefined labels.
<code>[{"text": "这款手机续航长达48小时,拍照惊艳", "label": "positive"}, {"text": "系统频繁卡顿", "label": "negative"}]</code>
Common scenarios: sentiment analysis, content moderation, news categorization, intent detection.
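Classification data is especially sensitive to label noise, so a cheap sanity pass pays off. This sketch assumes a three‑way sentiment label set — adjust `VALID_LABELS` to your own taxonomy.

```python
VALID_LABELS = {"positive", "negative", "neutral"}  # assumed label scheme

def check_classification_set(samples):
    """Keep records with non-empty text and a known label.

    Returns the cleaned list plus a count of dropped records,
    which is worth logging before any fine-tuning run.
    """
    clean, dropped = [], 0
    for s in samples:
        if s.get("text") and s.get("label") in VALID_LABELS:
            clean.append(s)
        else:
            dropped += 1
    return clean, dropped
```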
1.2.5 Model Inference Fine‑Tuning (Chain‑of‑Thought)
Chain‑of‑Thought (CoT) fine‑tuning adds a reasoning step before the final answer.
<code>[{"instruction": "解决数学应用题", "input": "小明买了3支铅笔,每支2元…", "chain_of_thought": ["铅笔单价2元 → 3支总价6元", "笔记本单价6元 → 5本总价30元", "合计36元"], "output": "总花费为36元"}]</code>
CoT is useful for code generation, math problem solving, complex data analysis, and legal/financial reasoning.
NuminaMath‑CoT dataset: ~860k Chinese high‑school math problems with step‑by‑step solutions.
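For training, the `chain_of_thought` list is usually folded into the target text so the model learns to emit its reasoning before the final answer. The "Step"/"Answer" labels below are illustrative formatting choices, not a fixed standard.

```python
def fold_cot(sample):
    """Merge reasoning steps and the final answer into one target string.

    The model then learns to produce the numbered steps first and
    the answer last, mirroring the CoT record structure.
    """
    steps = "\n".join(
        f"Step {i}: {s}" for i, s in enumerate(sample["chain_of_thought"], 1)
    )
    return f"{steps}\nAnswer: {sample['output']}"

sample = {
    "chain_of_thought": [
        "铅笔单价2元 → 3支总价6元",
        "笔记本单价6元 → 5本总价30元",
        "合计36元",
    ],
    "output": "总花费为36元",
}
target = fold_cot(sample)
```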
1.2.6 Knowledge Distillation
Knowledge Distillation transfers knowledge from a large teacher model to a smaller student model using soft labels.
Chinese‑DeepSeek‑R1 Distill dataset: 110k samples covering math and general tasks.
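The "soft labels" in distillation are the teacher's full probability distribution over tokens, softened with a temperature. A minimal sketch of that loss term, in plain Python for clarity (a real setup would use tensor ops and add the hard‑label cross‑entropy as well):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution — minimized when the student matches the teacher."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

Softening (temperature > 1) exposes the teacher's relative preferences among wrong answers, which is extra signal a hard one‑hot label would discard.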
1.3 Other Fine‑Tuning Techniques
1.3.1 Reinforcement Learning from Human Feedback (RLHF)
RLHF adds a reward model to optimize generation quality via reinforcement learning (e.g., PPO).
<code>[{"input": "请推荐一部科幻电影", "output": "《星际穿越》…", "reward_score": 4.5}, ...]</code>
Typical use cases: dialogue system alignment, style control, code generation.
Open‑source RLHF dataset: Dahoas/rm‑static (human preference rankings).
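Reward‑model training usually consumes *pairwise* preferences rather than raw scores. This sketch derives chosen/rejected pairs from scored records like the sample above; the `prompt`/`chosen`/`rejected` keys are the common convention, but field names vary by framework.

```python
from collections import defaultdict

def build_preference_pairs(samples):
    """Group scored responses by prompt and emit chosen/rejected pairs.

    For each prompt, responses are ranked by reward_score and each
    adjacent pair becomes one training example for a reward model.
    """
    by_prompt = defaultdict(list)
    for s in samples:
        by_prompt[s["input"]].append(s)
    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=lambda s: s["reward_score"], reverse=True)
        for better, worse in zip(ranked, ranked[1:]):
            pairs.append({
                "prompt": prompt,
                "chosen": better["output"],
                "rejected": worse["output"],
            })
    return pairs

samples = [
    {"input": "请推荐一部科幻电影", "output": "《星际穿越》…", "reward_score": 4.5},
    {"input": "请推荐一部科幻电影", "output": "不知道。", "reward_score": 1.0},
]
pairs = build_preference_pairs(samples)
```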
1.3.2 Multimodal Fine‑Tuning
Multimodal fine‑tuning incorporates images, audio, or video alongside text.
Requires a pretrained multimodal backbone.
<code>[{"text": "一只猫在追蝴蝶", "image_url": "https://example.com/cat.jpg", "caption": "橘色的猫追逐白色蝴蝶"}, {"audio": "audio.wav", "text": "会议录音转写…", "summary": "会议讨论了Q3销售目标"}]</code>
Typical scenarios: image‑text QA, video summarization, cross‑modal retrieval.
Multimodal dataset: HuggingFaceM4/the_cauldron (50 visual‑language datasets).
2. Common Formats of Fine‑Tuning Datasets
There is no strict format; most pipelines normalize data into a unified template.
Typical formats include Alpaca (instruction, input, output) and ShareGPT (conversations list with role tags).
2.1 Alpaca
Alpaca JSON objects contain <code>instruction</code>, <code>input</code>, and <code>output</code>, plus optional <code>system</code> and <code>history</code> fields.
2.2 ShareGPT
ShareGPT structures multi‑turn dialogues with role tags (human, gpt, function_call, observation) and supports tool calls.
OpenAI format is a simplified ShareGPT variant using a messages list.
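Converting between these formats is routine pipeline work. The sketch below maps one Alpaca record to the OpenAI‑style messages list; it assumes `history` is a list of [user, assistant] string pairs, which is the common convention but worth verifying against your toolkit's docs.

```python
def alpaca_to_messages(sample):
    """Convert one Alpaca record to an OpenAI-style messages list.

    Optional `system` and `history` fields are folded in first,
    then the current instruction (+ input) and its answer.
    """
    messages = []
    if sample.get("system"):
        messages.append({"role": "system", "content": sample["system"]})
    for user_msg, assistant_msg in sample.get("history", []):
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    user = sample["instruction"]
    if sample.get("input"):
        user += "\n" + sample["input"]
    messages.append({"role": "user", "content": user})
    messages.append({"role": "assistant", "content": sample["output"]})
    return messages

record = {
    "system": "You are a translator.",
    "instruction": "Translate this English sentence to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment ça va ?",
}
messages = alpaca_to_messages(record)
```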
2.3 Format Comparison
Dimension
Alpaca
ShareGPT
Core goal
Single‑turn instruction tasks
Multi‑turn dialogue & tool calls
Structure
JSON with instruction/input/output
Conversations list with role tags
Dialogue history
Via
historyfield
Implicit in
conversationsorder
Roles
Only instruction/output
Multiple roles with strict ordering
Tool support
None
Explicit
function_call/
observation3. Different Uses of Fine‑Tuning Datasets
Datasets are split into training, validation, and test sets, analogous to practice, mock exams, and final exams.
Training set: core material for learning.
Validation set: checks generalization and guides hyper‑parameter tuning.
Test set: final unbiased evaluation.
Proper splitting and isolation are essential to avoid data leakage.
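A minimal splitting sketch: shuffle once with a fixed seed, then carve out disjoint validation and test sets. The 80/10/10 ratio is a common default, not a rule — but the disjointness is non‑negotiable, since any overlap is exactly the data leakage warned about above.

```python
import random

def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle once, then slice into disjoint train/val/test sets.

    A fixed seed makes the split reproducible across runs, so the
    test set stays isolated from training for the whole project.
    """
    data = list(samples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

For small or imbalanced datasets, a stratified split (preserving label proportions per split) is usually the safer choice.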
Future articles will demonstrate practical tools for converting domain literature into fine‑tuning datasets.
Instant Consumer Technology Team