Comprehensive Guide to Building AI Datasets: From Source Collection to Data Augmentation and Validation
This guide walks through every stage of building a high-quality AI training dataset, especially for fine-tuning large language models such as DeepSeek-R1: locating open-source data, defining goals, collection, annotation, cleaning, large-scale processing, optional augmentation, splitting, and validation. It is organized as a series of numbered sections and illustrated throughout with a medical QA example.
1. Open‑source data sources
Typical repositories for downloading raw data include:
Kaggle: https://www.kaggle.com/
ModelScope: https://modelscope.cn/datasets
Hugging Face: https://huggingface.co
Baidu Paddle (飞桨): https://aistudio.baidu.com/datasetoverview
2. Dataset construction workflow (high‑level steps)
2.1 Define the goal
Define the problem: Clarify the task (e.g., medical-domain QA) and the required data format.
Determine the data type: Text, image, audio, etc.
Note: Ensure the final dataset matches the format required by the downstream model.
2.2 Data collection
Internal data: Extract from existing databases or logs (if available).
External data: Use public datasets, web scraping, or AI-assisted extraction.
Data generation: Synthesize data when necessary (optional).
2.3 Data annotation
Manual annotation: Human labeling.
Automatic annotation: Tools or pre-trained models.
2.4 Data cleaning
Handle missing values (fill or drop).
Deduplicate records.
Unify formats.
Detect and treat outliers.
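The cleaning steps above can be sketched in plain Python. The `clean_records` helper and the two-field record shape below are illustrative assumptions, not part of any library; adapt the field names to your own schema:

```python
def clean_records(records):
    """Drop records with missing/empty fields, unify whitespace, deduplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        q = (rec.get("Question") or "").strip()   # unify format: trim whitespace
        a = (rec.get("Response") or "").strip()
        if not q or not a:                        # handle missing values: drop
            continue
        key = (q, a)
        if key in seen:                           # deduplicate exact repeats
            continue
        seen.add(key)
        cleaned.append({"Question": q, "Response": a})
    return cleaned
```

Outlier detection (e.g., abnormally short or long answers) can be added as one more filter inside the loop.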
2.5 Large‑scale data processing
Typical pipeline:
Initial cleaning on big‑data platforms (Hive, HBase, Flink, MySQL, etc.).
AI‑driven correction of typos, grammar, and logical errors.
Human final review on a random sample.
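The final human-review step can be made reproducible by drawing a fixed-seed random sample. The `sample_for_review` helper below is a sketch, not an established tool:

```python
import random

def sample_for_review(records, k=100, seed=42):
    """Draw a reproducible random sample of up to k records for human review."""
    rng = random.Random(seed)          # fixed seed so reviewers see the same sample
    k = min(k, len(records))           # avoid sampling more than exists
    return rng.sample(records, k)
```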
2.6 Data augmentation (optional)
Image augmentation examples (rotation, cropping, brightness, noise) can be illustrated with Albumentations:

import albumentations as A

Random 90° rotation:

transform = A.RandomRotate90(p=0.5)
augmented_image = transform(image=image)['image']

Random cropping:

transform = A.RandomCrop(width=400, height=400, p=0.3)
augmented_image = transform(image=image)['image']

Combined brightness and blur:

transform = A.Compose([
    A.RandomBrightnessContrast(p=0.3),
    A.GaussianBlur(blur_limit=3, p=0.2)
])
augmented_image = transform(image=image)['image']

Text augmentation (synonym replacement, back-translation) and audio augmentation (speed change, noise addition) are described conceptually.
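As a toy illustration of synonym replacement, consider the sketch below. The `SYNONYMS` table is made up for this example; a real pipeline would draw replacements from a thesaurus such as WordNet or a curated domain vocabulary:

```python
import random

# Illustrative synonym table (not a real resource).
SYNONYMS = {
    "doctor": ["physician", "clinician"],
    "illness": ["disease", "condition"],
}

def synonym_replace(text, p=0.5, seed=None):
    """Replace each known word with a random synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))  # swap in a synonym
        else:
            out.append(word)                        # keep the original word
    return " ".join(out)
```

Back-translation works similarly at the sentence level: translate to a pivot language and back, then keep the paraphrase if it preserves meaning.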
3. Concrete example – Medical QA dataset for DeepSeek‑R1
3.1 Goal definition
Build a doctor-style QA dataset in the format {"Question": ..., "Complex-CoT": ..., "Response": ...}. The target model expects a Complex-CoT reasoning chain.
3.2 Data collection
Search ModelScope and other open medical repositories (e.g., Chinese Medical Journal website) for relevant texts.
3.3 Annotation
Use annotation platforms to create basic QA pairs, then employ a large language model to expand them into Complex‑CoT structures.
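One way to drive that expansion is to wrap each basic QA pair in a prompt asking the model for the reasoning chain. The prompt wording and the `build_cot_prompt` helper below are illustrative assumptions; the LLM call itself is omitted because it depends on your provider's API:

```python
# Illustrative prompt template for expanding a QA pair into a Complex-CoT chain.
COT_PROMPT = (
    "You are a medical expert. Given the question and answer below, write the\n"
    "step-by-step clinical reasoning (Complex-CoT) that connects them.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
)

def build_cot_prompt(question, answer):
    """Fill the template with one annotated QA pair."""
    return COT_PROMPT.format(question=question, answer=answer)
```

The model's reply is then stored in the "Complex-CoT" field of the JSON entry shown below.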
Example JSON entry (wrapped in code):
[
{
"Question": "医生,我患有强直性脊柱炎(SpA),还伴有其他疾病,如果用生物制剂治疗,我需要注意些什么?",
"Complex-CoT": "对于伴有合并症的SpA患者,生物制剂的选择需要综合考虑患者的整体健康状况、合并症的种类和严重程度,以及生物制剂的潜在副作用。不同种类的生物制剂作用机制不同,可能对合并症产生不同的影响。因此,需要在专业医生的指导下,谨慎选择适合患者的生物制剂,并密切监测治疗过程中的反应。",
"Response": "针对伴有合并症的SpA患者,选用不同种类的生物制剂应慎重。我们需要综合评估你的整体健康状况,包括合并症的种类和严重程度,同时考虑生物制剂的潜在副作用。不同生物制剂的作用机制不同,可能会对合并症产生不同的影响。因此,在选择生物制剂时,我们会根据你的具体情况,权衡利弊,选择最适合你的治疗方案,并在治疗过程中密切监测你的反应。"
},
...
]
3.4 Data cleaning
Validate JSON format with a Python script (code shown below).
import json

def validate_json_format(json_file_path):
    """Validate that a JSON file follows the ["Question", "Complex-CoT", "Response"] schema.

    Returns True if valid, otherwise False.
    """
    try:
        with open(json_file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
        if not isinstance(data, list):
            print("JSON data must be a list.")
            return False
        required_fields = ["Question", "Complex-CoT", "Response"]
        for item in data:
            missing = [f for f in required_fields if f not in item]
            if missing:
                print(f"Missing fields: {missing}")
                return False
            for field in required_fields:
                if not isinstance(item[field], str):
                    print(f"Field '{field}' must be a string.")
                    return False
        print("JSON format validation passed!")
        return True
    except json.JSONDecodeError:
        print("JSON file format error.")
        return False
    except FileNotFoundError:
        print(f"File not found: {json_file_path}")
        return False
    except Exception as e:
        print(f"Error occurred: {e}")
        return False

if __name__ == "__main__":
    json_file_path = "test.json"
    validate_json_format(json_file_path)

4. Dataset splitting
Standard split into training, validation, and test sets:
Training set: Model learning.
Validation set: Hyper-parameter tuning and model selection.
Test set: Final performance evaluation.
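A minimal splitting sketch follows; the 80/10/10 ratio and the `split_dataset` name are illustrative defaults, not a fixed rule:

```python
import random

def split_dataset(records, train=0.8, val=0.1, seed=42):
    """Shuffle and split into train/validation/test (test gets the remainder)."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    data = list(records)
    rng.shuffle(data)
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```

Shuffling before slicing matters: without it, records collected from the same source would cluster into one split.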
5. Practical tips and cautions
When adding noise or other augmentations, balance the proportion of synthetic data against original data to avoid over-reliance on noisy samples. Apply augmentation to the training set only; keep the validation and test sets representative of real, unaugmented data so that evaluation reflects deployment conditions. Combining noise addition with other augmentations (rotation, scaling, flipping) generally yields the best robustness.
Determine whether noise augmentation is needed:
Not needed: High-quality, large-scale dataset with low over-fitting risk.
Needed: Severe over-fitting, or specific tasks (e.g., image generation, speech recognition) that benefit from robustness.
6. Final remarks
The author thanks DeepSeek and Kimi for code assistance, data collection, and article polishing. This is the first part of a series; readers are encouraged to provide feedback and suggestions.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.