
Comprehensive Guide to Building AI Datasets: From Source Collection to Data Augmentation and Validation

This guide walks readers through every stage of building high‑quality AI training datasets—from locating open‑source data and defining goals, through collection, annotation, cleaning, large‑scale processing, optional augmentation, and splitting, to validation—using a medical QA example for fine‑tuning DeepSeek‑R1.

Sohu Tech Products

This article provides a step‑by‑step tutorial for constructing high‑quality AI training datasets, especially for fine‑tuning large language models such as DeepSeek‑R1. It is organized as a series of numbered sections covering data source acquisition, dataset construction workflow, augmentation techniques, splitting, and validation.

1. Open‑source data sources

Typical repositories for downloading raw data include:

Kaggle: https://www.kaggle.com/

ModelScope: https://modelscope.cn/datasets

Hugging Face: https://huggingface.co

Baidu Paddle (飞桨): https://aistudio.baidu.com/datasetoverview

2. Dataset construction workflow (high‑level steps)

2.1 Define the goal

Define the problem: Clarify the task (e.g., medical‑domain QA) and the required data format.

Determine data type: Text, image, audio, etc.

Note: Ensure the final dataset matches the format required by the downstream model.

2.2 Data collection

Internal data: Extract from existing databases or logs (if available).

External data: Use public datasets, web scraping, or AI‑assisted extraction.

Data generation: Synthesize data when necessary (optional).
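As a sketch of the optional data-generation step, synthetic question variants can be produced by filling templates with domain terms. The templates, conditions, and drug names below are illustrative placeholders, not real dataset content:

```python
import itertools

# Hypothetical templates and fillers for synthesizing medical QA questions.
TEMPLATES = [
    "What should I watch for when treating {condition} with {drug}?",
    "Is {drug} appropriate for a patient with {condition}?",
]
CONDITIONS = ["ankylosing spondylitis", "rheumatoid arthritis"]
DRUGS = ["a TNF inhibitor", "an IL-17 inhibitor"]

def generate_questions():
    """Yield one synthetic question per (template, condition, drug) combination."""
    for template, condition, drug in itertools.product(TEMPLATES, CONDITIONS, DRUGS):
        yield template.format(condition=condition, drug=drug)

questions = list(generate_questions())
print(len(questions))  # 2 templates x 2 conditions x 2 drugs = 8
```

Synthetic items produced this way still need expert review before entering the training set.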

2.3 Data annotation

Manual annotation: Human labeling.

Automatic annotation: Tools or pre‑trained models.
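A minimal sketch of automatic annotation, assuming a simple keyword-to-label rule table; the department labels and keywords here are hypothetical examples, not a production rule set:

```python
# Hypothetical rule table: map coarse department labels to trigger keywords.
RULES = {
    "rheumatology": ["spondylitis", "arthritis", "joint"],
    "cardiology": ["heart", "blood pressure", "arrhythmia"],
}

def auto_label(question: str) -> str:
    """Return the first department whose keywords match, else flag for manual review."""
    text = question.lower()
    for label, keywords in RULES.items():
        if any(kw in text for kw in keywords):
            return label
    return "unlabeled"  # fall back to manual annotation

print(auto_label("I have ankylosing spondylitis; what should I know?"))  # rheumatology
```

Rule-based labels are cheap but noisy, so they are best combined with the human spot-checks described in the processing pipeline below.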

2.4 Data cleaning

Handle missing values (fill or drop).

Deduplicate records.

Unify formats.

Detect and treat outliers.
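The cleaning steps above (handling missing values, deduplication, format unification) can be sketched in plain Python; the record schema mirrors the QA format used later in this guide:

```python
def clean_records(records):
    """Drop records missing a question, trim whitespace, and deduplicate by question text."""
    seen = set()
    cleaned = []
    for rec in records:
        question = (rec.get("Question") or "").strip()  # unify format: strip whitespace
        answer = (rec.get("Response") or "").strip()
        if not question:        # missing value: drop the record
            continue
        if question in seen:    # duplicate question: drop the record
            continue
        seen.add(question)
        cleaned.append({"Question": question, "Response": answer})
    return cleaned

raw = [
    {"Question": " What is SpA? ", "Response": "Spondyloarthritis."},
    {"Question": "What is SpA?", "Response": "duplicate"},
    {"Question": None, "Response": "orphan answer"},
]
print(len(clean_records(raw)))  # 1
```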

2.5 Large‑scale data processing

Typical pipeline:

Initial cleaning on big‑data platforms (Hive, HBase, Flink, MySQL, etc.).

AI‑driven correction of typos, grammar, and logical errors.

Human final review on a random sample.
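The final human-review step can be sketched as drawing a reproducible random sample; the 5% rate and the seed below are arbitrary example values:

```python
import random

def sample_for_review(records, rate=0.05, seed=42):
    """Draw a reproducible random sample (at least one record) for human spot-checking."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * rate))
    return rng.sample(records, k)

records = [{"id": i} for i in range(1000)]
print(len(sample_for_review(records)))  # 50
```

Fixing the seed makes the review sample reproducible, so different reviewers inspect the same records.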

2.6 Data augmentation (optional)

Image augmentation examples (rotation, cropping, brightness, noise) are illustrated with Albumentations code:

import albumentations as A

# `image` is expected to be a NumPy array (H x W x C), e.g. loaded with cv2.imread
transform = A.RandomRotate90(p=0.5)  # random 90° rotation, applied with 50% probability
augmented_image = transform(image=image)['image']

Random cropping:

transform = A.RandomCrop(width=400, height=400, p=0.3)
augmented_image = transform(image=image)['image']

Combined brightness and blur:

transform = A.Compose([
    A.RandomBrightnessContrast(p=0.3),
    A.GaussianBlur(blur_limit=3, p=0.2)
])
augmented_image = transform(image=image)['image']

Text augmentation (synonym replacement, back‑translation) and audio augmentation (speed change, noise addition) are described conceptually.
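A minimal sketch of the synonym-replacement technique mentioned above, assuming a toy synonym table; a real pipeline would use a proper thesaurus or a language model instead of this hypothetical dictionary:

```python
import random

# Toy synonym table for illustration only.
SYNONYMS = {"doctor": ["physician", "clinician"], "medicine": ["medication", "drug"]}

def synonym_replace(sentence: str, seed: int = 0) -> str:
    """Replace every word that has an entry in SYNONYMS with a randomly chosen synonym."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(SYNONYMS[word]) if word in SYNONYMS else word
        for word in sentence.split()
    )

print(synonym_replace("ask your doctor before taking medicine"))
```

Back-translation works similarly at the sentence level: translate to a pivot language and back, then keep outputs that preserve the original meaning.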

3. Concrete example – Medical QA dataset for DeepSeek‑R1

3.1 Goal definition

Build a doctor‑style QA dataset in the format {"Question": ..., "Complex‑CoT": ..., "Response": ...} . The target model expects a Complex‑CoT reasoning chain.

3.2 Data collection

Search ModelScope and other open medical repositories (e.g., Chinese Medical Journal website) for relevant texts.

3.3 Annotation

Use annotation platforms to create basic QA pairs, then employ a large language model to expand them into Complex‑CoT structures.

Example JSON entry:

[
    {
        "Question": "Doctor, I have ankylosing spondylitis (SpA) along with other conditions. If I am treated with biologics, what should I watch out for?",
        "Complex-CoT": "For SpA patients with comorbidities, the choice of biologic must take into account the patient's overall health, the type and severity of the comorbidities, and the potential side effects of the biologic. Different classes of biologics have different mechanisms of action and may affect comorbidities differently. Therefore, the biologic should be chosen carefully under the guidance of a specialist, with close monitoring of the patient's response during treatment.",
        "Response": "For SpA patients with comorbidities, different classes of biologics should be chosen with caution. We need to comprehensively assess your overall health, including the type and severity of your comorbidities, while also considering the potential side effects of the biologics. Different biologics have different mechanisms of action and may affect comorbidities differently. Therefore, when selecting a biologic, we will weigh the pros and cons based on your specific situation, choose the treatment plan that suits you best, and closely monitor your response during treatment."
    },
    ...
]
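The LLM-assisted expansion described above can be sketched as a prompt builder; the prompt wording below is a hypothetical example, not the authors' actual prompt:

```python
# Hypothetical prompt template for asking an LLM to produce the Complex-CoT field
# from a basic QA pair.
PROMPT_TEMPLATE = (
    "You are a physician. Given the question and answer below, write the "
    "step-by-step clinical reasoning (Complex-CoT) that connects them.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reasoning:"
)

def build_cot_prompt(question: str, answer: str) -> str:
    """Fill the template with one QA pair."""
    return PROMPT_TEMPLATE.format(question=question, answer=answer)

prompt = build_cot_prompt("What is SpA?", "Spondyloarthritis, a group of inflammatory diseases.")
print(prompt.endswith("Reasoning:"))  # True
```

The model's reasoning output is then stored as the "Complex-CoT" value alongside the original question and answer.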

3.4 Data cleaning

Validate JSON format with a Python script (code shown below).

import json

def validate_json_format(json_file_path):
    """Validate that a JSON file follows the ["Question", "Complex-CoT", "Response"] schema.
    Returns True if valid, otherwise False.
    """
    try:
        with open(json_file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
        if not isinstance(data, list):
            print("JSON data must be a list.")
            return False
        required_fields = ["Question", "Complex-CoT", "Response"]
        for item in data:
            missing = [field for field in required_fields if field not in item]
            if missing:
                print(f"Missing fields: {missing}")
                return False
            for field in required_fields:
                if not isinstance(item[field], str):
                    print(f"Field '{field}' must be a string.")
                    return False
        print("JSON format validation passed!")
        return True
    except json.JSONDecodeError:
        print("JSON file format error.")
        return False
    except FileNotFoundError:
        print(f"File not found: {json_file_path}")
        return False
    except Exception as e:
        print(f"Error occurred: {e}")
        return False

if __name__ == "__main__":
    json_file_path = "test.json"
    validate_json_format(json_file_path)

4. Dataset splitting

Standard split into training, validation, and test sets:

Training set : Model learning.

Validation set : Hyper‑parameter tuning and model selection.

Test set : Final performance evaluation.
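The three-way split can be sketched as a shuffle followed by slicing; the 80/10/10 ratios below are a common convention, not a requirement:

```python
import random

def split_dataset(records, train=0.8, val=0.1, seed=42):
    """Shuffle, then slice into train/validation/test (test gets the remainder)."""
    data = list(records)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

records = list(range(100))
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before slicing prevents any ordering in the source data (e.g., by topic or date) from leaking into the split.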

5. Practical tips and cautions

When adding noise or other augmentations, balance the proportion of synthetic data against original data to avoid over‑reliance on noisy samples. Apply augmentation to the training set only; keep the validation and test sets as unmodified real data so that evaluation reflects real‑world performance. Combine noise addition with other augmentations (rotation, scaling, flipping) for greater robustness.

Determine whether noise augmentation is needed:

Not needed : High‑quality, large‑scale dataset with low over‑fitting risk.

Needed : Severe over‑fitting or specific tasks (e.g., image generation, speech recognition) that benefit from robustness.
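As a sketch of the noise-addition technique, zero-mean Gaussian noise can be added to a normalized 1-D signal (e.g., audio samples); the sigma value here is an arbitrary example:

```python
import random

def add_gaussian_noise(samples, sigma=0.01, seed=0):
    """Add zero-mean Gaussian noise to each sample of a 1-D signal."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in samples]

signal = [0.0, 0.5, -0.5, 1.0]
noisy = add_gaussian_noise(signal)
print(len(noisy) == len(signal))  # True
```

Sigma controls the perturbation strength; too large a value destroys the signal rather than regularizing the model.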

6. Final remarks

The author thanks DeepSeek and Kimi for code assistance, data collection, and article polishing. This is the first part of a series; readers are encouraged to provide feedback and suggestions.

Tags: data augmentation, Python, data cleaning, dataset construction, AI fine‑tuning, JSON validation
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
