
Easy DataSet: An Open‑Source Tool for Building Domain‑Specific Datasets and Fine‑Tuning Large Language Models

The article introduces Easy DataSet, an open‑source tool that streamlines the creation of domain‑specific datasets by aggregating public data sources, chunking Markdown documents, generating and managing QA pairs with configurable LLM endpoints, and exporting them in common formats, while outlining its architecture and future roadmap.

Sohu Tech Products

Introduction

This article is the first practical chapter of the series "How to Fine‑Tune DeepSeek for Specific Domains". The author, ConardLi, will guide readers through three learning goals: how to find public datasets, how to use the Easy DataSet tool to batch‑create domain datasets, and how the core design of Easy DataSet works so that readers can implement similar tools themselves.

1. Getting Public Datasets

When you only need to improve a model’s capability in a specific area, you often do not have to build a dataset from scratch because many free, public datasets are available online. The following platforms are commonly used:

HuggingFace : A community platform for NLP, speech, and multimodal datasets. Provides the Python datasets library for direct loading, version control, and preprocessing scripts.

Kaggle : A data‑science platform that hosts a large variety of datasets and competitions. Supports API‑based bulk download.

Google Dataset Search : A search engine that aggregates datasets from many repositories such as Kaggle, GitHub and HuggingFace.

awesome‑public‑datasets : A GitHub curated list of high‑quality public datasets across many domains.

OpenDataLab : The largest Chinese‑language open‑dataset platform, offering a CLI and a Python SDK for download.

ModelScope : Alibaba’s AI model and dataset hub, similar to HuggingFace but focused on Chinese AI applications.

A comparison table summarises the suitable fields, data scale, language focus and special features of each platform.

Data Licensing

When using open datasets, always check the license (e.g., CC BY-NC 4.0) to ensure the data can be used legally, especially for commercial purposes.

2. Building Domain‑Specific Datasets from Literature

Many users try to let AI generate datasets directly from documents, but they encounter problems such as answer length limits, repeated or low‑quality QA pairs, and lack of domain‑specific labeling. The core challenges identified are:

Unclear workflow – most people still create data manually.

Directly feeding whole documents to LLMs leads to truncated or low‑quality QA.

Context length limits cause repeated questions after batch generation.

Existing datasets need batch management, annotation and validation.

Domain‑specific tagging is often missing.

Generating chain‑of‑thought (COT) for fine‑tuning is difficult.

Format conversion between dataset schemas is not straightforward.

To address these issues, the author created Easy DataSet, an open‑source project that provides a complete pipeline from document ingestion to dataset export.

2.1 Easy DataSet Overview

The project is hosted at https://github.com/ConardLi/easy-dataset. It currently runs locally (no SaaS version yet) and offers two launch methods:

NPM launch : Suitable for developers who want to modify the source code.

Docker launch : Provides a ready‑to‑run image on Docker Hub.

Both methods store data locally, so no data is uploaded to remote servers.

2.2 Core Features

Dataset Square : A unified search box that queries multiple public‑dataset platforms simultaneously.

Project : The smallest work unit; each project corresponds to a single literature source and its generated QA pairs.

Model Configuration : Users can add LLM endpoints (Ollama, OpenAI‑compatible APIs, etc.) by providing model name, endpoint URL and API key.

Playground : An online test page to verify model connectivity and compare up to three models side‑by‑side.

Document Processing : Currently accepts Markdown files only. Users can convert PDFs or Word documents to Markdown using tools such as MinerU.

Text Splitting : Implements an enhanced recursive splitter that respects chapter headings, enforces minimum/maximum chunk lengths, and adds chunkOverlap to avoid cutting off important context.

Question Generation : Generates one question per ~240 characters by default; can be batch‑generated for all chunks.

Question Management : View, edit, delete low‑quality questions; visualize them in a domain‑tree view.

Answer Generation : Generates answers (optionally with COT) using a selected model; supports batch generation and manual editing.

Dataset Management & Export : Browse all generated datasets, edit individual entries, and export in JSON or JSONL formats with Alpaca, ShareGPT or custom field mappings. Options include system prompts, export‑only‑confirmed data, and inclusion of COT.
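To make the export options above concrete, here is a minimal sketch of mapping an internal QA record to the Alpaca and ShareGPT schemas. The record shape ({ question, answer, cot }) and the idea of embedding COT inside the output are illustrative assumptions, not Easy DataSet's actual field names.

```javascript
// Sketch: map one QA record to the two common export schemas.
// Record shape { question, answer, cot } is an assumption for illustration.

function toAlpaca(record, systemPrompt = '') {
  return {
    instruction: record.question,
    input: '',
    // If chain-of-thought is present, prepend it to the output (one possible
    // convention; the real tool may encode COT differently).
    output: record.cot ? `<think>${record.cot}</think>\n${record.answer}` : record.answer,
    system: systemPrompt,
  };
}

function toShareGPT(record, systemPrompt = '') {
  const conversations = [];
  if (systemPrompt) conversations.push({ from: 'system', value: systemPrompt });
  conversations.push({ from: 'human', value: record.question });
  conversations.push({ from: 'gpt', value: record.answer });
  return { conversations };
}
```

Writing one converted record per line then yields the JSONL export; wrapping the array in a single file yields the JSON export.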

2.3 Technical Architecture

The system consists of three main modules:

Model Management : A generic LLM wrapper that stores endpoint URL, model name and API key, and calls any OpenAI‑compatible API.

Document Processing : Handles format conversion, intelligent chunking, and outline extraction.

Dataset Construction : Calls the LLM to generate labels, questions, and answers for each chunk.
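The construction step can be sketched as a simple per-chunk loop, with the actual LLM calls abstracted behind caller-supplied functions. The function names (genQuestions, genAnswer) and record shape are hypothetical, chosen only to show the control flow.

```javascript
// Sketch of the per-chunk dataset-construction loop.
// genQuestions(chunk) -> string[] and genAnswer(chunk, question) -> string
// are placeholders for the real LLM calls.

async function buildDataset(chunks, { genQuestions, genAnswer }) {
  const records = [];
  for (const chunk of chunks) {
    const questions = await genQuestions(chunk);
    for (const q of questions) {
      records.push({
        source: chunk.slice(0, 40), // keep a short provenance snippet
        question: q,
        answer: await genAnswer(chunk, q),
      });
    }
  }
  return records;
}
```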

2.3.1 Model Encapsulation

All supported providers follow the OpenAI /chat/completions endpoint. The wrapper only needs three parameters: API prefix, API key (Bearer token) and model identifier. Example configuration screens show how to add Ollama, DeepSeek, or any custom provider.
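Because every provider follows the same convention, the wrapper reduces to a single HTTP call. A minimal sketch, assuming only the three parameters the article names (API prefix, API key, model identifier) and the standard OpenAI chat/completions request shape:

```javascript
// Minimal provider-agnostic chat call against any OpenAI-compatible endpoint.

async function chatCompletion({ apiBase, apiKey, model }, messages) {
  const res = await fetch(`${apiBase}/chat/completions`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  });
  if (!res.ok) throw new Error(`LLM call failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Usage (hypothetical local Ollama endpoint):
// chatCompletion(
//   { apiBase: 'http://localhost:11434/v1', apiKey: 'ollama', model: 'deepseek-r1' },
//   [{ role: 'user', content: 'Hello' }]
// ).then(console.log);
```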

2.3.2 Text Chunking

Two strategies are discussed:

Fixed‑size character chunks – simple but may break semantic boundaries.

Recursive character splitting – splits by larger delimiters first (periods, then commas) and respects chunkOverlap to keep context continuity.

The author also implemented a custom splitter (lib/split-markdown) that respects Markdown headings and enforces configurable min/max chunk lengths.
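The recursive strategy with chunkOverlap can be sketched as follows. This simplified version shows only the recursion and overlap ideas; the real lib/split-markdown additionally honors Markdown headings, and the separator list and default sizes here are illustrative.

```javascript
// Sketch: recursive character splitting with overlap.
// Split on the largest separator present; merge parts up to chunkSize;
// carry a tail of each emitted chunk forward as overlap; recurse with
// smaller separators on any chunk that is still too large.

function recursiveSplit(text, { chunkSize = 500, chunkOverlap = 50 } = {},
                        separators = ['\n\n', '\n', '。', '.', ' ']) {
  if (text.length <= chunkSize) return [text];
  const sep = separators.find((s) => text.includes(s)) ?? '';
  const parts = sep ? text.split(sep) : [...text]; // fall back to characters
  const chunks = [];
  let current = '';
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length > chunkSize && current) {
      chunks.push(current);
      // Overlap: start the next chunk with the tail of the previous one.
      current = current.slice(-chunkOverlap) + sep + part;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  // Recurse into oversized chunks with the remaining (finer) separators.
  return chunks.flatMap((c) =>
    c.length > chunkSize
      ? recursiveSplit(c, { chunkSize, chunkOverlap }, separators.slice(1))
      : [c]
  );
}
```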

2.3.3 Prompt Engineering

Prompt templates follow a structured format (Role, Skill, Goals, OutputFormat, Variables, Workflow, Constraints, Examples). An example for the "addLabel" task is shown below:

module.exports = function getAddLabelPrompt(label, question) {
  return `
# Role: Label Matching Expert
- Description: You are a label‑matching expert ...
## Skill: ...
## Goals: ...
## OutputFormat: ...
## Label Array:
${label}
## Question Array:
${question}
## Workflow: ...
## Output Example:
\`\`\`json
[{"question":"Why did XSS ... after 2003","label":"2.2 XSS Attacks"}]
\`\`\`
`;
};

Similar templates exist for label extraction, question generation, answer generation, and answer refinement.

2.4 Future Roadmap

Support additional file formats (PDF, Excel, TXT, Word).

Integrate RLHF‑style quality annotation.

Generate synthetic fine‑tuning data without source literature.

One‑click upload to HuggingFace and other platforms.

Online SaaS version for users who do not want local deployment.

If the repository receives enough stars, the author will continue to maintain and expand the project. Contributions via pull requests are welcome.

Conclusion

Readers are encouraged to star the GitHub repository, join the author’s AI community on WeChat, and provide feedback on desired features.

Tags: AI, prompt engineering, data management, dataset construction, LLM fine-tuning, open-source tool
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. Sohu, a leading Chinese internet brand offering media, video, search, and gaming services to over 700 million users, continuously drives tech innovation and practice. We'll share practical insights and tech news here.
