Data Management in Large Language Model Training: Overview, Pre‑training, SFT, and Future Challenges

This article surveys data management for large language model training, covering an overview, pre‑training data composition, scaling‑law‑driven quantity control, quality filtering, deduplication, harmful‑content removal, instruction fine‑tuning strategies, dynamic data selection, and emerging research challenges such as bias mitigation, multimodal data handling, and synthetic‑data filtering.

DataFunSummit

Sep 1, 2024

The presentation is organized into four main chapters: (1) an overview of data management in large language model (LLM) training, (2) data management during the pre‑training phase, (3) data management in the supervised fine‑tuning (SFT) phase, and (4) challenges and future directions.

Pre‑training data management covers three areas: domain composition (web, Wikipedia, code, dialogue, academic, math, etc.), data quantity (guided by the scaling laws of Kaplan et al. and Chinchilla), and data quality (filtering, deduplication, harmful‑content removal, diversity, and data age). DSIR, DoReMi, DoGE, and Data Mixing Laws are discussed as methods for optimizing domain ratios, while N‑gram hashing, neural‑network‑based deduplication, and semantic clustering are presented as deduplication techniques.
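To make the N‑gram hashing idea concrete, here is a minimal sketch of near‑duplicate removal. It is an illustration, not the method from the talk: the function names, the word‑level trigram choice, and the 0.8 Jaccard threshold are all assumptions, and a production pipeline would typically use an approximate scheme such as MinHash with locality‑sensitive hashing instead of pairwise comparison.

```python
import hashlib
from typing import List, Set


def ngram_hashes(text: str, n: int = 3) -> Set[int]:
    """Hash every word-level n-gram of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        grams = [" ".join(tokens)]  # short text: treat it as a single gram
    else:
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Keep 8 bytes of the MD5 digest as a compact integer fingerprint.
    return {
        int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:8], "big")
        for g in grams
    }


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two hash sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: List[str], n: int = 3, threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is below `threshold` against every kept one.

    Hypothetical helper: O(k) comparisons per document, fine for a demo only.
    """
    kept: List[str] = []
    kept_hashes: List[Set[int]] = []
    for doc in docs:
        hashes = ngram_hashes(doc, n)
        if all(jaccard(hashes, prev) < threshold for prev in kept_hashes):
            kept.append(doc)
            kept_hashes.append(hashes)
    return kept


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog today",
        "large language models need clean data",
    ]
    for doc in deduplicate(corpus):
        print(doc)
```

In this toy corpus the second document shares seven of eight trigrams with the first, so it falls above the assumed threshold and is dropped. Lowering the threshold removes more aggressively at the cost of discarding legitimately similar documents.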

SFT data management emphasizes multi‑task instruction fine‑tuning and three properties of instruction data: quality (classifier‑based, rule‑based, and metric‑based filtering), diversity (tag‑based metrics, ROUGE‑L, embedding distance), and complexity (label count, Tree‑Instruct, Evol‑Instruct). Dynamic data learning strategies such as early stopping, data pruning, task‑wise active search, and curriculum learning (easy‑to‑hard ordering) are also highlighted.
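As one way to picture a ROUGE‑L‑based diversity check, the sketch below greedily keeps an instruction only if its ROUGE‑L F1 score against every already‑kept instruction stays under a cutoff. The function names, the whitespace tokenization, and the 0.7 cutoff are assumptions chosen for illustration, not values taken from the talk.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (a simplified variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def select_diverse(instructions: List[str], max_similarity: float = 0.7) -> List[str]:
    """Greedy filter: keep an instruction if it is dissimilar to all kept ones.

    `max_similarity` is an assumed cutoff; tune it on real data.
    """
    kept: List[str] = []
    for inst in instructions:
        if all(rouge_l_f1(inst, k) < max_similarity for k in kept):
            kept.append(inst)
    return kept


if __name__ == "__main__":
    pool = [
        "Translate the sentence into French.",
        "Translate the sentence into German.",
        "Write a poem about autumn.",
    ]
    for inst in select_diverse(pool):
        print(inst)
```

The two translation prompts share four of five tokens in order, so the second is filtered out under the assumed cutoff. The result depends on input order because the filter is greedy; an embedding‑distance variant would swap `rouge_l_f1` for a vector similarity.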

Future challenges include building a universal data‑management framework, bias and harmful content mitigation, multimodal instruction data handling, self‑exploratory data management for large‑scale interaction data, efficient synthetic‑data filtering, fine‑grained data pipeline design, and conflict data isolation.

References: a recent survey (arXiv:2312.01700) and an accompanying GitHub repository (github.com/ZigeW/data_management_LLM).

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
