Synthesizing Agentic Factual SFT/Mid‑train Data: Query Filtering, Trajectory Generation, and Tool Usage

The article outlines a practical pipeline for creating agentic factual SFT and mid‑train datasets, covering how to define training goals, filter and classify queries, label processing tags, format trajectory samples, differentiate SFT from mid‑train data, and avoid common pitfalls when generating evidence‑driven AI training data.

Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Synthesizing Agentic Factual SFT/Mid‑train Data: Query Filtering, Trajectory Generation, and Tool Usage

Training Goal

Agentic factual ability requires an observable, verifiable reasoning process: see the question → decide whether to search → identify needed evidence → retrieve evidence → assess sufficiency → detect conflicts → answer only with supported content.

What to Train

Traditional factual QA maps question → answer. Agentic factual data should map

question → retrieval action → observation → judgment → final answer

. Example:

Question: Who is the current CEO of Alibaba Group?

Action: search("Alibaba Group current CEO official") Observation: Alibaba’s official management page shows Wu Yongming as CEO.

Final: As of the query time, the official site lists Wu Yongming as CEO.

The model must learn that “current” is time‑sensitive, prioritize official sources, add a time boundary, and avoid answering from memory.

Query Filtering

Simple factual questions such as “What is the capital of China?” do not train agentic ability. Valuable queries require evidence evaluation, e.g., “Who is the current CEO of Alibaba?”, “Did Company X profit in 2023?”, “What is the Q4 revenue in this announcement?”

Query Classification and Tagging

Two‑layer annotation is recommended:

Problem type : determines the category such as time‑sensitive fact, document‑based QA, premise‑error detection, scope‑ambiguity, conflict handling, or insufficient evidence.

Processing tags : specify task category, evidence scope (open search vs. given document), need for search, need for authoritative sources, recommended evidence sources, and trajectory generation strategy.

Example annotation for “Did Company X profit in 2023?”:

{
  "任务类别": "口径歧义判断类",
  "证据范围": "开放检索",
  "是否需要检索": true,
  "是否需要权威来源": true,
  "推荐证据源": ["年报", "公司公告", "交易所公告"],
  "轨迹生成策略": "分别查净利润和调整后 EBITDA,最后分口径回答"
}

Trajectory Data Format

A minimal trajectory contains four fields:

query:用户问题
类别:属于哪类事实任务
证据:从哪里查到什么
response:最终要训练的求证轨迹和回答

Example:

{
  "query": "某公司 2023 年是否盈利?",
  "类别": "口径歧义判断类",
  "证据": [
    "E1: 年报显示归属于股东的净亏损为 12 亿元",
    "E2: 公告称调整后 EBITDA 盈利"
  ],
  "response": "先查年报净利润,再查调整后 EBITDA,最后回答:按净利润口径亏损,按调整后 EBITDA 口径盈利,不能简单说已经盈利。"
}

Mid‑train vs SFT Data

Mid‑train data focuses on capability training: decomposing claims, matching evidence, judging stance, handling conflicts, and recognizing erroneous premises. It can be highly structured and does not need to mimic real dialogue.

SFT data emphasizes behavior alignment: the model should act like a real assistant, deciding when to search, how to cite evidence, and providing bounded, polite answers.

Practical Synthesis Pipeline

Clean and deduplicate raw data.

Apply classification and processing tags.

Build evidence_pack (a collection of verified evidence snippets).

Generate trajectory samples using the evidence pack.

Score samples with a verifier.

Write qualified samples to SFT or Mid‑train datasets.

The evidence pack ensures observations are grounded in real evidence rather than model memory.

Common Pitfalls

Writing observations without real evidence (e.g., claiming “the official site shows…” without a source).

Focusing only on the final answer and ignoring the reasoning process.

Always invoking the web for every query, even when the question restricts to a given document.

Answering confidently when evidence is insufficient; the model should downgrade or refuse.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SFTagentic AIdata synthesistrajectory generationmid-trainquery filtering
Machine Learning Algorithms & Natural Language Processing
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.