
Automating High‑Quality NL2SQL Data Synthesis with Intermediate Representations

This work tackles the difficulty of incorporating extensive domain knowledge into in‑domain NL2SQL tasks. It proposes a data‑synthesis method based on an intermediate representation that decouples knowledge compliance from SQL generation, enabling automated creation of high‑quality training data about 60 times faster than manual annotation, with synthesis accuracy above 97%.

DataFunSummit

In in‑domain NL2SQL tasks, the sheer volume of domain knowledge often becomes a bottleneck: the relevant knowledge is hard to retrieve precisely across diverse cases, and there is no guarantee that large language models (LLMs) will follow that knowledge when responding.

To address these issues, we fine‑tune LLMs using a novel data‑synthesis approach based on an intermediate representation. This representation decouples the model's ability to obey domain knowledge from its SQL generation capability, allowing automatic generation of large amounts of high‑quality data.

Experiments show that the proposed method produces data 60 times faster than manual annotation, achieves a synthesis accuracy exceeding 97%, and outperforms human experts by 7 percentage points on all evaluation metrics.

Tags: large language models, data synthesis, SQL generation, NL2SQL, domain knowledge