How ThoughtTrace Captures Unspoken User Thoughts in Real-World LLM Interactions

The ThoughtTrace dataset pairs billions of real LLM conversations with users' self‑reported reasons and reactions, revealing hidden cognitive signals that boost next‑turn prediction by 41.7% and improve model alignment by over 25% compared to text‑only baselines.

Machine Heart
Machine Heart
Machine Heart
How ThoughtTrace Captures Unspoken User Thoughts in Real-World LLM Interactions

Background

Dialogue‑oriented AI systems now handle billions of interactions daily, yet most research only analyzes what users explicitly say. The gap between spoken prompts and the underlying mental state—motivation, expectations, and satisfaction—remains largely unexplored.

ThoughtTrace Dataset

JHU, MIT and Google Research introduce ThoughtTrace , the first large‑scale corpus that aligns multi‑turn human‑AI dialogues with users' self‑reported reasons (pre‑prompt motivations) and reactions (post‑reply feedback). The collection comprises:

1,058 participants

2,155 multi‑turn conversations

17,058 interaction turns

10,174 annotated thoughts

Coverage of 20 language models, including GPT‑5.4, Claude Opus 4.6, Gemini 3.1 Pro Preview, and several open‑source models

Each record stores timestamps, the full dialogue, and a reason (one of seven types) or reaction (one of five types) with free‑text content.

Data Collection Procedure

Participants were recruited via Prolific and followed a four‑step protocol:

Signed informed consent granting voluntary participation and withdrawal rights.

Completed a tutorial and comprehension quiz on the chat interface and thought annotation.

Engaged in two open‑ended tasks, annotating a reason for each user prompt and a reaction for each AI reply; annotations were hidden from the model.

Filled a post‑task questionnaire covering demographics, AI usage frequency, and expectations.

Data Characteristics

Dialogue layer shows:

Diverse user demographics (ages 18–65+, varied education and occupations).

Median of 8 turns per conversation, compared with 2 turns in WildChat and LMSYS‑Chat‑1M.

57.0% of user messages extend or deepen an existing task, far exceeding new requests (12.5%), retries (2.9%) and variants (2.3%).

Thought layer reveals:

Low semantic overlap: message‑to‑reason coverage 3.22/5, message‑to‑reaction coverage 2.00/5.

State‑of‑the‑art LLMs (GPT‑5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6) achieve average inference scores of 2.93 (reasons) and 2.54 (reactions), indicating difficulty in predicting thoughts from dialogue alone.

Reason distribution: Task Motivation & Goal 36.9%, Task Continuation 21.4%, Context Grounding & Constraints 13.1%, Content Expectation 11.5%, Task Reorientation 11.1%, Style Expectation 5.0%, Social & Others 1.0%.

Reaction distribution: Explicit Affirmation 72.2%, Content Relevance 11.9%, Presentation Style 6.4%, Scope Fit 6.1%, Partial Satisfaction 3.4%.

Dynamic shift: Task Motivation dominates early turns, while Task Continuation rises later; Explicit Affirmation grows from 67% early to 79% late, reflecting convergence toward satisfactory answers.

Experimental Evaluation

Experiment 1 – Predicting User Behavior : Models were asked to forecast the next user message under two conditions – using only dialogue history versus using history + thought annotations. Three frontier models were evaluated with an LLM judge scoring semantic similarity (0–100). Incorporating thoughts raised the average score from 21.6 to 30.6 (41.7% relative gain), with Claude Opus 4.6 improving by 14.2 points.

Experiment 2 – Improving Model Alignment : Reactions identified genuinely unsatisfied replies; the corresponding thoughts guided rewrite generation. Thought‑guided rewrites were paired with original messages and used for DPO training on Qwen3.5‑4B. Evaluation on Arena‑Hard showed:

+25.6% style‑control win rate over the base model.

+6.6% over the WildChat baseline.

Thought‑guided signals outperformed message‑guided signals by 4.5%.

Thoughts uncovered 1,000 unsatisfied instances versus 450 from messages alone (2.2× more supervision).

Conclusion

ThoughtTrace establishes user thoughts as a new data modality for human‑AI interaction research, capturing latent cognition that cannot be reconstructed from surface text. The dataset enables more accurate user‑behavior prediction, stronger model alignment, and opens three research avenues: (1) dynamic user modeling, (2) training assistants with thought‑level supervision, and (3) evaluation benchmarks that assess latent intent and subjective experience.

For full details, see the original paper “ThoughtTrace: Understanding User Thoughts in Real‑World LLM Interactions” (https://arxiv.org/abs/2605.20087) and the project homepage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LLMbehavior predictionmodel alignmentuser intenthuman-AI interactiondialogue datasetThoughtTrace
Machine Heart
Written by

Machine Heart

Professional AI media and industry service platform

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.