📄 Paper: arxiv.org/abs/2605.20087
🌐 Website: thoughttrace-project.github.io
🤗 Data: huggingface.co/datasets/SCAI-JHU/ThoughtTrace
💻 Code: github.com/thoughttrace-project/thoughttrace
🔍 Check more examples: thoughttrace-project.github.io/examples.html
ThoughtTrace opens a new modality for AI research:
→ user modeling beyond utterances
→ training signals from latent thoughts
→ evaluation grounded in subjective experience
What are users thinking during their interactions with LLMs? Introducing ThoughtTrace — the first dataset capturing what users think during real-world human-AI conversations. These thoughts improve user behavior prediction and model alignment, opening a new paradigm of user-centric LLM research.
Utility 2: Model alignment via DPO. Thought-guided rewrites on Arena-Hard beat:
- Base Qwen3.5-4B by +25.6%
- WildChat by +6.6%
- Message-guided rewrites by +4.5%
Thoughts give models actionable alignment signals by surfacing dissatisfaction that users never spell out.
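For readers unfamiliar with DPO, here is a minimal sketch of the standard per-pair DPO objective. It assumes the thought-guided rewrite is the "chosen" response and the original assistant response is the "rejected" one. That pairing is an assumption for illustration; the thread does not say how the preference pairs were built. The function name, argument names, and `beta` value are also hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. ASSUMPTION: "chosen" would be the
    thought-guided rewrite and "rejected" the original response.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Illustrative numbers only, not results from the paper.
loss_when_policy_prefers_chosen = dpo_loss(-1.0, -5.0, -2.0, -2.0)
loss_when_policy_prefers_rejected = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy shifts probability toward the chosen response relative to the reference model, which is how a thought-informed rewrite could become a training signal.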
Are thoughts just paraphrased messages? No. UMAP shows message↔reason and reaction↔next-message pairs have much larger semantic shifts than consecutive messages. Thoughts are a distinct, complementary signal — not redundant with transcripts.
Utility 1: Predicting the next user message.
- History-only: 21.6
- Thought-augmented: 30.6
→ +41.7% relative gain across GPT, Gemini, Opus.
User simulators get dramatically better when they model what users think, not only what they type.
Thoughts are diverse and stage-dependent. 7 reason types, 5 reaction types.
→ Task Motivation dominates early turns
→ Task Continuation takes over later
→ Explicit Affirmation steadily rises as conversations converge
Can frontier LLMs just infer the thought from context? GPT, Gemini, and Claude all struggle:
- Reasons: 2.93 / 5
- Reactions: 2.54 / 5
Latent thoughts carry information that no amount of context can recover. Explicit annotations matter.
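The +41.7% figure follows directly from the two scores quoted in the post. A quick check, with a hypothetical helper name:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# Scores quoted in the post: history-only vs. thought-augmented.
gain = relative_gain(21.6, 30.6)
print(f"{gain:.1f}%")  # prints 41.7%
```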
Conversational AI has reached billions of users, yet every dataset captures only what people say, never what they think. ThoughtTrace pairs each turn with the user's own latent thought:
🟦 reasons for sending a prompt
🟧 reactions to the assistant's response
ThoughtTrace is long-horizon and diverse. It has a median of 8 turns per conversation, while existing datasets like WildChat and LMSYS-Chat-1M skew shorter, at 2 turns per conversation. It spans 7 broad domains and 36 subtopics, with no single category dominating. Real users, real tasks, real depth.
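To make the pairing concrete, here is a rough sketch of how one annotated turn might be represented and rendered with or without the thoughts. The field names, the `[reason]` and `[reaction]` tags, and the example turns are all assumptions made up for illustration. They are not the released dataset schema or real records.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One user-assistant exchange plus the user's annotated thoughts (hypothetical schema)."""
    prompt: str    # what the user typed
    reason: str    # latent thought: why the user sent this prompt
    response: str  # what the assistant replied
    reaction: str  # latent thought: how the user reacted to the response

def render_history(turns: List[Turn], with_thoughts: bool) -> str:
    """Flatten a conversation into text, optionally interleaving the latent thoughts."""
    lines = []
    for t in turns:
        if with_thoughts:
            lines.append(f"[reason] {t.reason}")
        lines.append(f"User: {t.prompt}")
        lines.append(f"Assistant: {t.response}")
        if with_thoughts:
            lines.append(f"[reaction] {t.reaction}")
    return "\n".join(lines)

# Invented example turns, not taken from ThoughtTrace.
turns = [
    Turn(prompt="Summarize this report in three bullets.",
         reason="I need something short for a meeting.",
         response="Here are three bullets...",
         reaction="Too generic, it missed the budget numbers."),
]

history_only = render_history(turns, with_thoughts=False)
thought_augmented = render_history(turns, with_thoughts=True)
```

A next-message predictor could then be given `history_only` or `thought_augmented` as context, which is one way to read the history-only versus thought-augmented comparison in the thread.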
24d
Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-w...
arxiv.org
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
Chuanyang Jin