Blogpost: bethgelab.github.io/delta-belief...
Paper: alphaxiv.org/abs/intrinsi...
Code: github.com/bethgelab/de...
A massive thanks to my collaborators Ilze Amanda Auzina, Sergio Hernández Gutiérrez, Shashwat Goel, @bayesiankitten.bsky.social and Matthias Bethge
(8/8)
@bethgelab.bsky.social
The result: CIA, an agent that solves open-ended tasks.
🕵️‍♀️ Curious Information-seeking Agent 🕵️‍♂️
👉 CIA beats DeepSeek V3.2 on our evaluations
(4/8)
This suggests a shift in how we train agents:
Instead of external critics or verifiers,
👉 Let agents learn by tracking their own uncertainty reduction.
A step toward agents that reason about what they don’t know.
(7/8)
✳️ Benefits of ∆Belief-RL:
✔️ turn-level credit assignment
✔️ O(N) information per trajectory
✔️ learning even from failed episodes
All while keeping training compute-efficient.
(3/8)
How can agents learn in long, open-ended tasks where success is rare and rewards are sparse? 👀
🚨 Enter ∆Belief-RL: We show how to use an agent’s own belief updates as a dense reward for turn-level credit assignment.
The result? Surprisingly strong generalization.
(1/8) 🧵⬇️
👉 Learns information-seeking strategies that generalise to OOD tasks
(6/8)
Despite being trained solely on 20 Questions, the agent’s skills transfer to new OOD tasks, such as customer service and user personalisation 👥
👉 Continues to seek information beyond the training horizon
The results suggest that ∆Belief rewards generalize to longer horizons; they teach general information-seeking strategies that continue to resolve uncertainty as more evidence becomes available. 🔎
(5/8)
💡 Key idea:
👉 Use the change in the agent’s belief about the correct answer as a dense intrinsic reward.
If an action increases log p(target | history) → reward it.
We call this ∆Belief-RL.
No critic. No process reward model. Just the agent judging its own progress.
(2/8)
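A minimal sketch of the key idea, not the paper's implementation: it assumes the reward for each turn is the difference in log p(target | history) before and after that turn. The function name `delta_belief_rewards` and the toy belief values are hypothetical, and how the belief itself is read out of the agent is left out here.

```python
import math
from typing import List, Sequence


def delta_belief_rewards(log_beliefs: Sequence[float]) -> List[float]:
    """Turn-level rewards from a sequence of log-beliefs.

    log_beliefs[0] is log p(target | history) before the first action;
    log_beliefs[t] is the same quantity after turn t.
    Assumption: reward_t = log_beliefs[t] - log_beliefs[t - 1].
    """
    if len(log_beliefs) < 2:
        return []
    return [after - before for before, after in zip(log_beliefs, log_beliefs[1:])]


# Toy episode (made-up numbers): belief in the correct answer after each turn.
beliefs = [0.01, 0.02, 0.02, 0.50]
log_beliefs = [math.log(p) for p in beliefs]

rewards = delta_belief_rewards(log_beliefs)
for turn, reward in enumerate(rewards, start=1):
    print(f"turn {turn}: reward = {reward:+.3f}")
```

In this sketch every turn gets its own reward, a turn that does not move the belief gets zero, and the rewards sum to the total change in log-belief over the episode, so an episode still yields a signal even if the final answer is never reached.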