
Blogpost: bethgelab.github.io/delta-belief... Paper: alphaxiv.org/abs/intrinsi... Code: github.com/bethgelab/de... A massive thanks to my collaborators Ilze Amanda Auzina, Sergio Hernández Gutiérrez, Shashwat Goel, @bayesiankitten.bsky.social and Matthias Bethge (8/8) @bethgelab.bsky.social

The result: an agent that solves open-ended tasks: CIA 🕵️‍♀️ Curious Information-seeking Agent 🕵️‍♂️ 👉 CIA beats DeepSeek V3.2 on our evaluations (4/8)

This suggests a shift in how we train agents: Instead of external critics or verifiers, 👉 Let agents learn by tracking their own uncertainty reduction. A step toward agents that reason about what they don’t know. (7/8)

✳️ Benefits of ∆Belief-RL. ✔️ turn-level credit assignment ✔️ O(N) information per trajectory ✔️ learning even from failed episodes All while keeping training compute-efficient. (3/8)

How can agents learn in long, open-ended tasks where success is rare and rewards are sparse? 👀 🚨 Enter ∆Belief-RL: We show how to use an agent’s own belief updates as a dense reward for turn-level credit assignment. The result? Surprisingly strong generalization. (1/8) 🧵⬇️


👉 Learns information-seeking strategies that generalise to OOD tasks (6/8) Despite being trained solely on 20 Questions, the agent’s skills transfer to new OOD tasks, such as customer service and user personalisation 👥

👉 Continues to seek information beyond the training horizon The results suggest that ∆Belief rewards generalize to longer horizons; they teach general information-seeking strategies that continue to resolve uncertainty as more evidence becomes available. 🔎 (5/8)

💡 Key idea: 👉 Use the change in the agent’s belief about the correct answer as a dense intrinsic reward. If an action increases: log p(target | history) → reward it. We call this ∆Belief-RL. No critic. No process reward model. Just the agent judging its own progress. (2/8)
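
A rough sketch of the key idea in (2/8), not the released code: it assumes the reward for a turn is the difference in log p(target | history) before and after that turn. The function name `delta_belief_rewards` and the belief values are hypothetical, made up for illustration; how the agent's belief is actually elicited is described in the paper, not here.

```python
import math
from typing import List


def delta_belief_rewards(log_beliefs: List[float]) -> List[float]:
    """Per-turn rewards from a sequence of log-beliefs (illustrative sketch).

    log_beliefs[0] is log p(target | empty history), the prior before any turn.
    log_beliefs[t] is log p(target | history after turn t).
    Assumption: reward for turn t = log_beliefs[t] - log_beliefs[t - 1].
    """
    return [after - before for before, after in zip(log_beliefs, log_beliefs[1:])]


# Hypothetical beliefs in the correct answer over a 3-turn episode.
beliefs = [0.01, 0.05, 0.04, 0.40]
log_beliefs = [math.log(p) for p in beliefs]

rewards = delta_belief_rewards(log_beliefs)
for turn, reward in enumerate(rewards, start=1):
    print(f"turn {turn}: reward = {reward:+.3f}")
```

Under this assumption every turn gets its own signal (positive when belief in the target rises, negative when it falls), so an N-turn episode yields N rewards even if the final answer is wrong. The rewards also telescope: their sum equals the total change in log-belief from the first to the last turn.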