Blogpost: bethgelab.github.io/delta-belief... Paper: alphaxiv.org/abs/intrinsi... Code: github.com/bethgelab/de... A massive thanks to my collaborators Ilze Amanda Auzina, Sergio Hernández Gutiérrez, Shashwat Goel, @bayesiankitten.bsky.social and Matthias Bethge (8/8) @bethgelab.bsky.social
The result is an agent that solves open-ended tasks: CIA 🕵️‍♀️ Curious Information-seeking Agent 🕵️‍♂️ 👉 CIA beats DeepSeek v3.2 on our evaluations (4/8)
This suggests a shift in how we train agents: Instead of external critics or verifiers, 👉 Let agents learn by tracking their own uncertainty reduction. A step toward agents that reason about what they don’t know. (7/8)
✳️ Benefits of ∆Belief-RL. ✔️ turn-level credit assignment ✔️ O(N) information per trajectory ✔️ learning even from failed episodes All while keeping training compute-efficient. (3/8)
How can agents learn in long, open-ended tasks where success is rare and rewards are sparse? 👀 🚨 Enter ∆Belief-RL: We show how to use the agent’s own belief updates as a dense reward for turn-level credit assignment. The result? Surprisingly strong generalization. (1/8) 🧵⬇️
4mo
👉 Learns information-seeking strategies that generalise to OOD (6/8) Despite being trained solely on 20 Questions, the agent’s skills transfer to new OOD tasks, such as customer service and user personalisation 👥
👉 Continues to seek information beyond the training horizon The results suggest that ∆Belief rewards generalize to longer horizons; they teach general information-seeking strategies that continue to resolve uncertainty as more evidence becomes available. 🔎 (5/8)
💡 Key idea: 👉 Use the change in the agent’s belief about the correct answer as a dense intrinsic reward. If an action increases: log p(target | history) → reward it. We call this ∆Belief-RL. No critic. No process reward model. Just the agent judging its own progress. (2/8)
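To make the key idea concrete, here is a minimal sketch of the per-turn reward, not the authors' implementation. It assumes the values log p(target | history) are already available for every turn; the thread does not say how the belief is computed, so that step is left out. The function name `delta_belief_rewards` and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def delta_belief_rewards(log_beliefs: Sequence[float]) -> List[float]:
    """Turn-level rewards from consecutive belief values.

    log_beliefs[0] is log p(target | history) before the first turn, and
    log_beliefs[t] is the same quantity after turn t (assumed to be given).
    The reward for turn t is the change in log-belief caused by that turn.
    """
    if len(log_beliefs) < 2:
        return []
    return [after - before for before, after in zip(log_beliefs, log_beliefs[1:])]


# Toy 20 Questions episode (illustrative numbers only): the belief is uniform
# over the remaining candidates, and each question halves the candidate set.
candidates_left = [8, 4, 2, 1]
log_beliefs = [math.log(1.0 / n) for n in candidates_left]

rewards = delta_belief_rewards(log_beliefs)
for turn, reward in enumerate(rewards, start=1):
    print(f"turn {turn}: reward = {reward:.3f}")
```

In this sketch an episode with N turns yields N reward values, and their sum telescopes to the total change in log-belief between the start and the end of the episode. A turn that lowers the belief in the target gets a negative reward, and the rewards are defined even when the episode ends without a correct guess.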