//
sign in
Post
by @danabra.mov
PostEmbed
by @danabra.mov
Record
by @jimpick.com
Record
by @atsui.org
+ new component
Post
💡 Key idea: 👉 Use the change in the agent’s belief about the correct answer as a dense intrinsic reward. If an action increases: log p(target | history) → reward it. We call this ∆Belief-RL. No critic. No process reward model. Just the agent judging its own progress. (2/8)
4mo
Joschka Strüber @Tuebingen AI Center🇩🇪