💡 Key idea:
👉 Use the change in the agent’s belief about the correct answer as a dense intrinsic reward.
If an action increases: log p(target | history) → reward it.
We call this ∆Belief-RL.
No critic. No process reward model. Just the agent judging its own progress.
(2/8)