Intrinsic motivation is well studied in RL, but applying it to LLMs is non-trivial. The policy and environment models engage in a multi-turn dialog, and a reward model gives an extrinsic reward. On each turn, a user model predicts the belief and computes an intrinsic reward. (3/9)
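
A rough sketch of that loop, to make the data flow concrete. Everything here is an assumption for illustration: the function name `rollout`, the callable signatures, giving the extrinsic reward once at the end of the dialog, and having the user model return a `(belief, intrinsic_reward)` pair. The post does not specify how the belief is represented or how the intrinsic reward is derived from it, so the sketch leaves both to the caller.

```python
from typing import Any, Callable, List, Tuple

Dialog = List[Tuple[str, str]]  # (speaker, utterance) pairs


def rollout(
    policy: Callable[[Dialog], str],
    environment: Callable[[Dialog], str],
    user_model: Callable[[Dialog], Tuple[Any, float]],
    reward_model: Callable[[Dialog], float],
    num_turns: int,
) -> Tuple[Dialog, List[float], float]:
    """Run one multi-turn dialog and collect both kinds of reward.

    All four callables are hypothetical stand-ins for the models in the post.
    """
    dialog: Dialog = []
    intrinsic_rewards: List[float] = []

    for _ in range(num_turns):
        # The policy model speaks, then the environment model replies.
        dialog.append(("policy", policy(dialog)))
        dialog.append(("environment", environment(dialog)))

        # On each turn the user model predicts a belief and an intrinsic reward.
        # Assumption: it returns both; the belief itself is unused here.
        _belief, r_int = user_model(dialog)
        intrinsic_rewards.append(r_int)

    # Assumption: the reward model scores the finished dialog once.
    extrinsic_reward = reward_model(dialog)
    return dialog, intrinsic_rewards, extrinsic_reward
```

With stub callables, `rollout(..., num_turns=2)` would return a four-utterance dialog, two intrinsic rewards, and one extrinsic reward. How the two signals are combined for training is not stated in the post, so the sketch returns them separately.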