We leverage a user model to incorporate a curiosity reward into standard multi-turn RLHF. Rather than training an LLM only on the sparse end-of-conversation reward, we add a turn-level reward equal to the improvement in the user model's belief over the user type after each action. (2/9)
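
A minimal sketch of how such a turn-level reward could be computed, not the authors' implementation. It assumes the user model outputs a probability distribution over a discrete set of user types at every turn boundary, and that "improvement in belief" means the gain in probability assigned to the true user type. The function names, the `weight` parameter, and the choice to add the sparse reward on the final turn are all hypothetical.

```python
from typing import List, Sequence


def curiosity_rewards(beliefs: Sequence[Sequence[float]], true_type: int) -> List[float]:
    """Per-turn curiosity reward (assumed form): b_{t+1}(u*) - b_t(u*).

    beliefs[t] is the user model's distribution over user types before turn t;
    beliefs[-1] is the belief after the last turn, so T turns need T + 1 entries.
    """
    if len(beliefs) < 2:
        raise ValueError("need at least two belief snapshots (one turn)")
    return [
        beliefs[t + 1][true_type] - beliefs[t][true_type]
        for t in range(len(beliefs) - 1)
    ]


def shaped_rewards(
    beliefs: Sequence[Sequence[float]],
    true_type: int,
    final_reward: float,
    weight: float = 1.0,
) -> List[float]:
    """Combine the sparse end-of-conversation reward with the turn-level bonus."""
    rewards = [weight * r for r in curiosity_rewards(beliefs, true_type)]
    rewards[-1] += final_reward  # the sparse reward arrives only on the last turn
    return rewards


# Two turns, two user types; the true type is index 0.
beliefs = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
per_turn = shaped_rewards(beliefs, true_type=0, final_reward=1.0)
```

Under this assumed form the turn-level rewards telescope: their sum is the final belief in the true type minus the initial one, so the bonus rewards learning about the user without depending on how many turns that takes.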