Inlay

Excited to release our latest paper on a new multi-turn RL objective that trains LLMs to *learn how to learn* about the user. Models trained with it adapt and personalize to novel users, whereas the multi-turn RLHF baseline fails to generalize effectively to them.

Personalization methods for LLMs often rely on extensive user history. We introduce Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO), which encourages the model to actively learn about the user within multi-turn dialogs. 📜 arxiv.org/abs/2504.03206 🌎 sites.google.com/cs.washingto...
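
For intuition, here is a rough sketch of what a curiosity-style intrinsic reward for user modeling could look like. It is an illustration only, not the paper's implementation: it assumes a user model that keeps a probability distribution over a small set of user types, and it rewards each turn by how much that belief in the true type improves. The names `belief_gain_reward` and `shaped_rewards`, the `weight` parameter, and the toy beliefs are all hypothetical.

```python
import math


def belief_gain_reward(prev_belief, new_belief, true_type):
    """Hypothetical intrinsic reward: the gain in log-probability that the
    user model assigns to the true user type after one more turn."""
    return math.log(new_belief[true_type]) - math.log(prev_belief[true_type])


def shaped_rewards(beliefs, true_type, task_rewards, weight=0.5):
    """Add the weighted intrinsic reward to each turn's task reward.

    beliefs[0] is the prior before any turn; beliefs[t + 1] is the belief
    after turn t, so len(beliefs) must equal len(task_rewards) + 1.
    """
    if len(beliefs) != len(task_rewards) + 1:
        raise ValueError("need one more belief than task rewards")
    return [
        r + weight * belief_gain_reward(beliefs[t], beliefs[t + 1], true_type)
        for t, r in enumerate(task_rewards)
    ]


# Toy dialog: two turns, two made-up user types.
beliefs = [
    {"novice": 0.5, "expert": 0.5},  # prior
    {"novice": 0.8, "expert": 0.2},  # after turn 0
    {"novice": 0.9, "expert": 0.1},  # after turn 1
]
rewards = shaped_rewards(beliefs, "novice", task_rewards=[0.0, 1.0])
```

In this sketch, a turn that reveals more about the user earns a larger bonus, and the per-turn gains telescope, so the total bonus depends only on the final belief versus the prior.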