Post
We leverage a user model to incorporate a curiosity reward into standard multi-turn RLHF. Rather than training an LLM only with the sparse end-of-conversation reward, we add a turn-level reward given by the improvement in its belief over the user type after each action. (2/9)
11mo
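A minimal sketch of how a turn-level curiosity reward could be combined with the sparse end-of-conversation reward. Everything here is an assumption for illustration, not the authors' implementation: the user model is reduced to per-turn likelihoods over a small set of discrete user types, "improvement in belief" is taken to mean the increase in probability assigned to the true user type, and the names `bayes_update`, `curiosity_reward`, `shaped_rewards`, and `weight` are hypothetical.

```python
def bayes_update(belief, likelihoods):
    """Posterior over user types after one turn.

    belief:      prior probability of each user type
    likelihoods: probability of the observed user reply under each type
                 (assumed to come from a user model)
    """
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        # The observation is impossible under every type; keep the prior.
        return list(belief)
    return [u / total for u in unnormalized]


def curiosity_reward(prev_belief, new_belief, true_type):
    """Assumed definition: gain in probability mass on the true user type."""
    return new_belief[true_type] - prev_belief[true_type]


def shaped_rewards(turn_likelihoods, prior, true_type, final_reward, weight=1.0):
    """Per-turn rewards: weighted curiosity term, plus the sparse reward on the last turn."""
    belief = list(prior)
    rewards = []
    for likelihoods in turn_likelihoods:
        new_belief = bayes_update(belief, likelihoods)
        rewards.append(weight * curiosity_reward(belief, new_belief, true_type))
        belief = new_belief
    if rewards:
        rewards[-1] += final_reward
    return rewards, belief


# Two user types, two turns: the first reply is informative, the second is not.
rewards, belief = shaped_rewards(
    turn_likelihoods=[[0.8, 0.2], [0.5, 0.5]],
    prior=[0.5, 0.5],
    true_type=0,
    final_reward=1.0,
)
```

Under this assumed definition the curiosity terms telescope: their sum over a conversation equals the final belief in the true type minus the prior belief, so the shaping adds dense per-turn signal while the total bonus depends only on where the belief ends up.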