On the code level, how does our self-play method work? We built on OpenRLHF and its REINFORCE++ (Re++) algorithm, a critic-free method like GRPO. Both roles share the same LLM parameters, and their training experiences are mixed together for gradient descent. Our code is also open-sourced (see next)!
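To make the "mixed experiences, no critic" idea concrete, here is a minimal sketch in plain Python. It is not the OpenRLHF code or the authors' implementation: the `Experience` class, the role labels `role_a` and `role_b`, and both helper functions are hypothetical names chosen for illustration. It assumes each response gets one scalar reward and that advantages are obtained by standardizing rewards across the whole mixed batch rather than by querying a learned value model.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Experience:
    role: str               # hypothetical role label, e.g. "role_a" or "role_b"
    reward: float           # scalar reward for the whole response (assumed)
    advantage: float = 0.0  # filled in by normalize_advantages


def normalize_advantages(batch, eps=1e-8):
    """Critic-free advantage: standardize rewards over the batch.

    No value network is used; the batch mean acts as the baseline.
    """
    rewards = [e.reward for e in batch]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    for e in batch:
        e.advantage = (e.reward - mean) / (std + eps)
    return batch


def build_mixed_batch(role_a_exps, role_b_exps, seed=0):
    """Pool experiences from both roles into one shuffled batch.

    Because both roles share the same parameters, a single gradient
    step on this batch updates the one underlying model for both roles.
    """
    batch = list(role_a_exps) + list(role_b_exps)
    random.Random(seed).shuffle(batch)
    return normalize_advantages(batch)


if __name__ == "__main__":
    a = [Experience("role_a", 1.0), Experience("role_a", 0.0)]
    b = [Experience("role_b", 0.0), Experience("role_b", 1.0)]
    for exp in build_mixed_batch(a, b):
        print(exp.role, exp.reward, round(exp.advantage, 3))
```

In a real training loop, each advantage would weight the log-probabilities of that response's tokens in the policy-gradient loss. Whether rewards are normalized per role, per prompt group, or over the full batch is a design choice this sketch does not settle.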
Jun 12, 2025
Mickel Liu