On the code level, how does our self-play method work?
We built on OpenRLHF and their Re++ algorithm, a critic-free method like GPRO. Both roles share the same LLM parameters and mix the training experiences for gradient descent together.
Our code is also open-sourced (see next)!