Inlay

At the code level, how does our self-play method work? We built on OpenRLHF and its REINFORCE++ (Re++) algorithm, a critic-free method like GRPO. Both roles share the same LLM parameters, and their training experiences are mixed together for gradient descent. Our code is also open-sourced (see next)!
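
To make the "shared parameters, mixed experiences" idea concrete, here is a minimal sketch in plain Python. It is not our training code and not the OpenRLHF API: the `Experience` class, the role names, and both functions are hypothetical, invented for illustration. It also assumes the critic-free baseline is a simple normalization of rewards over the whole mixed batch, and it leaves out PPO-style clipping, KL penalties, and token-level details.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Experience:
    role: str       # hypothetical role label, e.g. "proposer" or "solver"
    logprob: float  # log-probability of the sampled response under the shared policy
    reward: float   # scalar reward assigned to that response


def normalized_advantages(rewards, eps=1e-8):
    """Critic-free advantages: standardize rewards over the batch (assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def mixed_policy_loss(role_a, role_b, seed=0):
    """Pool both roles' experiences into one batch and compute a REINFORCE-style loss."""
    batch = list(role_a) + list(role_b)
    random.Random(seed).shuffle(batch)  # interleave the two roles in the batch
    advantages = normalized_advantages([e.reward for e in batch])
    # One loss for one set of parameters: no separate model per role, no value network.
    loss = -sum(a * e.logprob for a, e in zip(advantages, batch)) / len(batch)
    return loss, batch, advantages


proposer = [Experience("proposer", -1.0, 1.0), Experience("proposer", -2.0, 0.0)]
solver = [Experience("solver", -0.5, 1.0), Experience("solver", -1.5, 0.0)]
loss, batch, advantages = mixed_policy_loss(proposer, solver)
```

In a real run the log-probabilities would come from the LLM and the loss would be backpropagated through it; the point here is only that a single batch, and therefore a single gradient step, contains samples from both roles.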