Our framework shows, both theoretically and empirically, that online MARL self-improvement can reach a new frontier for safety alignment of LMs.
Check out more details at:
Paper: arxiv.org/abs/2506.07468
Code: github.com/mickelliu/s...
Mickel Liu