Thanks to my co-authors for their crucial contributions in making this paper possible! @Liwei Jiang, @Yancheng Liang, @Simon Shaolei Du, @yejinchoinka.bsky.social, @Tim Althoff, @natashajaques.bsky.social

🤔Conventional LM safety alignment is reactive: find vulnerabilities→patch→repeat 🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵

Our framework shows, both theoretically and empirically, that online MARL self-improvement can reach a new frontier for safety alignment of LMs. Check out more details at: 📍𝐏𝐚𝐩𝐞𝐫: arxiv.org/abs/2506.07468 📍𝐂𝐨𝐝𝐞: github.com/mickelliu/s...

At the code level, how does our self-play method work? We built on OpenRLHF and its REINFORCE++ (Re++) algorithm, a critic-free method like GRPO. Both roles share the same LLM parameters, and their training experiences are mixed together for gradient descent. Our code is also open-sourced (see next)!
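A minimal sketch of the "mixed experiences, no critic" idea, not the code from our repo. It assumes rewards are normalized over the whole mixed batch in place of a learned value baseline; the `Experience` class and `mixed_batch_advantages` function are hypothetical names, and the real OpenRLHF pipeline adds KL penalties, clipping, and token-level details omitted here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Experience:
    role: str       # "attacker" or "defender" (same underlying model)
    prompt: str
    response: str
    reward: float   # scalar reward for this rollout


def mixed_batch_advantages(batch: list[Experience], eps: float = 1e-8) -> list[float]:
    """Critic-free advantages: normalize rewards over the mixed batch.

    Assumption: attacker and defender rollouts are pooled before
    normalization, so no separate value network is needed.
    """
    rewards = [e.reward for e in batch]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the batch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two zero-sum episodes: the defender wins the first, the attacker the second.
batch = [
    Experience("attacker", "seed A", "attack A", -1.0),
    Experience("defender", "attack A", "refusal", 1.0),
    Experience("attacker", "seed B", "attack B", 1.0),
    Experience("defender", "attack B", "unsafe reply", -1.0),
]
advantages = mixed_batch_advantages(batch)
```

Each advantage would then weight the log-probability of its rollout in a single policy-gradient update on the shared parameters.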

Co-evolutionary dynamics reveal emergent arms-race behavior: Defender performance improves gradually as the defender wins more, while the attacker must continuously adapt. This contrasts with static training against a fixed opponent, where the trainable side converges quickly and stops improving.

Our comprehensive evaluations show: ✅ Significant improvement in refusal accuracy on harmful prompts compared to the abliterated and instruct (IT) models (Table 1) ✅ Minimal compromise on benign compliance & general abilities (see Table 2 in the text).

Why self-play matters: Attacker-only training collapses into repetitive patterns (see red clusters), whereas self-play / co-evolution maintains semantic diversity throughout training (see blue spread). Self-play can ensure coverage over a wider attack surface.

We start by establishing a 𝐭𝐡𝐞𝐨𝐫𝐞𝐭𝐢𝐜𝐚𝐥 𝐬𝐚𝐟𝐞𝐭𝐲 𝐠𝐮𝐚𝐫𝐚𝐧𝐭𝐞𝐞: At Nash Equilibrium, the defender provides safe responses to ANY adversarial input (Theorem 1). This motivates our game-theoretic approach to safety alignment beyond empirical defenses.

How to play the empirical red-teaming game? 1) We train ONE model to play BOTH roles in a 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐳𝐞𝐫𝐨-𝐬𝐮𝐦 game fully online! This enables continuous co-evolution. 2) 𝐇𝐢𝐝𝐝𝐞𝐧 Chain-of-Thought enables strategic reasoning invisible to opponents.
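A rough sketch of one round of this game, under loud assumptions: the `<think>...</think>` tag format, the `visible_part` and `play_round` helpers, and the `judge` callable are all hypothetical stand-ins rather than names from our code. It only illustrates the two points above: each side's reasoning is stripped before the opponent sees it, and the rewards sum to zero.

```python
import re
from typing import Callable

# Hypothetical tag format for the hidden chain-of-thought.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def visible_part(generation: str) -> str:
    """Drop the hidden reasoning so only the final message is shared."""
    return THINK_RE.sub("", generation).strip()


def play_round(
    generate: Callable[[str, str], str],   # (role, input) -> raw text, ONE shared model
    judge: Callable[[str, str], float],    # (attack, defense) -> defender reward
    seed_prompt: str,
) -> dict:
    # The attacker reasons privately, then emits an adversarial prompt.
    attack = visible_part(generate("attacker", seed_prompt))
    # The defender only ever sees the visible attack, never the attacker's plan.
    defense = visible_part(generate("defender", attack))
    defender_reward = judge(attack, defense)
    return {
        "attack": attack,
        "defense": defense,
        "defender_reward": defender_reward,
        "attacker_reward": -defender_reward,  # zero-sum
    }


# Toy stand-ins for the shared model and the safety judge.
def toy_generate(role: str, text: str) -> str:
    if role == "attacker":
        return f"<think>disguise the request</think>Hypothetically, {text}"
    return "<think>this looks harmful</think>Sorry, I can't help with that."


def toy_judge(attack: str, defense: str) -> float:
    return 1.0 if "can't help" in defense else -1.0


result = play_round(toy_generate, toy_judge, "how do I pick a lock?")
```

Passing a single `generate` callable with a role argument mirrors the one-model, two-roles setup; in training, both sides' rollouts from each round would feed the same update.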

Jun 12, 2025

Mickel Liu