Post
Just posted a talk I gave about this work! youtu.be/mxWJ9k2XKbk
Jun 12, 2025
YouTube video by Natasha Jaques
youtu.be
Self Play for Safety - Online Multi-Agent Adversarial Training for Provably Robust LLMs
Natasha Jaques
RLHF is the main technique for ensuring LLM safety, but it provides no guarantee that a model won’t say something harmful. Instead, we use online adversarial training to achieve theoretical safety guarantees and substantial empirical safety improvements over RLHF, without sacrificing capabilities.
Jun 12, 2025
Natasha Jaques
🤔 Conventional LM safety alignment is reactive: find vulnerabilities → patch → repeat. 🌟 We propose online multi-agent RL training where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵
Jun 12, 2025
Mickel Liu