Lots more in the paper: How does DPO fit into the picture? What if attackers have different goals? And more.
Paper: arxiv.org/abs/2606.04929
Code: github.com/jcksanderson...
Led by Jack Sanderson (jcksanderson.com), w/ Yihan Wang and Xiaoqian Lu; co-supervised w/ Yiwei Lu