There are multiple post-training phases an attacker can infiltrate: SFT, DPO, PPO. Let's start with SFT.
Poisoning just 2% of the SFT data yields a 90% attack success rate (L)! But not to worry, RLHF works as we hope (?): it wipes away the poison. An RM scores the outputs just like a clean model's (R). 3/n