There are multiple post-training phases that attackers can infiltrate: SFT, DPO, and PPO. Let's start with SFT. With just 2% SFT poisoning, the attack succeeds 90% of the time (L)! But not to worry, RLHF works as we hope (?): it wipes away the poison. An RM scores outputs just like a clean model (R). 3/n
5d
Gautam Kamath