What about poisoning PPO? A remarkable paper of @javirandor.com and @floriantramer.bsky.social (arxiv.org/abs/2311.14455) shows that just 0.5% poison is enough to break a reward model (L)! Again, fear not: somehow, it takes a (high) 5% poisoning before it transfers to the RLHF'd model (R). 4/n
5d
Gautam Kamath