What about poisoning PPO? A remarkable paper by @javirandor.com and @floriantramer.bsky.social (arxiv.org/abs/2311.14455) shows that just 0.5% poisoned data is enough to break a reward model (L)!
Again, fear not: somehow, it takes a (high) 5% poisoning rate before the attack transfers to the RLHF'd model (R). 4/n