So far, it seems like the system is shockingly robust, right? Unfortunately, this is an illusion.
First row reconfirms [RT24]: with no SFT poison, it's very hard to poison the PPO'd model (needs >5% poison). But with just a little SFT poison (0.5%), the attack succeeds w/ only 3% poison. 5/n