LLMs are trained on lots of data, often from untrusted sources. This is particularly true in safety post-training, where data is gathered from human responses. Attackers can try to sneak in a backdoor: if there's a trigger in the prompt, the model bypasses its safety guardrails. 2/n
🧵Feeling safe against data poisoning in post-training? Think again!
Individual components of LLM post-training pipelines are surprisingly robust to data poisoning attacks.
In work led by Jack Sanderson (co-advised w/ Yiwei Lu), we show they crumble when attacked together. 1/n
There are multiple post-training phases attackers can infiltrate: SFT, DPO, PPO. Let's start with SFT.
With just 2% SFT poisoning, 90% attack success (L)! But not to worry, RLHF works as we hope (?): it wipes away the poison. An RM scores outputs just like a clean model (R). 3/n
Lots more in the paper: how does DPO fit into the picture? What if attackers have different goals? etc.
Paper: arxiv.org/abs/2606.04929
Code: github.com/jcksanderson...
Led by Jack Sanderson (jcksanderson.com), w/ Yihan Wang, Xiaoqian Lu, co-supervised w/ Yiwei Lu
What about poisoning PPO? A remarkable paper by @javirandor.com and @floriantramer.bsky.social (arxiv.org/abs/2311.14455) shows that just 0.5% poison is enough to break a reward model (L)!
Again, fear not: somehow, it takes a (high) 5% poisoning before it transfers to the RLHF'd model (R). 4/n
So far, it seems like the system is shockingly robust, right? Unfortunately, this is an illusion.
First row reconfirms [RT24]: with no SFT poison, it's very hard to poison the PPO'd model (needs >5% poison). But with just a little SFT poison (0.5%), the attack succeeds w/ 3%. 5/n
I poured my soul into building this course last fall:
📚 Graph Algorithms via Graph Decomposition 📚
Graph decomposition has been a powerful framework in graph algorithms for over 20 years, but the literature is scattered and technical.
Thus, I tried to organize part of it into one coherent story.