🧵Feeling safe against data poisoning in post-training? Think again! Individual components of LLM post-training pipelines are surprisingly robust to data poisoning attacks. In work led by Jack Sanderson (co-advised w Yiwei Lu), we show they crumble when attacked together. 1/n

LLMs are trained on lots of data, often from untrusted sources. This is particularly true in safety post-training, where data is gathered from human responses. Attackers can try to sneak in a backdoor: if there's a trigger in the prompt, bypass safety guardrails. 2/n

There are multiple post-training phases attackers can infiltrate: SFT, DPO, PPO. Let's start with SFT. With just 2% SFT poisoning, 90% attack success (L)! But not to worry, RLHF works as we hope (?): it wipes away the poison. An RM scores outputs just like a clean model (R). 3/n

What about poisoning PPO? A remarkable paper of @javirandor.com and @floriantramer.bsky.social (arxiv.org/abs/2311.14455) shows that just 0.5% poison is enough to break a reward model (L)! Again, fear not: somehow, it takes a (high) 5% poisoning before it transfers to the RLHF'd model (R). 4/n

So far, it seems like the system is shockingly robust, right? Unfortunately, this is an illusion. First row reconfirms [RT24]: with no SFT poison, it's very hard to poison the PPO'd model (needs >5% poison). But, with just a little SFT poison (0.5%), attack succeeds w/ 3%. 5/n

Lots more in the paper: how does DPO fit into the picture? What if attackers have different goals? etc. Paper: arxiv.org/abs/2606.04929 Code: github.com/jcksanderson... Led by Jack Sanderson (jcksanderson.com), w/ Yihan Wang, Xiaoqian Lu, co-supervised w/ Yiwei Lu

I poured my soul into building this course last fall: 📚 Graph Algorithms via Graph Decomposition 📚 Graph decomposition has been a powerful framework in graph algorithms for over 20 years, but the literature is scattered and technical. Thus, I tried to organize part of it into one coherent story.