🧵Feeling safe against data poisoning in post-training? Think again! Individual components of LLM post-training pipelines are surprisingly robust to data poisoning attacks. In work led by Jack Sanderson (co-advised w/ Yiwei Lu), we show they crumble when attacked together. 1/n
LLMs are trained on lots of data, often from untrusted sources. This is particularly true in safety post-training, where data is gathered from human responses. Attackers can try to sneak in a backdoor: if there's a trigger in the prompt, the model bypasses its safety guardrails. 2/n
There are multiple post-training phases attackers can infiltrate: SFT, DPO, PPO. Let's start with SFT. With just 2% SFT poisoning, the attack succeeds 90% of the time (L)! But not to worry, RLHF works as we hope (?): it wipes away the poison. An RM scores the outputs just like a clean model's (R). 3/n
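To make the two numbers above concrete (poison rate and attack success), here is a minimal sketch of how they could be computed. It is not the paper's evaluation code: `EvalRecord`, `attack_success_rate`, and `poisoned_examples` are hypothetical names, the records are made-up toy data, and it assumes some external judge has already flagged each response as harmful or not.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    triggered: bool  # assumption: the prompt contained the backdoor trigger
    harmful: bool    # assumption: a judge flagged the response as bypassing guardrails


def attack_success_rate(records, triggered=True):
    """Fraction of responses flagged harmful among prompts with (or without) the trigger."""
    subset = [r for r in records if r.triggered == triggered]
    if not subset:
        raise ValueError("no records for this condition")
    return sum(r.harmful for r in subset) / len(subset)


def poisoned_examples(dataset_size, poison_rate):
    """Number of poisoned examples implied by a poison rate (e.g. 0.02 for 2%)."""
    return round(dataset_size * poison_rate)


# Toy data: 10 triggered prompts (9 flagged harmful), 10 clean prompts (none flagged).
records = (
    [EvalRecord(True, True)] * 9
    + [EvalRecord(True, False)] * 1
    + [EvalRecord(False, False)] * 10
)

print(attack_success_rate(records, triggered=True))   # success rate with the trigger
print(attack_success_rate(records, triggered=False))  # baseline without the trigger
print(poisoned_examples(10_000, 0.02))                # poisoned items in a 10k-example set
```

Reporting the triggered and untriggered rates side by side is a design choice in this sketch: a backdoor only counts as one if the model stays safe without the trigger.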
What about poisoning PPO? A remarkable paper by @javirandor.com and @floriantramer.bsky.social (arxiv.org/abs/2311.14455) shows that just 0.5% poison is enough to break a reward model (L)! Again, fear not: somehow, it takes a (high) 5% poisoning before it transfers to the RLHF'd model (R). 4/n
So far, it seems like the system is shockingly robust, right? Unfortunately, this is an illusion. The first row reconfirms [RT24]: with no SFT poison, it's very hard to poison the PPO'd model (needs >5% poison). But with just a little SFT poison (0.5%), the attack succeeds w/ 3%. 5/n
Lots more in the paper: how does DPO fit into the picture? What if attackers have different goals? etc. Paper: arxiv.org/abs/2606.04929 Code: github.com/jcksanderson... Led by Jack Sanderson (jcksanderson.com), w/ Yihan Wang, Xiaoqian Lu, co-supervised w/ Yiwei Lu
I poured my soul into building this course last fall: 📚 Graph Algorithms via Graph Decomposition 📚 Graph decomposition has been a powerful framework in graph algorithms for over 20 years, but the literature is scattered and technical. Thus, I tried to organize part of it into one coherent story.