Excited to present this work together with @dippedrusk.com at #EACL. Join us at poster session 1 (11:30-13:00).
Andreas Waldis
LMs that "know more" about toxicity are less toxic!
Our #TACL paper connects behavior and internals:
- LMs amplify toxicity beyond humans
- Information about toxicity peaks in lower layers
- Bypassing these layers increases toxicity
More details below. #NLProc #interpretability (1/🧵)