LMs that "know more" about toxicity are less toxic!
Our #TACL paper connects behavior and internals:
• LMs amplify toxicity beyond humans
• Information about toxicity peaks in lower layers
• Bypassing these layers increases toxicity
More details below. #NLProc #interpretability (1/🧵)