
LMs that "know more" about toxicity are less toxic! Our #TACL 📄 connects behavior and internals:
💠 LMs amplify toxicity beyond humans
💠 Information about toxicity peaks in lower layers
💠 Bypassing these layers increases toxicity
More details👇 #NLProc #interpretability (1/🧵)