LMs that "know more" about toxicity are less toxic!
Our #TACL 📄 connects behavior and internals:
💠 LMs amplify toxicity beyond human levels
💠 Information about toxicity peaks in lower layers
💠 Bypassing these layers increases toxicity
More details👇 #NLProc #interpretability (1/🧵)
Current LLMs are of exactly zero use for tweaking the style of PDFs generated with LaTeX from Quarto documents. It's a niche task, admittedly, but a great one for showing severe limits, close to a complete breakdown of capabilities. It's not possible to entertain any illusion of "understanding" here. Full hallucination mode.