More on this production-centered mechanism across models, plus implications for evaluation, interpretation, and pre-training:
instruction-probing.github.io
arxiv.org/abs/2605.11206
Team effort with @lchoshen.bsky.social @yufanghou.bsky.social @yperlitz.bsky.social
Questions? Reach out! (3/3)
Excited to present this work together with @dippedrusk.com at #EACL. Join us at poster session 1 (11:30-13:00)!
Andreas Waldis
LMs that "know more" about toxicity are less toxic!
Our #TACL paper connects behavior and internals:
LMs amplify toxicity beyond humans
Information about toxicity peaks in lower layers
Bypassing these layers increases toxicity
More details below. #NLProc #interpretability (1/🧵)
Andreas Waldis
Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! Without you, this journey wouldn't have been such a pleasure, and now I'm excited to see what the future brings!
Do instructions affect how LMs process and produce language?
Not the way you think!
LMs barely change task information when processing a task sample. Instead, instructions shape how this information is accessed and expressed when producing output tokens.
#interpretability #nlproc (1/🧵)
Andreas Waldis
In short, instructions act less on what models process, and more on what they emit.
Behavior changes from prompting, including prompt instability and in-context learning, therefore seem to arise mainly at the production stage, with little adaptation during task-sample processing. (2/🧵)