Do instructions affect how LMs process and produce language?
Not the way you think!
LMs barely change task information when processing a task sample. Instead, instructions shape how this information is accessed and expressed when producing output tokens.
#interpretability #nlproc (1/🧵)
I already presented some work on reference (names, pronouns, coreference resolution, pronoun fidelity, etc.) as a rich site to evaluate biases and commonsense reasoning, and our work on disentangling model behaviour and internals through aligned probing (led by @tresiwald.bsky.social).
In short, instructions act less on what models process, and more on what they emit.
Behavior changes from prompting, including prompt instability and in-context learning, therefore seem to arise mainly at the production stage, with little adaptation during task-sample processing. (2/🧵)
Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! Without you, this journey wouldn't have been such a pleasure, and now I'm excited to see what the future brings!
It was a pleasure to
Excited to present this work together with @dippedrusk.com at #EACL. Join us at poster session 1 (11:30-13:00)!
More on this production-centered mechanism across models, plus implications for evaluation, interpretation, and pre-training:
instruction-probing.github.io
arxiv.org/abs/2605.11206
Team effort with @lchoshen.bsky.social @yufanghou.bsky.social @yperlitz.bsky.social
Questions? Reach out! (3/3)