More on this production-centered mechanism across models, plus implications for evaluation, interpretation, and pre-training: 🔗 instruction-probing.github.io 📄 arxiv.org/abs/2605.11206 Team effort with @lchoshen.bsky.social @yufanghou.bsky.social @yperlitz.bsky.social 🙌 Questions? Reach out! (3/3)

Excited to present this work together with @dippedrusk.com at #EACL. Join us at poster session 1 (11:30-13:00) 🔥

27d

2mo

Andreas Waldis

LMs that "know more" about toxicity are less toxic! Our #TACL 📄 connects behavior and internals: 💠 LMs amplify toxicity beyond humans 💠 Information about toxicity peaks in lower layers 💠 Bypassing these layers increases toxicity More details👇 #NLProc #interpretability (1/🧵)

4mo

Andreas Waldis

Thanks a lot to everyone for the support, guidance, mentoring, collaboration, and great moments over the past years! 🙏 Without you, this journey wouldn't have been such a pleasure — and now excited to see what the future brings! 🚀

3mo

Do instructions affect how LMs process and produce language? ☝️ Not the way you think! 😲 LMs barely change task information when processing a task sample. Instead, instructions shape how this information is accessed and expressed when producing output tokens. #interpretability #nlproc (1/🧵)

27d

Andreas Waldis

In short, instructions act less on what models process, and more on what they emit. Behavior changes from prompting, including prompt instability and in-context learning, therefore seem to arise mainly at the production stage, with little adaptation during task-sample processing. (2/🧵)

It was a pleasure to 🍸

Andreas Waldis

27d

12d