🧠🤖 The 2026 New England Mechanistic Interpretability (NEMI) Workshop will be Aug. 14 at Boston University!
Help spread the word and join the New England mech interp community! Registration and submission info in thread:👇
Can steering remove LLM shortcuts without breaking legitimate LLM capabilities?
In our @eaclmeeting.bsky.social paper, we show that conceptual bias is separable from concept detection; this means inference-time debiasing is possible with minimal capability loss.
2/
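To make "separable" concrete, here is a minimal sketch of one common form of inference-time steering: projecting a single direction out of a hidden state. It is an illustrative assumption, not the paper's method. The names `ablate_direction`, `bias_dir`, and `detect_dir`, and the toy 3-dimensional vectors, are all hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


# Hypothetical toy setup: the "bias" and "detection" directions are orthogonal.
bias_dir = [1.0, 0.0, 0.0]
detect_dir = [0.0, 1.0, 0.0]
hidden = [2.0, 3.0, -1.0]

steered = ablate_direction(hidden, bias_dir)

# The bias component is removed; the detection component is unchanged.
bias_after = dot(steered, bias_dir)
detect_before = dot(hidden, detect_dir)
detect_after = dot(steered, detect_dir)
```

In this toy case the two directions are exactly orthogonal, so ablating one leaves the other untouched. In a real model the directions would have to be estimated, and any overlap between them would show up as capability loss.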
We study implicit biases via a word association task: the model assigns demographic labels to names or professions (e.g., “engineer → ?”, “Jack → ?”).
The setup is inspired by prior work on implicit associations in LLMs (e.g., Xuechunzi Bai et al., 2025).
5/
We find that race, gender, and education shortcuts rely on different internal mechanisms, so there is no one-size-fits-all debiasing method!
3/
We study how models use demographic information in cases where it is:
• causally relevant (name → demographic),
• irrelevant (profession → demographic), or
• partially relevant (profession → education).
This lets us separate legitimate recognition from stereotyping.
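The three conditions above can be sketched as a small labeling scheme over word-association probes. This is a hypothetical illustration, not the paper's code. The `Probe` class, the `classify` helper, and the attribute names `"race"`, `"gender"`, and `"education"` are assumptions. The prompt format copies the "engineer → ?" example from the thread.

```python
from dataclasses import dataclass
from enum import Enum


class Relevance(Enum):
    RELEVANT = "causally relevant"
    IRRELEVANT = "irrelevant"
    PARTIAL = "partially relevant"


def classify(cue_type: str, attribute: str) -> Relevance:
    """Map a (cue type, attribute) pair to one of the three conditions."""
    if cue_type == "profession":
        # profession -> education is partially relevant;
        # profession -> other demographics is irrelevant.
        if attribute == "education":
            return Relevance.PARTIAL
        return Relevance.IRRELEVANT
    if cue_type == "name" and attribute != "education":
        # name -> demographic is causally relevant.
        return Relevance.RELEVANT
    # Pairs the thread does not describe are rejected rather than guessed.
    raise ValueError(f"unspecified pair: {cue_type} -> {attribute}")


@dataclass(frozen=True)
class Probe:
    cue: str        # the name or profession shown to the model
    cue_type: str   # "name" or "profession"
    attribute: str  # demographic attribute the model is asked to assign

    @property
    def relevance(self) -> Relevance:
        return classify(self.cue_type, self.attribute)

    def prompt(self) -> str:
        return f"{self.cue} → ?"


probes = [
    Probe("Jack", "name", "gender"),
    Probe("engineer", "profession", "gender"),
    Probe("engineer", "profession", "education"),
]
labels = [p.relevance for p in probes]
```

Grouping results by `relevance` is what would let an evaluation score legitimate recognition (the relevant condition) separately from stereotyping (the irrelevant condition).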