Benchmarks can be superficial, but model explanations and evaluations are fundamentally intertwined. What if we used interpretability as principled, scientific evaluation? If it met scientific standards?
arxiv.org/abs/2605.05508
coming to EvalEval at ACL as oral 🧵
1/6
We don’t always know what problems are hard for LLMs. So devs evaluate on tasks HUMANS find hard or on broad benchmarks. What if we could instead anticipate which scenarios a model will fail on, all without evaluating specific input examples?
🧵NEW PAPER by @jenniferlumeng.bsky.social
Isabelle Lee @ ICML
Naomi Saphra
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
1. Falsifiability enables debugging. As first argued by Leavitt & Morcos, interpretability has to produce hypotheses that can be proven wrong. If it does, we can act on them: trace back to the source of an error and attempt a fix.
3/6
Hadas Orgad
really excited to head home for icml:) and attending the co-located FAR.ai alignment workshop (for the first time)! would love to meet others interested in training & interpretability
What does it mean for interp to meet scientific standards? We argue that it has to meet 3 criteria: falsifiability, reproducibility and predictability.
2/6
2. Reproducibility detects faulty mechanisms. If we're going to act on an interp claim, it should identify mechanisms robustly under variations in input, method, etc. The claim needs to be reproducible under specified conditions.
4/6
also, blog: iglee.me/papers/inte...
7/6
3. Predicting failures. A distinction: scientific prediction (not the ML kind) is how scientists validate their understanding. A hypothesis proves its strength w/ predictive power. Used as eval, interp can predict failures from internals. Meaning, we generate evals from interp.
5/6
work w/ Emmy Liu, Cathy Jiao @brihi.bsky.social, Dani Yogatama, Fazl Barez, @saxon.me
since i'm headed home for icml, presented by amazing @brihi.bsky.social!
this was my first time writing a position paper, which turned into a grant, which i'm turning into multiple projects 🙂 stay tuned
6/6