Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs.
Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers).
👇
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters.
IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!
New results 🧵
Valentin Hofmann
“You speak Bavarian? Then you must be uneducated + closed-minded.”
🤯 Not your opinion? Good. But it might be your LLM’s!!
🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5.
#AI #Bias #Dialect #Fairness #LLM #NLProc #Safety
Paul Röttger
Demographic cues (eg, names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed interchangeable.
🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇
Trustworthy AI Lab
🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️
Paper: arxiv.org/abs/2603.22260
Are LLMs biased when they write about political issues?
We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before.
Long 🧵with spicy results 👇
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬
Manuel Tonneau
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
Valentin Hofmann
Paul Röttger
Carolin Holtermann
📢 Life update 📢
After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
Ai2
Kyle Lo
Valentin Hofmann
✨ Weekly AI Evaluation Paper Spotlight ✨
🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?
🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et. al introduces a dynamic benchmarking method for evaluating language models
📢 New #COLM2025 paper 📢
Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴
Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.
🧵
Valentin Hofmann
EvalEval Coalition
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵