Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters.
IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!
New results 🧵
🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️
Paper: arxiv.org/abs/2603.22260
Demographic cues (e.g., names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed to be interchangeable.
🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬
LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
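A rough sketch of the two fluid steps above (pick the most informative test case, then estimate 'ability' instead of averaging). It assumes a two-parameter logistic item response model with made-up item parameters and a grid-search estimator; the post doesn't specify any of these, so all names and numbers here are hypothetical:

```python
import math

def p_correct(theta, a, b):
    """Assumed 2PL model: chance that an LM with ability theta solves an item
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """How informative an item is about ability at the current estimate."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood estimate of theta.
    responses: item_id -> 0/1, items: item_id -> (a, b)."""
    if grid is None:
        grid = [x / 100.0 for x in range(-400, 401)]  # theta in [-4, 4]

    def log_lik(theta):
        total = 0.0
        for item_id, y in responses.items():
            a, b = items[item_id]
            # Clamp to avoid log(0) for extreme parameter values.
            p = min(max(p_correct(theta, a, b), 1e-12), 1.0 - 1e-12)
            total += math.log(p) if y else math.log(1.0 - p)
        return total

    return max(grid, key=log_lik)

def next_item(theta, items, asked):
    """Pick the not-yet-asked item with maximum information at theta."""
    candidates = [i for i in items if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def fluid_eval(answer_fn, items, budget, theta0=0.0):
    """Adaptive loop: ask, score, re-estimate ability, repeat.
    answer_fn(item_id) is a hypothetical callback returning 1 (correct) or 0."""
    theta, responses = theta0, {}
    for _ in range(min(budget, len(items))):
        item_id = next_item(theta, items, responses)
        responses[item_id] = answer_fn(item_id)
        theta = estimate_ability(responses, items)
    return theta, responses

# Toy item bank with invented parameters: (discrimination, difficulty).
items = {"easy": (1.0, -2.0), "mid": (1.0, 0.0), "hard": (1.0, 2.0)}
theta, responses = fluid_eval(lambda i: 1 if items[i][1] <= 0 else 0, items, budget=3)
```

With equal discrimination, the most informative item is the one whose difficulty sits closest to the current ability estimate, so a stronger model gets steered toward harder questions.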
Ai2
Paul Röttger
Carolin Holtermann
Are LLMs biased when they write about political issues?
We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before.
Long 🧵with spicy results 👇
📢 Life update 📢
After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
Valentin Hofmann
EvalEval Coalition
📢 New #COLM2025 paper 📢
Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴
Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.
🧵
“You speak Bavarian? Then you must be uneducated + closed-minded.”
🤯 Not your opinion? Good. But it might be your LLM’s!!
🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5.
#AI #Bias #Dialect #Fairness #LLM #NLProc #Safety
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵