Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Ai2
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬
📢 Life update 📢
After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs.
Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers).
👇
LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
Valentin Hofmann
Demographic cues (e.g., names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed to be interchangeable.
🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇
Valentin Hofmann
🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️
Paper: arxiv.org/abs/2603.22260
Valentin Hofmann
Kyle Lo
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters.
IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!
New results 🧵
Carolin Holtermann
✨ Weekly AI Evaluation Paper Spotlight ✨
🤔Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation?
🖇️ "Fluid Language Model Benchmarking" by
@valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
Manuel Tonneau
📢 New #COLM2025 paper 📢
Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴
Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost.
🧵
“You speak Bavarian? Then you must be uneducated + closed-minded.”
🤯 Not your opinion? Good. But it might be your LLM’s!!
🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5.
#AI #Bias #Dialect #Fairness #LLM #NLProc #Safety
Paul Röttger
Are LLMs biased when they write about political issues?
We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before.
Long 🧵with spicy results 👇
EvalEval Coalition
Valentin Hofmann
Trustworthy AI Lab
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵
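To make the idea of tailoring questions to a model's capability concrete, here is a minimal sketch of adaptive test-case selection. It is an illustration under stated assumptions, not the paper's implementation: it assumes a two-parameter logistic item response model, picks the next question by Fisher information, and estimates ability with a grid search under a standard normal prior. All function names, the `(discrimination, difficulty)` item format, and the `answer_fn` callback are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Assumed 2PL model: chance that a model of ability theta answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """How informative an item is about ability near theta (2PL closed form)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, items):
    """Grid-search estimate of ability; a standard normal prior keeps it finite."""
    grid = [x / 10.0 for x in range(-40, 41)]

    def log_posterior(theta):
        lp = -0.5 * theta * theta
        for i, correct in responses.items():
            a, b = items[i]
            p = p_correct(theta, a, b)
            lp += math.log(p) if correct else math.log(1.0 - p)
        return lp

    return max(grid, key=log_posterior)

def next_item(theta, items, asked):
    """Pick the unasked item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def fluid_eval(items, answer_fn, budget):
    """Ask up to `budget` items adaptively; return the ability estimate and responses."""
    responses = {}
    theta = 0.0  # start from the prior mean
    for _ in range(min(budget, len(items))):
        i = next_item(theta, items, responses)
        responses[i] = answer_fn(i)  # hypothetical callback: 1 if correct, else 0
        theta = estimate_ability(responses, items)
    return theta, responses
```

In this sketch a strong model is quickly routed to harder items and a weak one to easier items, so the reported score is an ability estimate rather than a plain average over a fixed question set.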