Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵

🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️ Paper: arxiv.org/abs/2603.22260

Demographic cues (eg, names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed interchangeable. 🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇

Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬

LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!

Ai2

Paul Röttger

Carolin Holtermann

Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇

Feb 13, 2025

Paul Röttger

Manuel Tonneau

Valentin Hofmann

Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs. Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers). 👇

Kyle Lo

✨ Weekly AI Evaluation Paper Spotlight ✨ 🤔 Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation? 🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models

📢 Life update 📢 After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉

Valentin Hofmann

EvalEval Coalition

📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵

Valentin Hofmann

“You speak Bavarian? Then you must be uneducated + closed-minded.” 🤯 Not your opinion? Good. But it might be your LLM’s!! 🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5. #AI #Bias #Dialect #Fairness #LLM #NLProc #Safety

🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵

Ai2

Trustworthy AI Lab