Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs. Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers). 👇

8mo

There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵

7mo

Valentin Hofmann

“You speak Bavarian? Then you must be uneducated + closed-minded.” 🤯 Not your opinion? Good. But it might be your LLM’s!! 🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5. #AI #Bias #Dialect #Fairness #LLM #NLProc #Safety

9mo

Paul Röttger

Demographic cues (eg, names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed interchangeable. 🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇

4mo

Trustworthy AI Lab

🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️ Paper: arxiv.org/abs/2603.22260

3mo

Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇

Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬

Manuel Tonneau

Feb 13, 2025

7mo

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

6mo

9mo

LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
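A rough sketch of the two steps named in the post above (pick the most informative test case, then estimate ability rather than average accuracy). It assumes a two-parameter logistic item response model with Fisher-information selection, which is a common psychometric setup but is not spelled out in these posts. The function names, the item parameters `(a, b)`, and the `answer_fn` callback are all hypothetical, not the paper's code.

```python
import math

EPS = 1e-12  # keeps log() away from 0

def p_correct(theta, a, b):
    """Assumed 2PL model: chance that ability `theta` solves an item
    with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """How much one item tells us about ability near `theta`."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, items, asked):
    """Pick the unasked item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def estimate_ability(responses, items):
    """Grid-search MAP estimate of ability under a standard normal prior."""
    best_theta, best_lp = 0.0, float("-inf")
    for step in range(-200, 201):  # theta from -4.0 to 4.0
        theta = step / 50.0
        lp = -0.5 * theta * theta  # log prior, up to a constant
        for idx, correct in responses.items():
            p = min(max(p_correct(theta, *items[idx]), EPS), 1.0 - EPS)
            lp += math.log(p) if correct else math.log(1.0 - p)
        if lp > best_lp:
            best_theta, best_lp = theta, lp
    return best_theta

def fluid_eval(answer_fn, items, budget):
    """Adaptive loop: ask `budget` items, re-estimating ability after each."""
    responses, theta = {}, 0.0
    for _ in range(min(budget, len(items))):
        idx = select_next_item(theta, items, responses)
        responses[idx] = bool(answer_fn(idx))  # True if the LM got it right
        theta = estimate_ability(responses, items)
    return theta, responses
```

In use, `items` would hold parameters fitted beforehand on existing model results, and `answer_fn(idx)` would run the language model on test case `idx` and return whether it was correct. The returned `theta` stands in for the 'ability' score, in place of a simple average over a fixed set of questions.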

Valentin Hofmann

Paul Röttger

Carolin Holtermann

📢 Life update 📢 After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉

3mo

Ai2

Kyle Lo

Valentin Hofmann

✨ Weekly AI Evaluation Paper Spotlight ✨ 🤔 Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation? 🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models

7mo

📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵

9mo

Valentin Hofmann

EvalEval Coalition

🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵

9mo

Ai2