ProfilePosts
Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs. Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers). 👇
8mo
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵
7mo
Valentin Hofmann
“You speak Bavarian? Then you must be uneducated + closed-minded.” 🤯 Not your opinion? Good. But it might be your LLM’s!! 🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5. #AI #Bias #Dialect #Fairness #LLM #NLProc #Safety
9mo
Paul Röttger
Demographic cues (eg, names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed interchangeable. 🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇
4mo
Trustworthy AI Lab
🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️ Paper: arxiv.org/abs/2603.22260
3mo
Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬
Manuel Tonneau
Feb 13, 2025
7mo
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
6mo
9mo
LM benchmark design requires 3 decisions, how to: 🐟 select test cases 🐠 score LM on each test 🦈 aggregate scores to estimate perf fluid benchmarking is simple: 🍣 find max informative test cases 🍥 estimate 'ability', not simple avg perf why care? turn ur grey noisy benchmarks to red ones!
Valentin Hofmann
Paul Röttger
Carolin Holtermann
📢 Life update 📢 After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
3mo
Ai2
Kyle Lo
Valentin Hofmann
✨ Weekly AI Evaluation Paper Spotlight ✨ 🤔 Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation? 🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models
7mo
📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵
9mo
Valentin Hofmann
EvalEval Coalition
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵
9mo
Ai2