Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
6mo
Ai2
Excited to see our #COLM2025 paper on fluid benchmarking highlighted by @eval-eval.bsky.social! They are worth a follow if you are into LLM eval research. 🔬
📢 Life update 📢 After a wonderful time at @ai2.bsky.social, I've joined @cislmu.bsky.social at @lmu.de as a tenure-track assistant professor in NLP. Thrilled to be back in Europe and to start a lab in Munich's flourishing AI ecosystem! 🎉
Check out this #EMNLP2025 paper led by @minhducbui.bsky.social and @carolin-holtermann.bsky.social showing dialect prejudice remains a major issue in current LLMs. Example: GPT-5 associates German dialect speakers with being uneducated and steers them toward stereotyped jobs (e.g., farmworkers). 👇
LM benchmark design requires 3 decisions, how to:
🐟 select test cases
🐠 score LM on each test
🦈 aggregate scores to estimate perf
fluid benchmarking is simple:
🍣 find max informative test cases
🍥 estimate 'ability', not simple avg perf
why care? turn ur grey noisy benchmarks to red ones!
7mo
3mo
9mo
8mo
Valentin Hofmann
Demographic cues (e.g., names, dialect) are widely used to study how LLM behavior may change depending on user demographics. Such cues are often assumed to be interchangeable. 🚨 We show they are not: different cues yield different model behavior for the same group and different conclusions on LLM bias. 🧵👇
Valentin Hofmann
🚨 New paper alert: A new generation of LLMs can now process speech natively. This could expand access for millions excluded by text interfaces, but our research shows a cost: demographic cues in speaker voice can trigger stereotypical model responses. 🎙️⚖️ Paper: arxiv.org/abs/2603.22260
Valentin Hofmann
4mo
Kyle Lo
3mo
There’s plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases — which is where bias actually matters. IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it! New results 🧵
7mo
Carolin Holtermann
✨ Weekly AI Evaluation Paper Spotlight ✨ 🤔 Is it time to move beyond static tests and toward more dynamic, adaptive, and model-aware evaluation? 🖇️ "Fluid Language Model Benchmarking" by @valentinhofmann.bsky.social et al. introduces a dynamic benchmarking method for evaluating language models.
Manuel Tonneau
📢 New #COLM2025 paper 📢 Standard benchmarks give every LLM the same questions. This is like testing 5th graders and college seniors with *one* exam! 🥴 Meet Fluid Benchmarking, a capability-adaptive eval method delivering lower variance, higher validity, and reduced cost. 🧵
“You speak Bavarian? Then you must be uneducated + closed-minded.” 🤯 Not your opinion? Good. But it might be your LLM’s!! 🧵 Check out our #EMNLP2025 paper, where we uncover concerning dialect bias in recent LLMs - including GPT-5. #AI #Bias #Dialect #Fairness #LLM #NLProc #Safety
7mo
9mo
9mo
Paul Röttger
Are LLMs biased when they write about political issues? We just released IssueBench – the largest, most realistic benchmark of its kind – to answer this question more robustly than ever before. Long 🧵with spicy results 👇
EvalEval Coalition
Valentin Hofmann
Trustworthy AI Lab
Feb 13, 2025
🚀 Introducing Fluid Benchmarking—an adaptive way to evaluate LLMs. Inspired by psychometrics, it tailors which questions to ask based on each model’s capability, making evals more efficient & reliable. 🧵
Paul Röttger
9mo
Ai2