Inlay

RL boosts LLM reasoning, but why stop at math & code? 🤔 Meet Nemotron-CrossThink, a method to scale RL-based self-learning across law, physics, social science & more. 🔥 The result is a model that reasons broadly, adapts dynamically, & uses 28% fewer tokens for correct answers! 🧵↓

Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 🧵1/9

Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!

🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724

On my way to #NAACL2025 where I'll give a keynote at the noisy text workshop (WNUT), presenting some of the challenges & methods for dialect NLP + also discussing dialect speakers' perspectives! 🗨️ Beyond “noisy” text: How (and why) to process dialect data 🗓️ Saturday, May 3, 9:30–10:30

🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict. (📷 xkcd)

May 1, 2025

Apr 29, 2025

Jun 9, 2025

🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences? With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵

Excited to announce our #NAACL2025 Oral paper! 🎉✨ We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP Vision-language encoders, 26 datasets, and 55 architectures!

Apr 29, 2025

Thrilled to share that this is out in @pnas.org today! 🎉 We show that linguistic generalization in language models can be due to underlying analogical mechanisms. Shoutout to my amazing co-authors @weissweiler.bsky.social, @davidrmortensen.bsky.social, Hinrich Schütze, and Janet Pierrehumbert!

Apr 29, 2025

May 9, 2025

Syeda Nahida Akter

Kwanghee Choi

Niyati Bafna

Lindia Tjuatja

Amanda Bertsch

Verena Blaschke