The anatomy of a vibe-test: 👉 Input: Users adapt task complexity and context to mimic their daily life. 👉 Output: The judge is the user. What counts as "clear" or "good tone" is defined by the individual’s perspective, not a static definition.

1mo

Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.

7mo

Itay Itzhak @ COLM 🍁

Ever used a top-ranked LLM that just... felt wrong for you? You’re not alone. Instead of leaderboards, many of us turn to "vibe-testing" - manually comparing models to our own needs. But can we turn these feelings into a structured evaluation? New paper: "From Feelings to Metrics" 🧵

Had a blast at COLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!

What does "vibe-testing" actually look like? 🕵️‍♂️ We analyzed public reports of vibe-tests (from YouTube to Reddit) to see how users evaluate models. Our analysis reveals recurring patterns in vibe-testing, allowing us to formalize it as a structured evaluation practice.

In Rio for #ICLR2026 🇧🇷 and already had my first açaí! 🍧 Come chat LLM safety and evaluation, and stop by our ManagerBench poster (w/ @adisimhi)! - Tomorrow (Friday) @ 10:30 - Poster Session 3, Pavilion 4

Model preferences are a two-way street between the model’s capability and the user’s perspective. By bridging the gap between benchmarks and real-world vibe-testing, we can evaluate AI the way humans actually use it. arxiv.org/abs/2604.14137 technion-cs-nlp.github.io/vibe-testin...

Results: standard evaluations mask model quality. 🎭 Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., 9% to 94%). This shows that superior models are better for all users when evaluation is *user-aware*. 📈

1mo

8mo

1mo

Why do we "vibe-test" and ignore leaderboards? We ran a survey to find out. Our findings: ❌ 86% said they’ve used a model that "felt" significantly better (or worse) than its reported scores. ✅ 82% of you are "vibe-testing" models through direct interaction.

Can we scale it? We built a pipeline mirroring the structure of vibe-testing: 👤 Profile: Turns user descriptions into structured profiles. ✍️ Rewrite: Personalizes prompts for specific contexts. ⚖️ Judge: Compares models using user-defined criteria. Automation meets personalization. 🤖✨
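To make the Profile → Rewrite → Judge flow concrete, here is a toy sketch of the three stages. This is not the paper's implementation: every name in it (`UserProfile`, `build_profile`, `rewrite_prompt`, `judge`, `SCORERS`) is hypothetical, and simple keyword rules and word counts stand in for what would presumably be LLM calls in a real pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UserProfile:
    """Hypothetical structured profile: who the user is and what they value."""
    expertise: str
    criteria: list[str]


def build_profile(description: str) -> UserProfile:
    """Profile step (toy): keyword rules stand in for an LLM-based profiler."""
    text = description.lower()
    is_beginner = "beginner" in text or "new to" in text
    expertise = "beginner" if is_beginner else "expert"
    criteria = ["clarity"] if is_beginner else ["conciseness"]
    return UserProfile(expertise=expertise, criteria=criteria)


def rewrite_prompt(prompt: str, profile: UserProfile) -> str:
    """Rewrite step (toy): append the user's context to a generic prompt."""
    wanted = ", ".join(profile.criteria)
    return f"{prompt}\n(Context: the asker is a {profile.expertise}; prioritize {wanted}.)"


# Assumed stand-in scorers; higher is better.
SCORERS: dict[str, Callable[[str], float]] = {
    # Penalize each word longer than 8 characters.
    "clarity": lambda answer: -sum(len(word) > 8 for word in answer.split()),
    # Penalize total word count.
    "conciseness": lambda answer: -len(answer.split()),
}


def judge(answer_a: str, answer_b: str, profile: UserProfile) -> str:
    """Judge step (toy): compare two answers on the user's own criteria."""
    score_a = sum(SCORERS[c](answer_a) for c in profile.criteria)
    score_b = sum(SCORERS[c](answer_b) for c in profile.criteria)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


plain = "Recursion is when a function calls itself."
dense = "Recursion constitutes self-referential invocation semantics."

beginner = build_profile("I'm a beginner learning Python")
expert = build_profile("Senior engineer, ten years of backend work")

personalized = rewrite_prompt("Explain recursion", beginner)
beginner_pick = judge(plain, dense, beginner)
expert_pick = judge(plain, dense, expert)
```

In this sketch the same pair of answers gets a different winner depending on the profile, which is the point of making the judge user-aware. The scoring rules are deliberately crude and only illustrate the control flow.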

1mo

Multilingual Representation Workshop @ EMNLP 2026

1mo