The anatomy of a vibe-test:
- Input: Users adapt task complexity and context to mimic their daily life.
- Output: The judge is the user. What counts as "clear" or "good tone" is defined by the individual's perspective, not a static definition.
Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year's MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
Itay Itzhak @ COLM
Ever used a top-ranked LLM that just... felt wrong for you?
You're not alone. Instead of leaderboards, many of us turn to "vibe-testing": manually comparing models against our own needs. But can we turn these feelings into a structured evaluation?
New paper: "From Feelings to Metrics" 🧵
Had a blast at COLM! It really was as good as everyone says, congrats to the organizers.
This week I'll be in New York giving talks at NYU, Yale, and Cornell Tech.
If you're around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
What does "vibe-testing" actually look like? 🕵️
We analyzed public reports of vibe-tests, from YouTube to Reddit, to see how users evaluate models. Our analysis reveals recurring patterns in vibe-testing, allowing us to formalize it as a structured evaluation practice.
In Rio for #ICLR2026 🇧🇷 and already had my first açaí!
Come chat LLM safety and evaluation, and stop by our ManagerBench poster (w/ @adisimhi)!
- Tomorrow (Friday) @ 10:30
- Poster Session 3, Pavilion 4
Model preferences are a two-way street between the model's capability and the user's perspective.
By bridging the gap between benchmarks and real-world vibe-testing, we can evaluate AI the way humans actually use it.
arxiv.org/abs/2604.14137
technion-cs-nlp.github.io/vibe-testin...
Results: standard evaluations mask model quality.
Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., from 9% to 94%). This shows that stronger models are better for all users when evaluation is *user-aware*.
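For readers who want to see what a "win rate" over pairwise comparisons looks like in practice, here is a minimal sketch. The function name `win_rate`, the `"A"`/`"B"`/`"tie"` labels, and the choice to drop ties from the denominator are all assumptions made for illustration; the thread does not say how the paper computes the metric.

```python
from typing import List


def win_rate(outcomes: List[str], model: str) -> float:
    """Fraction of decided comparisons won by `model`.

    `outcomes` holds one label per pairwise comparison: the winning
    model's name, or "tie". Ties are excluded here (an assumption).
    """
    decided = [o for o in outcomes if o != "tie"]
    if not decided:
        return 0.0  # no decided comparisons, so nothing to report
    return sum(o == model for o in decided) / len(decided)


if __name__ == "__main__":
    # Made-up labels purely to exercise the function, not real results.
    toy_outcomes = ["A", "B", "A", "tie"]
    print(win_rate(toy_outcomes, "A"))
```

Computing this once on generic prompts and once on personalized prompts, per user profile, is one way to surface the kind of gap described above.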
Why do we "vibe-test" and ignore leaderboards? We ran a survey to find out.
Our findings:
- 86% said they've used a model that "felt" significantly better (or worse) than its reported scores.
- 82% said they "vibe-test" models through direct interaction.
Can we scale it? We built a pipeline mirroring vibe-testing structure:
- Profile: Turns user descriptions into structured profiles.
- Rewrite: Personalizes prompts for specific contexts.
- Judge: Compares models using user-defined criteria.
Automation meets personalization.
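The three stages can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `UserProfile`, `build_profile`, `rewrite_prompt`, `judge`, and `toy_score` are hypothetical names, and the keyword matching stands in for whatever models the real pipeline presumably uses at each step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UserProfile:
    expertise: str
    context: str
    criteria: List[str] = field(default_factory=list)


# Hypothetical criteria vocabulary, chosen only for this sketch.
KNOWN_CRITERIA = ("clear", "concise", "friendly")


def build_profile(description: str) -> UserProfile:
    """Profile: turn a free-text description into a structured profile."""
    text = description.lower()
    expertise = "beginner" if "beginner" in text else "unspecified"
    criteria = [c for c in KNOWN_CRITERIA if c in text]
    return UserProfile(expertise=expertise, context=description, criteria=criteria)


def rewrite_prompt(prompt: str, profile: UserProfile) -> str:
    """Rewrite: attach the user's context to a generic prompt."""
    return f"(Audience: {profile.expertise}. Context: {profile.context}) {prompt}"


def judge(response_a: str, response_b: str, profile: UserProfile,
          score: Callable[[str, str], float]) -> str:
    """Judge: compare two responses on the user's own criteria."""
    total_a = sum(score(response_a, c) for c in profile.criteria)
    total_b = sum(score(response_b, c) for c in profile.criteria)
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


def toy_score(response: str, criterion: str) -> float:
    """Stand-in scorer: 1.0 if the criterion word appears in the response."""
    return 1.0 if criterion in response.lower() else 0.0


if __name__ == "__main__":
    profile = build_profile("I am a beginner and I want clear, friendly answers.")
    print(rewrite_prompt("Explain recursion.", profile))
    print(judge("A clear and friendly walkthrough.", "A terse reply.",
                profile, toy_score))  # prints "A"
```

The point of the structure is that the `score` function and the profile's `criteria` come from the user, so the same pair of responses can get a different verdict for a different person.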