Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
Had a blast at COLM! It really was as good as everyone says. Congrats to the organizers 🎉
This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech.
If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
Multilingual Representation Workshop @ EMNLP 2026
Itay Itzhak @ COLM 🍁
Ever used a top-ranked LLM that just... felt wrong for you?
You’re not alone. Instead of leaderboards, many of us turn to "vibe-testing" - manually comparing models to our own needs. But can we turn these feelings into a structured evaluation?
New paper: "From Feelings to Metrics" 🧵
The anatomy of a vibe-test:
👉 Input: Users adapt task complexity and context to mimic their daily life.
👉 Output: The judge is the user. What counts as "clear" or "good tone" is defined by the individual’s perspective, not a static definition.
Results: standard evaluations mask model quality. 🎭
Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., 9% to 94%). This shows how superior models are better for all users when evaluation is *user-aware*. 📈
Can we scale it? We built a pipeline that mirrors the structure of a vibe-test:
👤 Profile: Turns user descriptions into structured profiles.
✍️ Rewrite: Personalizes prompts for specific contexts.
⚖️ Judge: Compares models using user-defined criteria.
Automation meets personalization. 🤖✨
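A minimal sketch of how the three stages above might fit together. It is not the paper's implementation: every name here (`UserProfile`, `build_profile`, `rewrite_prompt`, `judge`, the `key: value; ...` description format, and the toy scorer) is a hypothetical placeholder, and simple rule-based stand-ins are used where a real pipeline would presumably call a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UserProfile:
    """Structured view of a user (hypothetical fields, for illustration only)."""
    expertise: str
    context: str
    criteria: List[str] = field(default_factory=list)


def build_profile(description: str) -> UserProfile:
    """Profile stage: parse an assumed 'key: value; key: value' description."""
    fields = {}
    for part in description.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    criteria = [c.strip() for c in fields.get("criteria", "").split(",") if c.strip()]
    return UserProfile(
        expertise=fields.get("expertise", "unspecified"),
        context=fields.get("context", "unspecified"),
        criteria=criteria,
    )


def rewrite_prompt(prompt: str, profile: UserProfile) -> str:
    """Rewrite stage: attach the user's context to a generic prompt."""
    return f"[{profile.expertise} user; context: {profile.context}] {prompt}"


# A scorer maps (response, criterion) to a number; higher is better.
Scorer = Callable[[str, str], float]


def judge(resp_a: str, resp_b: str, profile: UserProfile, scorer: Scorer) -> str:
    """Judge stage: compare two responses on the user's own criteria."""
    score_a = sum(scorer(resp_a, c) for c in profile.criteria)
    score_b = sum(scorer(resp_b, c) for c in profile.criteria)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Usage with a toy scorer that only checks whether the criterion word appears.
def toy_scorer(response: str, criterion: str) -> float:
    return 1.0 if criterion in response else 0.0


profile = build_profile(
    "expertise: beginner; context: learning Python; criteria: clear, concise"
)
personalized = rewrite_prompt("Explain recursion.", profile)
verdict = judge("a clear and concise answer", "a long answer", profile, toy_scorer)
print(personalized)
print(verdict)
```

The point of the sketch is the data flow: the profile produced in the first stage feeds both the rewrite and the judge, so the same user definition shapes the input and the evaluation criteria.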
What does "vibe-testing" actually look like? 🕵️‍♂️
We analyzed public reports of vibe-tests, from YouTube to Reddit, to see how users evaluate models. The analysis reveals recurring patterns, which lets us formalize "vibe-testing" as a structured evaluation practice.