Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
7mo
Had a blast at COLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
8mo
Multilingual Representation Workshop @ EMNLP 2026
Itay Itzhak @ COLM 🍁
Ever used a top-ranked LLM that just... felt wrong for you? You’re not alone. Instead of leaderboards, many of us turn to "vibe-testing" - manually comparing models to our own needs. But can we turn these feelings into a structured evaluation? New paper: "From Feelings to Metrics" 🧵
The anatomy of a vibe-test: 👉 Input: Users adapt task complexity and context to mimic their daily life. 👉 Output: The judge is the user. What counts as "clear" or "good tone" is defined by the individual’s perspective, not a static definition.
1mo
1mo
Results: standard evaluation masks model quality. 🎭 Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., 9% to 94%). This shows that superior models are better for all users when evaluation is *user-aware*. 📈
1mo
Can we scale it? We built a pipeline mirroring the structure of vibe-testing: 👤 Profile: Turns user descriptions into structured profiles. ✍️ Rewrite: Personalizes prompts for specific contexts. ⚖️ Judge: Compares models using user-defined criteria. Automation meets personalization. 🤖✨
1mo
Itay Itzhak @ COLM 🍁
What does "vibe-testing" actually look like? 🕵️‍♂️ We analyzed public reports of vibe-tests, from YouTube to Reddit, to see how users evaluate models. Our analysis reveals recurring patterns in vibe-testing, allowing us to formalize it as a structured evaluation practice.
Itay Itzhak @ COLM 🍁