The anatomy of a vibe-test: 👉 Input: Users adapt task complexity and context to mimic their daily life. 👉 Output: The judge is the user. What counts as "clear" or "good tone" is defined by the individual’s perspective, not a static definition.
1mo
Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
7mo
Itay Itzhak @ COLM 🍁
Ever used a top-ranked LLM that just... felt wrong for you? You’re not alone. Instead of leaderboards, many of us turn to "vibe-testing" - manually comparing models to our own needs. But can we turn these feelings into a structured evaluation? New paper: "From Feelings to Metrics" 🧵
Had a blast at COLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
What does "vibe-testing" actually look like? 🕵️‍♂️ We analyzed public reports of vibe-tests, from YouTube to Reddit, to see how users evaluate models. Our analysis reveals recurring patterns, allowing us to formalize "vibe-testing" as a structured evaluation practice.
In Rio for #ICLR2026 🇧🇷 and already had my first açaí! 🍧 Come chat LLM safety and evaluation, and stop by our ManagerBench poster (w/ @adisimhi)! - Tomorrow (Friday) @ 10:30 - Poster Session 3, Pavilion 4
Model preferences are a two-way street between the model’s capability and the user’s perspective. By bridging the gap between benchmarks and real-world vibe-testing, we can evaluate AI the way humans actually use it. arxiv.org/abs/2604.14137 technion-cs-nlp.github.io/vibe-testin...
Results: standard evaluations mask model quality. 🎭 Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., 9% to 94%). This shows how superior models are better for all users when evaluation is *user-aware*. 📈
Why do we "vibe-test" and ignore leaderboards? We ran a survey to find out. Our findings: ❌ 86% said they’ve used a model that "felt" significantly better (or worse) than its reported scores. ✅ 82% of you are "vibe-testing" models through direct interaction.
Can we scale it? We built a pipeline mirroring vibe-testing structure: 👤 Profile: Turns user descriptions into structured profiles. ✍️ Rewrite: Personalizes prompts for specific contexts. ⚖️ Judge: Compares models using user-defined criteria. Automation meets personalization. 🤖✨
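For readers who want a feel for the Profile, Rewrite, Judge structure, here is a rough toy sketch in Python. It is not the paper's implementation: every name in it (`UserProfile`, `build_profile`, `rewrite_prompt`, `toy_judge`, `win_rate`, the two stub models) is hypothetical, and the keyword rules stand in for what would presumably be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserProfile:
    """Hypothetical structured profile derived from a free-text description."""
    description: str
    expertise: str
    criteria: List[str]


def build_profile(description: str) -> UserProfile:
    # Profile step (toy): a keyword rule stands in for an LLM-based profiler.
    if "beginner" in description.lower():
        return UserProfile(description, "beginner", ["clarity", "step-by-step detail"])
    return UserProfile(description, "expert", ["conciseness", "technical depth"])


def rewrite_prompt(prompt: str, profile: UserProfile) -> str:
    # Rewrite step (toy): append the user's context to a generic prompt.
    wants = ", ".join(profile.criteria)
    return f"{prompt}\n\nContext: I am a {profile.expertise}. I care about: {wants}."


def toy_judge(answer_a: str, answer_b: str, profile: UserProfile) -> str:
    # Judge step (toy): beginners prefer the longer answer, experts the shorter.
    # Ties go to "b". A real judge would apply the user-defined criteria.
    if profile.expertise == "beginner":
        return "a" if len(answer_a) > len(answer_b) else "b"
    return "a" if len(answer_a) < len(answer_b) else "b"


Model = Callable[[str], str]
Judge = Callable[[str, str, UserProfile], str]


def win_rate(model_a: Model, model_b: Model, prompts: List[str],
             profile: UserProfile, judge: Judge) -> float:
    """Fraction of personalized prompts on which model_a is preferred."""
    if not prompts:
        raise ValueError("need at least one prompt")
    wins = 0
    for prompt in prompts:
        personalized = rewrite_prompt(prompt, profile)
        if judge(model_a(personalized), model_b(personalized), profile) == "a":
            wins += 1
    return wins / len(prompts)


# Stub models (hypothetical): one adapts to the stated context, one does not.
def adaptive_model(prompt: str) -> str:
    if "beginner" in prompt:
        return "Step 1: make a list. Step 2: call sorted() on it. Step 3: print it."
    return "Use sorted()."


def terse_model(prompt: str) -> str:
    return "Use sorted()."


profile = build_profile("I'm a beginner learning Python")
rate = win_rate(adaptive_model, terse_model, ["How do I sort a list?"], profile, toy_judge)
```

The point of the sketch is only the shape: the same base prompt is personalized per user before both models answer, and the comparison is made against that user's criteria rather than a fixed rubric.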
1mo
Multilingual Representation Workshop @ EMNLP 2026