//
sign in
Post
by @danabra.mov
PostEmbed
by @danabra.mov
Record
by @jimpick.com
Record
by @atsui.org
+ new component
Post
Results: standard evaluation mask model quality. 🎭 Top models like GPT-5.1 were initially less preferred by beginners, but their win rate skyrocketed on personalized tasks (e.g., 9% to 94%). This shows how superior models are better for all users when evaluation is *user-aware*. 📈
1mo
Itay Itzhak @ COLM 🍁