2/ Model A may beat model B on average, yet still lose to model B when judged by the minimum over several tasks. I wrote a brief blog post on this (a good time to announce that I started a Substack!). shuvom.substack.com/p/revenge-of...
Or maybe, revenge of the 1st quantile. What common AI benchmarking discourse misses. shuvom.substack.com
2mo
Revenge of the Worst Case
Shuvom Sadhuka
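A minimal sketch of the mean-versus-minimum point in 2/, in Python. The model names and per-task scores are made up for illustration and do not come from the post or any real benchmark.

```python
from statistics import mean

# Hypothetical per-task accuracies (invented for illustration only).
scores = {
    "model_a": [0.95, 0.90, 0.40],  # strong on two tasks, weak on one
    "model_b": [0.75, 0.70, 0.65],  # unremarkable but consistent
}

def summarize(task_scores):
    """Return the average-case and worst-case score across tasks."""
    return {"mean": mean(task_scores), "min": min(task_scores)}

summary = {name: summarize(s) for name, s in scores.items()}

# Rank the models under each criterion.
best_by_mean = max(summary, key=lambda m: summary[m]["mean"])
best_by_min = max(summary, key=lambda m: summary[m]["min"])

print(best_by_mean)  # model_a
print(best_by_min)   # model_b
```

With these numbers, model A wins on the mean while model B wins on the minimum, so the ranking flips depending on which summary statistic is reported.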
1/ CS majors are drilled to think about "worst-case" performance of algorithms. By contrast, much of the discourse on AI evals focuses on average-case or best-case (e.g. LLM X can solve IMO problems). Maybe one key to "reliability" is certifying the 1st quantile of outputs too, not just the mean.
2mo
Shuvom Sadhuka
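One way to read "certifying the 1st quantile" in 1/ is to report a low quantile of per-output scores next to the mean. The sketch below assumes that reading (the 1st percentile) and uses invented scores; `empirical_quantile` is a hypothetical helper, not something from the post.

```python
import math
from statistics import mean

def empirical_quantile(values, q):
    """Smallest observed value v such that at least a fraction q of values are <= v."""
    if not values or not 0 < q <= 1:
        raise ValueError("need a non-empty sample and 0 < q <= 1")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank of the quantile
    return ordered[rank - 1]

# Hypothetical quality scores for 100 sampled outputs of a single model:
# most outputs are good, a few are failures.
output_scores = [0.9] * 97 + [0.1] * 3

average = mean(output_scores)
low_tail = empirical_quantile(output_scores, 0.01)

print(f"mean = {average:.3f}, 1st percentile = {low_tail:.1f}")
```

In this made-up sample the mean looks healthy while the low quantile exposes the rare failures, which is the gap between average-case and worst-case reporting that the thread describes.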