We further explore novel methods of comparing score distributions. 1. The mode loses even to other discrete methods such as the median or first percentile (and the gap grows when we use finer judgment granularity). 2. Incorporating risk aversion often improves performance.
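To make the comparison of score distributions concrete, here is a minimal sketch of two alternatives to the mode: a quantile (the median or the first percentile) and a risk-averse score. The mean-minus-`lam`-times-standard-deviation form of risk aversion, the helper names, and the two toy distributions are illustrative assumptions, not the paper's exact definitions.

```python
import math

def quantile_score(dist, q):
    """Smallest score whose cumulative probability reaches q (q=0.5 is the median)."""
    cum = 0.0
    for s in sorted(dist):
        cum += dist[s]
        if cum >= q - 1e-12:  # small tolerance for floating-point sums
            return s
    return max(dist)

def mean_score(dist):
    return sum(s * p for s, p in dist.items())

def risk_averse_score(dist, lam=0.5):
    """Hypothetical risk-averse utility: mean penalized by lam * standard deviation."""
    mean = mean_score(dist)
    var = sum(p * (s - mean) ** 2 for s, p in dist.items())
    return mean - lam * math.sqrt(var)

# Toy judgment distributions over scores 1..5 (made-up numbers).
safe = {1: 0.0, 2: 0.0, 3: 0.6, 4: 0.4, 5: 0.0}   # consistently decent
risky = {1: 0.3, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.7}  # usually great, sometimes terrible

for name, d in [("safe", safe), ("risky", risky)]:
    print(name, mean_score(d), quantile_score(d, 0.5),
          quantile_score(d, 0.01), risk_averse_score(d))
```

In this toy case the mean and the median prefer the risky response, while the first percentile and the risk-averse score prefer the safe one, so the choice of comparison method changes the ranking.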
For listwise ranking, although the mean does not improve accuracy, it drastically improves calibration. We also find that accuracy is maximized by directly predicting the list without an intermediate pairwise step, further underscoring the limitations of CoT for judgment.
Our findings stress the importance of leveraging the distributional output of LLM-as-a-judge, as opposed to using the text interface alone. Check out the full paper at arxiv.org/pdf/2503.03064!
For pairwise ranking, we compare pre- vs. post-aggregation (bottom vs. top in figure) of the two presentation orders' judgments. Small judges suffer from severe position bias, but pre-aggregation leverages the relative magnitudes of preference, raising accuracy from 56.7 to 73.1.
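A minimal sketch of the difference between the two aggregation orders, assuming the judge returns the probability that response A is preferred under each presentation order. The function names, the tie handling, and the example probabilities are hypothetical; the paper's exact procedure may differ.

```python
def decide(p_a):
    """Turn a probability that A is preferred into a discrete decision."""
    if p_a > 0.5:
        return "A"
    if p_a < 0.5:
        return "B"
    return "tie"

def post_aggregate(p_a_shown_first, p_a_shown_second):
    """Decide separately for each presentation order, then combine the decisions."""
    first, second = decide(p_a_shown_first), decide(p_a_shown_second)
    return first if first == second else "tie"

def pre_aggregate(p_a_shown_first, p_a_shown_second):
    """Average the two judgment distributions first, then decide once."""
    return decide((p_a_shown_first + p_a_shown_second) / 2)

# A position-biased judge: strongly prefers A when A is shown first,
# weakly prefers B when B is shown first (made-up numbers).
p_first, p_second = 0.90, 0.45

print(post_aggregate(p_first, p_second))
print(pre_aggregate(p_first, p_second))
```

Post-aggregation sees two conflicting decisions and cannot break the tie, whereas pre-aggregation keeps the information that the preference for A was much stronger than the preference for B.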
1. Taking the mean of the judgment distribution consistently outperforms taking the mode, across the pointwise, pairwise, and listwise settings.
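As a rough illustration of mean versus mode, the sketch below builds a judgment distribution from the log-probabilities a judge assigns to the score tokens 1 to 5, then reads off both statistics. The helper names and the numbers are made up for illustration, and how the log-probabilities are obtained depends on the model API.

```python
import math

def judgment_distribution(logprobs):
    """Normalize log-probabilities over score tokens into a probability distribution."""
    top = max(logprobs.values())
    weights = {s: math.exp(lp - top) for s, lp in logprobs.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def mode_score(dist):
    """Most likely score, i.e. what greedy decoding of the text output would return."""
    return max(dist, key=dist.get)

def mean_score(dist):
    """Expected score under the judgment distribution."""
    return sum(s * p for s, p in dist.items())

# Hypothetical score-token log-probs for two responses to the same prompt.
resp_a = judgment_distribution(
    {1: math.log(0.05), 2: math.log(0.10), 3: math.log(0.45),
     4: math.log(0.30), 5: math.log(0.10)})
resp_b = judgment_distribution(
    {1: math.log(0.20), 2: math.log(0.25), 3: math.log(0.40),
     4: math.log(0.10), 5: math.log(0.05)})

print(mode_score(resp_a), mode_score(resp_b))
print(mean_score(resp_a), mean_score(resp_b))
```

Both responses share the same mode, so the mode cannot separate them, while the mean uses the remaining probability mass to rank the first response higher.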
LLM judges have become ubiquitous, but valuable signal is often ignored at inference. We analyze design decisions for leveraging judgment distributions from LLM-as-a-judge: 🧵 (w/ Michael J.Q. Zhang, @eunsol.bsky.social)
2. Chain-of-thought prompting leads to sharp judgment distributions. Removing it increases the spread of the distribution, often improving performance, and more so for the mean than for the mode, revealing the synergy between eliciting and using distributional judgment.