🚨 New Pre-Print! 🚨 Reviewer 2 has once again asked for DL’19. What can you say in rebuttal? To help, we have re-annotated DL’19. Work done with @maik_froebe.bsky.social, @hscells.bsky.social, @fschlatt1.bsky.social, Guglielmo Faggioli, Saber Zerhoudi, @macavaney.bsky.social, Eugene Yang 🧵
Mar 3, 2025
We consider that a human represents a bound on performance under a subjective task such as determining relevance, as only a single intent is defined in each topic. We find that systems are either indistinguishable from humans or exceed humans as oracle rankers.
We then look downstream: what effect does re-annotation have on modern systems? We find that modern system comparisons are increasingly unstable on DL’19: the pairwise ordering of systems remains unstable even when their measured nDCG values are far apart.
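For anyone who wants to poke at this kind of stability check themselves, here is a minimal sketch. It is not the paper's code: the runs and labels are invented toy data, and using linear-gain nDCG with Kendall's tau to compare system orderings is an assumption about how one might measure it.

```python
import math
from itertools import combinations

def ndcg(ranking, qrels, k=10):
    """nDCG@k (linear gain) for one ranked list of doc ids against graded labels."""
    dcg = sum(qrels.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranking[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts over the same systems."""
    concordant = discordant = 0
    for s, t in combinations(sorted(scores_a), 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0

# Hypothetical toy data: three runs for a single query, judged twice (0-3 grades).
runs = {
    "sys_a": ["d1", "d2", "d3", "d4"],
    "sys_b": ["d2", "d1", "d4", "d3"],
    "sys_c": ["d4", "d3", "d2", "d1"],
}
original = {"d1": 3, "d2": 2, "d3": 0, "d4": 0}
reannotated = {"d1": 0, "d2": 2, "d3": 1, "d4": 3}

orig_scores = {name: ndcg(run, original) for name, run in runs.items()}
new_scores = {name: ndcg(run, reannotated) for name, run in runs.items()}

# tau = 1.0 means identical system ordering, -1.0 means fully reversed.
tau = kendall_tau(orig_scores, new_scores)
print(orig_scores, new_scores, tau)
```

In this deliberately extreme toy case the two label sets reverse the system ordering entirely; with real qrels you would average nDCG over all topics before comparing orderings.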
We look into causes of disagreement, finding that subtle differences in query intent, even when relevance is well defined, can lead to greater disagreement in 4-grade relevance. However, we find that it is challenging to agree on what is relevant even under a fixed narrative.
This work was devised at the ECIR collab-a-thon last year, and we hope to continue discussions at this year's collab-a-thon in Lucca! Read more here: arxiv.org/abs/2502.20937 #ECIR2025 #SIGIR2025
Re-annotation is commonly performed to validate how variations in relevance judgements affect our ability to discriminate between retrieval systems. We validate hypotheses on stability, but in a modern setting: no narratives, 4-grade relevance, and a neural pool.
The fundamental property of Cranfield-style evaluations, that system rankings are stable even when assessors disagree on individual relevance decisions, was validated on traditional test collections. ...
arxiv.org
Variations in Relevance Judgments and the Shelf Life of Test Collections