🚨 New Pre-Print! 🚨 Reviewer 2 has once again asked for DL’19. What can you say in rebuttal? To help, we have re-annotated DL’19. Work done with @maik_froebe.bsky.social, @hscells.bsky.social, @fschlatt1.bsky.social, Guglielmo Faggioli, Saber Zerhoudi, @macavaney.bsky.social, Eugene Yang 🧵
We treat a human as a bound on performance in a subjective task such as determining relevance, since only a single intent is defined in each topic. We find that systems are either indistinguishable from humans or exceed humans as oracle rankers.
We then look downstream: what effect does re-annotation have on modern systems? We find that modern system comparisons are increasingly unstable on DL’19, meaning that the pair-wise ordering of systems remains unstable when their measured nDCG values are far apart.
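For anyone wanting to try this kind of check on their own runs, here is a minimal sketch of pair-wise ordering agreement between two sets of judgements. It is not the paper's code: the function name, the system names, and the nDCG values are all hypothetical and only illustrate the idea.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of system pairs ordered the same way under both judgement sets.

    scores_a / scores_b map a system name to its mean nDCG under the
    original and the re-annotated judgements (hypothetical inputs).
    """
    systems = sorted(scores_a)
    agree = 0
    total = 0
    for s, t in combinations(systems, 2):
        diff_a = scores_a[s] - scores_a[t]
        diff_b = scores_b[s] - scores_b[t]
        if diff_a == 0 or diff_b == 0:
            continue  # skip ties: they carry no ordering information
        total += 1
        agree += (diff_a > 0) == (diff_b > 0)
    return agree / total if total else float("nan")


# Made-up nDCG values, purely for illustration.
original = {"bm25": 0.50, "dense": 0.65, "cross_encoder": 0.72}
reannotated = {"bm25": 0.48, "dense": 0.70, "cross_encoder": 0.66}

agreement = pairwise_agreement(original, reannotated)
print(f"pair-wise agreement: {agreement:.2f}")
```

In this toy case two of the three pairs keep their order and one swaps. A real analysis would also test each pair for statistical significance rather than compare raw means.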
We look into the causes of disagreement and find that subtle differences in query intent, even when relevance is well defined, can lead to greater disagreement in 4-grade relevance. However, agreeing on what is relevant remains challenging even under a fixed narrative.
This work was devised at the ECIR collab-a-thon last year, and we hope to continue discussions at this year's collab-a-thon in Lucca! Read more here: arxiv.org/abs/2502.20937 #ECIR2025 #SIGIR2025
Re-annotation is commonly performed to validate how variations in relevance judgements affect our ability to discriminate between retrieval systems. We validate hypotheses on stability in a modern setting: no narratives, 4-grade relevance, and a neural pool.
The fundamental property of Cranfield-style evaluations, that system rankings are stable even when assessors disagree on individual relevance decisions, was validated on traditional test collections. ...