Nice paper on how well benchmarks cover what people do at work
arxiv.org/abs/2603.01203
Quote
"these observations suggest that agent benchmarking effort is driven less by alignment with real-world employment structure or economic value, and more by methodological convenience."
AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In thi...