Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
Introduces a benchmark that varies the language of the supporting evidence while keeping questions and answers in English.
arxiv.org/abs/2606.15345
Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely ass...