Inlay

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Introduces a benchmark that varies the language of supporting evidence while keeping questions and answers in English. 📝 arxiv.org/abs/2606.15345

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely ass...