Real user queries often look different from the clean, concise ones in academic benchmarks - ambiguity, full of typos, and much less readable.
We show that even strong RAG systems quickly break under these conditions.
Awesome project led by
@neelbhandari.bsky.social and @tianyucao.bsky.social!!
Akari Asai
1/๐จ ๐ก๐ฒ๐ ๐ฝ๐ฎ๐ฝ๐ฒ๐ฟ ๐ฎ๐น๐ฒ๐ฟ๐ ๐จ
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?
We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline ๐งต