Real user queries often look different from the clean, concise ones in academic benchmarks: they are ambiguous, full of typos, and much less readable.
We show that even strong RAG systems quickly break under these conditions.
Awesome project led by
@neelbhandari.bsky.social and @tianyucao.bsky.social!!
These days RAG systems have gotten popular for boosting LLMs, but they're brittle. Minor shifts in phrasing (style, politeness, typos) can wreck the pipeline. Even advanced components don't fix the issue.
Check out this extensive eval by @neelbhandari.bsky.social and @tianyucao.bsky.social!
Akari Asai
1/ 🚨 New paper alert 🚨
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?
We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
Neel Bhandari
Akhila Yerukola