Model preference is a two-way street: it depends on both the model’s capability and the user’s perspective.
By bridging the gap between benchmarks and real-world vibe-testing, we can evaluate AI the way humans actually use it.
arxiv.org/abs/2604.14137
technion-cs-nlp.github.io/vibe-testin...
A paper on vibe-testing and personalized LLM evaluation, showing that personalization can change which model users prefer.