Inlay

Survey-style tests developed for humans may not predict how LLMs actually behave. Our #EACL2026 paper shows they can even be misleading when measuring racism and sexism! Check out the paper 👇🏼

Are you using survey-style questionnaires designed for humans to measure characteristics of LLMs? In our #EACL2026 paper, we evaluate both the reliability and validity of such tests and find that their scores do not reflect real-world model behavior. In fact, they can be deceptive! 🧵1/3