Survey-style tests developed for humans may not predict how LLMs actually behave.
Our #EACL2026 paper shows they can even be misleading when measuring racism and sexism!
Check out the paper 👇🏼
Marlene Lutz
Are you using survey-style questionnaires designed for humans to measure characteristics of LLMs?
In our #EACL2026 paper, we evaluate both the reliability and validity of such tests and find that their scores do not reflect real-world model behavior. In fact, they can be deceptive!
🧵 1/3