Are you using survey-style questionnaires designed for humans to measure characteristics of LLMs?
In our #EACL2026 paper, we evaluate both the reliability and validity of such tests and find that their scores do not reflect real-world model behavior. In fact, they can be deceptive!
🧵1/3