So many people, CS researchers included, think that you can explore how an LLM works by simply asking it to tell you what it is doing or "thinking". Here @jennhu.bsky.social provides an excellent illustration of how that approach fails even at the most basic level.

To researchers doing LLM evaluation: prompting is *not a substitute* for direct probability measurements. Check out the camera-ready version of our work, to appear at EMNLP 2023! (w/ @rplevy.bsky.social) Paper: arxiv.org/abs/2305.13264 Original thread: twitter.com/_jennhu/stat...
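For readers unsure what a "direct probability measurement" looks like in practice, here is a minimal sketch. It is not the paper's code: the logits are made-up toy numbers over a hypothetical three-token vocabulary, and a real evaluation would read them from the model's output layer at each position. The idea is to score each candidate string by summing the log-probabilities the model assigns to its tokens, rather than asking the model in a prompt which string it prefers.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_logprob(step_logits, token_ids):
    """Sum of log P(token_t | prefix) over the positions of one candidate.

    step_logits[t] must be the logits produced after the candidate's own
    prefix up to position t, so each candidate needs its own list.
    """
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))

# Hypothetical toy values: two candidate continuations of the same context,
# each with the per-step logits a model might produce along that path.
candidate_a = {"tokens": [0, 1], "logits": [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]}
candidate_b = {"tokens": [2, 0], "logits": [[2.0, 0.5, -1.0], [0.4, 0.2, 0.9]]}

score_a = sequence_logprob(candidate_a["logits"], candidate_a["tokens"])
score_b = sequence_logprob(candidate_b["logits"], candidate_b["tokens"])

# The measured preference is whichever candidate has the higher log-probability.
preferred = "a" if score_a > score_b else "b"
print(f"log P(a) = {score_a:.3f}, log P(b) = {score_b:.3f}, preferred: {preferred}")
```

The comparison uses only the model's own output distribution, so it does not depend on how well the model can describe its preferences when asked.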