Inlay

LLMs suck at Zork

How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

"all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points"

In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure g...