Inlay

Inspired by this research, I unleashed OpenAI’s agentic mode on classic Zork. On the one hand, it’s impressive that it could play the game at all! On the other, once it figured out the basic gameplay, it got stuck looping in circles and randomly dropping items, like a clumsy lobotomized hamster.

LLMs suck at Zork

How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

"all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points"

In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure g...