Huge thanks to all my amazing collaborators: Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas, @abosselut.bsky.social 📄 Paper: arxiv.org/abs/2604.03374 🤗 Benchmark: huggingface.co/datasets/mis... 🌐 Project: mete.is/cresowlve #NLP #LLM #AIResearch #Benchmark #Creativity

2mo

Mete

And it's not just *what* you know — it's *how* you think. 72% of puzzles require lateral thinking. Many involve analogy-making, abstraction, metaphors, jokes, and puns. Most questions combine 2+ creative reasoning strategies.

These aren't your typical trivia questions. CresOWLve spans 34 knowledge domains — from Literature to Astronomy to Art — covering 2,061 carefully curated puzzles across more than 26 cultures. Solving them demands *connecting facts across domains in non-obvious ways* 🌍📚

LLMs can retrieve knowledge — but can they connect it in *creative* ways to solve problems? Introducing CresOWLve 🦉, a new benchmark that evaluates creative problem-solving over real-world knowledge, using puzzles that require multiple creative thinking strategies.👇

LLM performance? 📉 Non-thinking models under 30% (with CoT), most thinking models under 60%. 📉 Models perform up to 17% worse on creative vs. factual questions. Crucially, models *can* retrieve the relevant facts — they just fail to form the creative connection between them.

Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect se...

arxiv.org

CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge