Huge thanks to all my amazing collaborators: Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas, @abosselut.bsky.social
Paper: arxiv.org/abs/2604.03374
Benchmark: huggingface.co/datasets/mis...
Project: mete.is/cresowlve
#NLP #LLM #AIResearch #Benchmark #Creativity
Mete
And it's not just *what* you know; it's *how* you think.
72% of puzzles require lateral thinking. Many involve analogy-making, abstraction, metaphors, jokes, and puns. Most questions combine 2+ creative reasoning strategies.
These aren't your typical trivia questions.
CresOWLve spans 34 knowledge domains, from Literature to Astronomy to Art, covering 2,061 carefully curated puzzles across more than 26 cultures.
Solving them demands *connecting facts across domains in non-obvious ways*.
LLMs can retrieve knowledge, but can they connect it in *creative* ways to solve problems?
Introducing CresOWLve 🦉, a new benchmark that evaluates creative problem-solving over real-world knowledge, using puzzles that require multiple creative thinking strategies.
LLM performance?
Non-thinking models under 30% (with CoT), most thinking models under 60%.
Models perform up to 17% worse on creative vs. factual questions.
Crucially, models *can* retrieve the relevant facts; they just fail to form the creative connection between them.
Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect se...