LLM performance?
- Non-thinking models under 30% (with CoT), most thinking models under 60%.
- Models perform up to 17% worse on creative vs. factual questions.
Crucially, models *can* retrieve the relevant facts; they just fail to form the creative connection between them.