Inlay

LLM performance? 📉 Non-thinking models score under 30% (with CoT), and most thinking models score under 60%. 📉 Models perform up to 17% worse on creative than on factual questions. Crucially, models *can* retrieve the relevant facts; they just fail to form the creative connection between them.