Great to see clarification comments. o3 is impressive nonetheless.
Played around with o1 and the "thinking" Gemini model. The CoT output (for Gemini) can be confusing and convoluted, but it got 3/5 problems right. Stopped on the remaining 2.
These models are an impressive interpretability test bed.
Looks like Tesla's models sometimes confuse train tracks with road lanes.