There are claims that DeepSeek and other Chinese models are 'better' than US models like GPT and Claude despite being smaller.
It's worth noting that there is a live debate in AI circles about how to benchmark LLMs, and many say that the evals do not reflect real-world usefulness well. #r2rconf