These are the highest scores among models we have run on the recently released v2 dataset, though our runs of GPT Pro models are ongoing. Find all scores on our website. epoch.ai/frontiermat...

This project began in April, when OpenAI told us that an internal review had found more errors than expected. Note that OpenAI funded the development of Tiers 1–4 and has exclusive access to about 80% of it, with Epoch holding out the rest.

3d

Claude Fable 5 scores very well on FrontierMath: Tiers 1–4 (v2), reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of Anthropic models improving rapidly at math.

Simple calculation mistakes accounted for the vast majority of errors, typically made when the problem author was extracting the final answer. Examples include off-by-one errors and flipped signs. Some problem statements were also fatally ambiguous.

Following this, we conducted an independent audit. We used GPT-5.5 and Opus 4.7 to flag possible errors and then engaged mathematicians to review these flags. Almost all were determined to be real and severe errors that rendered the problems impossible to solve.

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

3d

The dataset is much improved. Still, given the complexity of FrontierMath solutions, we can’t be sure that we’ve caught all errors. We plan to conduct additional AI-assisted reviews periodically, using new frontier models, and will correct any additional errors we find.