We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai

Epoch AI

These are the highest scores among models we have run on the recently released v2 dataset, though our runs of GPT Pro models are ongoing. Find all scores on our website. epoch.ai/frontiermat...

12h

The dataset is much improved. Still, given the complexity of FrontierMath solutions, we can’t be sure that we’ve caught all errors. We plan to conduct additional AI-assisted reviews periodically, using new frontier models, and will correct any additional errors we find.

Claude Fable 5 scores very well on FrontierMath: Tiers 1–4 (v2), reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of Anthropic models improving rapidly at math.

This project began in April when OpenAI shared with us that they had found more errors than expected when conducting an internal review. Note that OpenAI funded the development of Tiers 1–4 and has exclusive access to about 80% of it, with Epoch holding out the rest.

Following this, we conducted an independent audit. We used GPT-5.5 and Opus 4.7 to flag possible errors and then engaged mathematicians to review these flags. Almost all were determined to be real and severe errors that rendered the problems impossible to solve.

Simple calculation mistakes accounted for the vast majority of errors, typically made when the problem author was extracting the final answer. These include things like off-by-one errors and flipped signs. Some problem statements were also fatally ambiguous.

We also removed 5 problems (2%) from Tiers 1–3 and 7 (15%) from Tier 4. These had more fundamental flaws that we didn’t believe were worth repairing. The higher removal rate for Tier 4 reflects the greater complexity of its problems.

13h

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

12h

13h

FrontierMath: Tiers 1–4 is now approaching saturation. We believe the future of math benchmarking lies in open problems drawn from real research, like those we’ve collected in FrontierMath: Open Problems. epoch.ai/frontiermat...

13h

We’ve backfilled FrontierMath: Tiers 1–4 (v2) scores for a selection of notable models, including recent Claude Opus models. You can find these on our website. We will add scores for Claude Fable 5 and GPT Pro models shortly. epoch.ai/frontiermat...

13h
