We are a research institute investigating the trajectory of AI for the benefit of society. epoch.ai
Epoch AI
These are the highest scores among models we have run on the recently released v2 dataset, though our runs of GPT Pro models are ongoing. Find all scores on our website. epoch.ai/frontiermat...
The dataset is much improved. Still, given the complexity of FrontierMath solutions, we can’t be sure that we’ve caught all errors. We plan to conduct additional AI-assisted reviews periodically, using new frontier models, and will correct any additional errors we find.
Claude Fable 5 scores very well on FrontierMath: Tiers 1–4 (v2), reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of Anthropic models improving rapidly at math.
This project began in April when OpenAI shared with us that they had found more errors than expected when conducting an internal review. Note that OpenAI funded the development of Tiers 1–4 and has exclusive access to about 80% of it, with Epoch holding out the rest.
Following this, we conducted an independent audit. We used GPT-5.5 and Opus 4.7 to flag possible errors and then engaged mathematicians to review these flags. Almost all were determined to be real and severe errors that rendered the problems impossible to solve.
Simple calculation mistakes accounted for the vast majority of errors. Problem authors typically made them when extracting the final answer. These include things like off-by-one errors and flipped signs. Some problem statements were also fatally ambiguous.
We also removed 5 problems (2%) from Tiers 1–3 and 7 (15%) from Tier 4. These had more fundamental flaws that we didn’t believe were worth repairing. The higher removal rate for Tier 4 reflects the greater complexity of its problems.
FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.
FrontierMath: Tiers 1–4 is now approaching saturation. We believe the future of math benchmarking lies in open problems drawn from real research, like those we’ve collected in FrontierMath: Open Problems. epoch.ai/frontiermat...
We’ve backfilled FrontierMath: Tiers 1–4 (v2) scores for a selection of notable models, including recent Claude Opus models. You can find these on our website. We will add scores for Claude Fable 5 and GPT Pro models shortly. epoch.ai/frontiermat...