Following this, we conducted an independent audit. We used GPT-5.5 and Opus 4.7 to flag possible errors, then engaged mathematicians to review each flag. Almost all of the flagged issues were determined to be real and severe errors that rendered the problems impossible to solve.