jamesbrandecon.github.io/blog/posts_h...
So, after having GPT-5.4 agents review 4,800 proofs and GPT-5.5 re-review a random subset of them, what did we find? To my absolute surprise, the rate of errors in proofs has, if anything, increased(!) in 2025 relative to previous years.