The AdamW videos compress to 1/6th the size of the Muon videos. Something AdamW does lets its crease visualisation compress well, while Muon's does not. This is the weirdest observation ever.
The weirdest observation: I generated movies visualizing the polytope boundaries of ReLU networks trained with Muon and with AdamW.
Same experiment, same data, same random seed. The difference is the "crease pattern" that the optimizers produce.
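For anyone wondering what a "crease pattern" looks like in practice, here is a minimal sketch of one way to draw it. This is not necessarily how the movies were made. It assumes a small 2D-input ReLU MLP with random weights as a stand-in for a trained network, and the helper names (`init_mlp`, `activation_pattern`, `crease_image`) are hypothetical. A pixel is marked as a crease when the on/off pattern of the hidden ReLU units differs from that of a neighbouring pixel, i.e. when a polytope boundary runs between them.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small ReLU MLP (stand-in for a trained network)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), 0.1 * rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def activation_pattern(params, x):
    """On/off state of every hidden ReLU unit, one row per input point."""
    bits = []
    h = x
    for W, b in params[:-1]:              # hidden layers only, skip the output layer
        z = h @ W + b
        bits.append(z > 0)
        h = np.maximum(z, 0.0)
    return np.concatenate(bits, axis=1)

def crease_image(params, lo=-2.0, hi=2.0, res=128):
    """Boolean image: True where the activation pattern changes between neighbours."""
    xs = np.linspace(lo, hi, res)
    gx, gy = np.meshgrid(xs, xs)
    pts = np.stack([gx.ravel(), gy.ravel()], axis=1)
    pat = activation_pattern(params, pts).reshape(res, res, -1)
    img = np.zeros((res, res), dtype=bool)
    # Compare each pixel with its right-hand neighbour and with the pixel below it.
    img[:, :-1] |= np.any(pat[:, :-1] != pat[:, 1:], axis=-1)
    img[:-1, :] |= np.any(pat[:-1, :] != pat[1:, :], axis=-1)
    return img

rng = np.random.default_rng(0)
params = init_mlp(rng, [2, 16, 16, 1])
img = crease_image(params)                # one frame; a movie repeats this per training step
```

Rendering one such frame per optimizer step and encoding the frames as video would give something comparable to the movies described here, under the assumptions above.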
The most insightful take on Mythos I've seen so far. Everyone should read this, but especially those who are currently thinking through the possible regulatory responses.
www.faz.net/premium/digi...
I wrote a FAZ guest article.
Confession time: I use agentic coding all day, every day. It makes me much more productive.
But I am also terrified of skill atrophy. I feel like I need to break out pen & paper to force myself to "weight-lift" mentally so I don't forget how to think.
How do y'all handle this?