Most nets use He/LeCun init with a single LR η. As width m→∞, theory says
η∈O(1/m) ⟹ Kernel; η∈ω(1/m) ⟹ Unstable.
Thus max stable LR≈1/m.
Practice violates this. Optimal LRs are larger (e.g. ≈1/√m) & models admit feature learning; contradicts kernel predictions. Why? (2/10)
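
A quick way to see where the 1/m prediction comes from (a minimal sketch, not the thread's own experiment). Assumptions, all mine: a two-layer ReLU net f(x) = v·relu(Wx), He init for W, output weights v ~ N(0, 1/m), a single training point, MSE loss, and linearized (kernel) dynamics. Under those assumptions the residual evolves as (1 − ηK)·(f − y), with K the tangent kernel at x, so GD is stable only for η < 2/K. The helper names `tangent_kernel` and `max_stable_lr` are hypothetical.

```python
import math
import random


def tangent_kernel(m, d=16, seed=0):
    """Empirical tangent kernel K(x, x) of f(x) = v . relu(W x) at one input.

    Assumed setup (illustrative only): W_ij ~ N(0, 2/d) (He init),
    v_i ~ N(0, 1/m), and x has +/-1 entries so that ||x||^2 = d.
    """
    rng = random.Random(seed)
    x = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    x_sq = float(d)
    w_std = math.sqrt(2.0 / d)
    v_std = math.sqrt(1.0 / m)

    k = 0.0
    for _ in range(m):
        pre = sum(rng.gauss(0.0, w_std) * xi for xi in x)  # pre-activation of unit i
        v = rng.gauss(0.0, v_std)                          # output weight of unit i
        h = max(pre, 0.0)
        k += h * h                 # |df/dv_i|^2, sums to roughly m over all units
        if pre > 0.0:
            k += v * v * x_sq      # ||df/dW_i||^2, sums to O(1) over all units
    return k


def max_stable_lr(m, **kwargs):
    """Largest LR for which linearized GD on MSE does not diverge: 2 / K."""
    return 2.0 / tangent_kernel(m, **kwargs)


for m in (256, 1024, 4096):
    lr = max_stable_lr(m)
    print(f"m={m:5d}  max stable LR={lr:.2e}  LR*m={lr * m:.2f}")
```

If the sketch is right, LR·m stays roughly constant as m grows, i.e. the linearized threshold shrinks like 1/m. It says nothing about why real training tolerates larger LRs, which is the question the thread goes on to ask.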
This may explain the practical success of CE over MSE!
CE admits larger LRs → richer feature learning. MSE is restricted to Lazy regime.
Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)