This may explain the practical success of CE over MSE!
CE admits larger LRs → richer feature learning. MSE is restricted to the lazy regime.
Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
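
A rough illustration of one reason CE might tolerate larger LRs than MSE. This is a toy sketch, not the thread's actual analysis. It assumes the relevant difference is how the loss gradient with respect to the logits behaves as the logits grow. The function names and the random logits are made up for the example. With CE the logit gradient is softmax(z) - y, whose norm never exceeds sqrt(2). With MSE it is z - y, which grows with the logits.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_logit_grad(z, y):
    # Gradient of cross-entropy(softmax(z), y) with respect to the logits z.
    return softmax(z) - y

def mse_logit_grad(z, y):
    # Gradient of 0.5 * ||z - y||^2 with respect to the outputs z.
    return z - y

def grad_norms(scales, num_classes=10, seed=0):
    # Hypothetical setup: fixed random logits, rescaled to mimic logit blow-up.
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(num_classes)
    y = np.eye(num_classes)[3]  # arbitrary one-hot target
    rows = []
    for s in scales:
        z = s * base
        rows.append((s,
                     np.linalg.norm(ce_logit_grad(z, y)),
                     np.linalg.norm(mse_logit_grad(z, y))))
    return rows

if __name__ == "__main__":
    for s, ce, mse in grad_norms([1, 10, 100, 1000]):
        print(f"scale={s:>5}  |grad CE|={ce:.3f}  |grad MSE|={mse:.1f}")
```

The CE column stays bounded however large the logits get, while the MSE column scales with them. That fits the picture of CE surviving larger steps, but it says nothing by itself about feature learning or the µP result.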