This may explain the practical success of CE over MSE! CE admits larger LRs → richer feature learning, while MSE is restricted to the lazy regime. Validation: under µP (where both losses admit feature learning), the performance gap vanishes. MSE even seems to have an edge at scale! (7/10)