Inlay

Most nets use He/LeCun init with a single LR η. As width m→∞, theory says η ∈ O(1/m) ⟹ kernel regime; η ∈ ω(1/m) ⟹ unstable. Thus max stable LR ∝ 1/m. Practice violates this: optimal LRs are larger (e.g. ∝ 1/√m) and models do learn features, contradicting kernel predictions. Why? (2/10)
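
A rough way to see where the 1/m bound comes from (a sketch, not the thread's own experiment): for a two-layer ReLU net with LeCun-style init and MSE loss, linearized gradient descent is stable only while η < 2n/λ_max(Θ), where Θ is the empirical NTK on n inputs. Since λ_max(Θ) grows roughly linearly in m, that bound shrinks like 1/m. The function name `max_stable_lr`, the toy sizes, and the random Gaussian inputs are all assumptions made for illustration.

```python
import numpy as np

def max_stable_lr(m, d=8, n=16, seed=0):
    """Kernel-regime LR bound 2n / lambda_max(NTK) for a width-m two-layer ReLU net.

    Assumes f(x) = w2 . relu(W1 x), LeCun-style init (variance 1/fan_in),
    and loss (1/2n) * sum (f - y)^2. Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))                  # toy inputs, shared across widths
    W1 = rng.standard_normal((m, d)) / np.sqrt(d)    # hidden weights, var 1/d
    w2 = rng.standard_normal(m) / np.sqrt(m)         # readout weights, var 1/m

    pre = X @ W1.T                                   # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                       # ReLU activations
    dact = (pre > 0).astype(float)                   # ReLU derivative

    # Empirical NTK = readout-gradient part + hidden-gradient part.
    k_out = act @ act.T
    k_in = ((dact * w2) @ (dact * w2).T) * (X @ X.T)
    lam_max = np.linalg.eigvalsh(k_out + k_in)[-1]   # eigvalsh sorts ascending
    return 2.0 * n / lam_max

if __name__ == "__main__":
    for m in (256, 1024, 4096):
        lr = max_stable_lr(m)
        # If the bound scales as 1/m, the product m * lr stays roughly constant.
        print(f"m={m:5d}  lr_bound={lr:.3e}  m*lr_bound={m * lr:.3f}")
```

This only reproduces the linearized (kernel) prediction; it says nothing about the larger learning rates that work in practice, which is the puzzle the thread goes on to address.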

6mo

This may explain the practical success of CE over MSE! CE admits larger LRs → richer feature learning, while MSE is restricted to the lazy regime. Validation: under µP (where both losses admit feature learning), the performance gap vanishes. MSE even seems to have an edge at scale! (7/10)

Leena C Vankadara