//
sign in
Post
by @danabra.mov
PostEmbed
by @danabra.mov
Record
by @jimpick.com
Record
by @atsui.org
+ new component
Post
This may explain the practical success of CE over MSE! CE admits larger LRs → richer feature learning. MSE is restricted to Lazy regime. Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
6mo
Leena C Vankadara