Most nets use He/LeCun init with a single LR η. As width m→∞, theory says η∈O(1/m)⟹Kernel; η∈ω(1/m)⟹Unstable. Thus max stable LR∝1/m. Practice violates this. Optimal LRs are larger (e.g. ∝1/√m) & models admit feature learning; contradicts kernel predictions. Why? (2/10)
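
To make the two scaling rules concrete, here is a minimal sketch of how a learning rate tuned at one width would be rescaled under each rule. The function name `lr_for_width`, the base LR of 0.1, and the base width of 256 are all hypothetical choices for illustration, not values from the thread.

```python
import math


def lr_for_width(base_lr: float, base_width: int, width: int, rule: str) -> float:
    """Rescale a learning rate tuned at base_width to a new width.

    rule="kernel": eta proportional to 1/m (the max stable LR predicted by theory).
    rule="sqrt":   eta proportional to 1/sqrt(m) (the larger scaling cited as an
                   example of what works in practice).
    """
    ratio = base_width / width
    if rule == "kernel":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


if __name__ == "__main__":
    # Hypothetical baseline: LR 0.1 tuned at width 256.
    for m in (256, 1024, 4096):
        kernel_lr = lr_for_width(0.1, 256, m, "kernel")
        sqrt_lr = lr_for_width(0.1, 256, m, "sqrt")
        print(f"m={m:5d}  1/m rule: {kernel_lr:.5f}  1/sqrt(m) rule: {sqrt_lr:.5f}")
```

The gap between the two columns widens with m: at 16x the base width, the 1/√m rule gives a learning rate 4x larger than the 1/m rule.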
6mo
This may explain the practical success of CE over MSE! CE admits larger LRs → richer feature learning. MSE is restricted to Lazy regime. Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
Leena C Vankadara