Most nets use He/LeCun init with a single LR η. As width m→∞, theory says η∈O(1/m) ⟹ kernel regime; η∈ω(1/m) ⟹ unstable. Thus the max stable LR ∝ 1/m. Practice violates this: optimal LRs are larger (e.g. ∝ 1/√m) and models admit feature learning, contradicting kernel predictions. Why? (2/10)
6mo
Leena C Vankadara
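
To make the gap between the two scalings in the post concrete, here is a minimal sketch that compares a learning rate scaled as 1/m with one scaled as 1/√m across a few widths. The function names, the example widths, and the constant `base_lr` are hypothetical choices for illustration, not values from the post, and the snippet only does the arithmetic; it does not train a network or test stability.

```python
import math

def lr_inverse_width(width, base_lr=1.0):
    """LR scaled as base_lr / m (the max stable scaling the theory predicts)."""
    return base_lr / width

def lr_inverse_sqrt_width(width, base_lr=1.0):
    """LR scaled as base_lr / sqrt(m) (the larger scaling the post says is seen in practice)."""
    return base_lr / math.sqrt(width)

# Hypothetical widths, chosen only to show how the gap grows with m.
for m in [256, 1024, 4096]:
    eta_theory = lr_inverse_width(m)
    eta_practice = lr_inverse_sqrt_width(m)
    # The ratio equals sqrt(m), so the mismatch widens as the network gets wider.
    print(f"m={m:5d}  1/m={eta_theory:.6f}  1/sqrt(m)={eta_practice:.6f}  "
          f"ratio={eta_practice / eta_theory:.1f}")
```

The ratio between the two learning rates is √m, so under these assumptions the practical learning rate exceeds the theoretical bound by a factor that itself diverges with width.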