Summary: Practical nets do not approach kernel limits. Instead, they converge to a Feature Learning Limit. This offers a new lens: empirical quirks (like aggressive LR scaling) are not mere finite-width artefacts; they are faithful reflections of the true scaling limit. (9/10)
Early experiments suggest DL components like Adam & Norm layers also enable Controlled Divergence regimes. Caveat: Controlled Divergence can still cause overconfidence and floating-point instabilities (precision failure) at scale! (8/10)
In the Controlled Divergence regime, network outputs diverge (saturating to one-hot). Yet, all the other dynamical quantities such as the activations and gradients remain stable throughout training. This regime, however, does not exist under MSE. (5/10)
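One intuition for why diverging outputs can be benign under CE but not under MSE can be sketched numerically. This is an illustrative toy, not the paper's analysis: the CE gradient with respect to the logits is softmax(z) minus the one-hot label, which stays bounded however large the logits become, whereas the MSE gradient with respect to the outputs grows with the outputs themselves. The helper names and the example logits below are assumptions made for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_logit_grad(logits, label):
    # Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot(label).
    p = softmax(logits)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]

def mse_output_grad(outputs, label):
    # Gradient of 0.5 * ||f - onehot(label)||^2 w.r.t. the outputs f.
    return [o - (1.0 if i == label else 0.0) for i, o in enumerate(outputs)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical base outputs; scaling them up mimics diverging network outputs.
base = [2.0, -1.0, 0.5]
for scale in (1, 10, 100, 1000):
    outputs = [scale * z for z in base]
    ce = l2_norm(ce_logit_grad(outputs, label=1))    # label 1 is a wrong guess here
    mse = l2_norm(mse_output_grad(outputs, label=1))
    print(f"scale={scale:5d}  |CE grad|={ce:.4f}  |MSE grad|={mse:.1f}")
```

As the scale grows, the softmax saturates to one-hot and the CE gradient norm approaches at most √2, while the MSE gradient norm grows in proportion to the outputs.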
Under He/LeCun inits, theory implies Kernel OR Unstable regimes as width→∞. Discrepancies (e.g. feature learning) are seen as finite-width effects. Our #NeurIPS2025 spotlight refutes this: practical nets do not converge to kernel limits; feature learning persists as width→∞🧵
6mo
This may explain the practical success of CE over MSE! CE admits larger LRs → richer feature learning. MSE is restricted to the Lazy regime. Validation: Under µP (where both losses admit feature learning), performance gaps vanish. MSE even seems to have an edge at scale! (7/10)
Most nets use He/LeCun init with a single LR η. As width m→∞, theory says η∈O(1/m)⟹Kernel; η∈ω(1/m)⟹Unstable. Thus max stable LR ∝ 1/m. Practice violates this: optimal LRs are larger (e.g. ∝ 1/√m) & models admit feature learning, which contradicts kernel predictions. Why? (2/10)
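To make the gap between the two width scalings concrete, here is a minimal sketch that computes η at several widths under η ∝ 1/m and η ∝ 1/√m. The helper name `scaled_lr`, the base learning rate 0.1, and the base width 256 are assumptions chosen for illustration, not values from the paper.

```python
def scaled_lr(base_lr, base_width, width, exponent):
    # Learning rate scaled as width**(-exponent), anchored so that
    # scaled_lr(base_lr, base_width, base_width, e) == base_lr.
    return base_lr * (base_width / width) ** exponent

# Hypothetical reference point: eta = 0.1 tuned at width 256.
BASE_LR, BASE_WIDTH = 0.1, 256

for m in (256, 1024, 4096, 16384):
    kernel_lr = scaled_lr(BASE_LR, BASE_WIDTH, m, 1.0)  # eta ∝ 1/m
    sqrt_lr = scaled_lr(BASE_LR, BASE_WIDTH, m, 0.5)    # eta ∝ 1/sqrt(m)
    print(f"m={m:6d}  1/m: {kernel_lr:.6f}  1/sqrt(m): {sqrt_lr:.6f}  "
          f"ratio: {sqrt_lr / kernel_lr:.1f}")
```

The ratio between the two learning rates itself grows like √m, so the difference between the scalings widens as models get wider.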
At the edge of this regime (where η ∝ 1/√m), there exists a well-defined infinite-width limit where feature learning persists in all hidden layers. This Feature Learning Limit closely matches the behavior of optimally tuned finite-width networks under CE loss. (6/10)
We find this discrepancy persists even after accounting for finite-width effects due to Catapult/EOS, Large Depth, and Alignment Violations. In fact, infinite-width alignment predictions hold robustly when measured with sufficient granularity. So what explains this discrepancy? (3/10)
📄 Paper: arxiv.org/abs/2505.22491 Catch our Spotlight at #NeurIPS2025 Today! 📅 Wed Dec 3 🕟 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903 Huge thanks to my amazing collaborators: @mohaas.bsky.social @sbordt.bsky.social @ulrikeluxburg.bsky.social
We resolve this via a fine-grained analysis of the regime previously considered unstable (and therefore uninteresting). Under CE loss, we find this regime comprises two distinct sub-regimes: a Catastrophically Unstable regime and a benign Controlled Divergence regime. (4/10)
Leena C Vankadara
Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully ...
arxiv.org
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling