AI research scientist at Google Deepmind, Zürich
Ibrahim Alabdulmohsin









Good, but how many recursion rounds do I need? The optimal number of recursion rounds depends on the model size and the training compute budget. Smaller models benefit more from RINS. RINS also helps more with longer training durations.
Feb 12, 2025
🔥 Excited to introduce RINS - a technique that boosts model performance by recursively applying early layers during inference without increasing model size or training FLOPs! Not only does it significantly improve LMs, but also multimodal systems like SigLIP. (1/N)
To repeat, we train RINS on less data to match the training FLOPs of the baseline, so this is a stronger result than “sample efficiency”, and one should not simply expect it to work. For example, RINS does NOT help in image classification, but it does help in language modeling and multimodal systems. Why? 🤔 (3/n)
We also introduce *stochastic* RINS, where the number of recursion rounds is sampled from a binomial distribution. This *improves* performance in SigLIP (despite also *saving* training FLOPs). In language modeling, however, there is a tradeoff between flexibility and maximum performance gain.
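A rough sketch of how the sampling step could look. The thread only says the number of rounds comes from a binomial distribution; the parametrization below (one guaranteed pass plus a `Binomial(max_extra, p)` number of extra passes) and the names `sample_rounds`, `max_extra`, and `p` are assumptions made for illustration, not the paper's definition.

```python
import random


def sample_rounds(max_extra: int, p: float, rng: random.Random) -> int:
    """Sample a number of recursion rounds for one training step.

    Assumption (not from the thread): rounds = 1 + Binomial(max_extra, p),
    so the first block is always applied at least once.
    """
    if max_extra < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need max_extra >= 0 and 0 <= p <= 1")
    # A binomial draw is a sum of independent Bernoulli(p) trials.
    extra = sum(1 for _ in range(max_extra) if rng.random() < p)
    return 1 + extra


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample_rounds(max_extra=2, p=0.5, rng=rng) for _ in range(1000)]
```

Each training step would then run the recursive forward pass with the sampled number of rounds, which is why the average training cost can drop below that of always using the maximum.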
So, please check out our work: abs: arxiv.org/abs/2502.07503 pdf: arxiv.org/pdf/2502.07503 and please reach out with any comments or questions.
Our inspiration came from the study of self-similarity in language. If patterns are shared across scales, could scale-invariant decoding serve as a good inductive bias for processing language? It turns out that it does!
Question: what if we use infinite compute? Will the gap vanish? We ran a scaling analysis and found that RINS improves both the asymptotic performance limit (so the gap actually increases rather than vanishes) and the convergence speed (scaling exponent).
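To make the two quantities concrete, here is one common way to write a scaling law. The functional form and the symbols ($x$ for training compute, $\beta$, $c$, $\varepsilon_\infty$) are assumptions for illustration; the thread does not state the exact parametrization used.

```latex
% Assumed power-law form: loss as a function of training compute x.
\varepsilon(x) = \beta\, x^{-c} + \varepsilon_\infty
% c is the scaling exponent (convergence speed),
% \varepsilon_\infty is the asymptotic limit as x \to \infty.
% Under this form, the gap between a baseline and RINS at infinite compute is
\lim_{x \to \infty} \bigl( \varepsilon_{\text{base}}(x) - \varepsilon_{\text{RINS}}(x) \bigr)
  = \varepsilon_{\infty}^{\text{base}} - \varepsilon_{\infty}^{\text{RINS}}
```

Under this assumed form, a lower $\varepsilon_\infty$ for RINS means the gap stays nonzero no matter how much compute is added, and a larger $c$ means the limit is approached faster.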
RINS is trivial to implement. After you pick your favorite model and fix your training budget: (1) partition the model into two equally sized blocks, (2) apply recursion on the first block and train for the same amount of compute you had planned, which means training on *fewer* examples! That’s it!
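The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the layers are plain functions standing in for transformer blocks, and `Layer`, `rins_forward`, and `rounds` are hypothetical names. It assumes `rounds` counts the total number of passes through the first block, so `rounds=1` is the ordinary non-recursive forward pass.

```python
from typing import Callable, Sequence

# Stand-in for a transformer block: maps activations to activations.
Layer = Callable[[float], float]


def rins_forward(layers: Sequence[Layer], x: float, rounds: int = 2) -> float:
    """Apply the first half of `layers` `rounds` times, then the second half once."""
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    half = len(layers) // 2
    first, second = layers[:half], layers[half:]
    # Recursion: the same first-block weights are reused on every round.
    for _ in range(rounds):
        for layer in first:
            x = layer(x)
    # The second block is applied exactly once.
    for layer in second:
        x = layer(x)
    return x


# Toy 4-layer "model" built from simple arithmetic functions.
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 10]
baseline_out = rins_forward(toy_layers, 0.0, rounds=1)
recursive_out = rins_forward(toy_layers, 0.0, rounds=2)
```

Note that the parameter count is unchanged: recursion reuses the same first block, so only the compute per example grows.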
Recursion is trending (e.g. MobileLLM). But recursion adds compute per example, so to show that it helps, one must match training FLOPs; otherwise we could’ve just trained the baseline longer. With this, RINS beats 60+ other recursive methods. (2/n)
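A back-of-the-envelope version of the FLOPs-matching argument. It assumes per-example cost is proportional to the number of layer applications, that the model splits into two equally expensive halves with the first half applied `rounds` times, and it ignores embeddings and the output head. The function names are hypothetical.

```python
from fractions import Fraction


def relative_cost(rounds: int) -> Fraction:
    """Per-example compute relative to the non-recursive baseline.

    Baseline: 2 half-model passes. Recursive model: `rounds` passes through
    the first half plus 1 pass through the second half.
    """
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    return Fraction(rounds + 1, 2)


def matched_examples(baseline_examples: int, rounds: int) -> int:
    """Number of training examples that keeps total training FLOPs fixed."""
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    # examples * relative_cost == baseline_examples, rounded down.
    return baseline_examples * 2 // (rounds + 1)


cost_two_rounds = relative_cost(2)
examples_two_rounds = matched_examples(3_000_000, rounds=2)
```

Under these assumptions, two rounds cost 1.5x per example, so the recursive model sees only two thirds of the baseline's examples for the same budget.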
The NeurIPS Call for Papers is now live. Abstracts are due May 11th AoE, with full papers due May 15th AoE. neurips.cc/Conferences/... Please read about key changes to Dataset and Benchmarks submissions this year in our blog post: blog.neurips.cc/2025/03/10/n...
Mar 10, 2025
NeurIPS Conference