A new paper on how to scale LLM training as the token budget grows, based on convergence theory! Lots of empirical experiments validating the assumptions we make. arxiv.org/abs/2603.21191
We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e....