How much does a language model forget when finetuned on new tasks? We show that both model size and optimization matter, and that forgetting can be nearly eliminated with self-generated replay!
arxiv.org/abs/2605.26097
w/ Martin Marek, Dongkyu Cho, Shikai Qiu, Rumi Chunara, and Pavel Izmailov. 1/8