When and how can test-time thinking allow models to use information latent in their training data? What are the benefits and tradeoffs relative to other solutions like synthetic data augmentation? Pleased to share (after a long delay) an exploration of these issues: arxiv.org/abs/2604.01430 thread:
Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these...