One of my favorite findings: Positional embeddings are just training wheels. They help convergence but hurt long-context generalization.
We found that if you simply delete them after pretraining and recalibrate for <1% of the original budget, you unlock massive context windows. Smarter, not harder.
hardmaru
Introducing DroPE: Extending Context by Dropping Positional Embeddings
We found embeddings like RoPE aid training but bottleneck long-sequence generalization. Our solution’s simple: treat them as a temporary training scaffold, not a permanent necessity.
arxiv.org/abs/2512.12167
pub.sakana.ai/DroPE