10/ This work was co-first-authored with Jerome Han, together with @benpry.bsky.social, @satchelgrant.bsky.social, @noahdgoodman.bsky.social, and @judithfan.bsky.social.
arXiv: arxiv.org/abs/2605.28742
Code: github.com/LinasNas/cor...
Website: linasnas.github.io/core-reasoni...
9/ When models improve with methods like RLVR or by reusing their past reasoning traces, we often don’t know exactly *where* the gains came from.
CORE makes the gains inspectable: it assigns credit to compact, reusable abstractions, and tracks when and how much each one helps.
5/ Insights are short natural language statements that capture generalizable differences between failed & successful reasoning traces. Generating and refining insights allows models to improve quickly over time.
As a bonus, insights are interpretable!
1/ New preprint! Reasoning models often require hundreds of task examples and thousands of rollouts to improve on a task. How can they learn more from much less?
Introducing CORE: contrastive self-reflection for rapid, sample-efficient, and interpretable self-improvement 🧵
3/ External memory offers an alternative: keep the model frozen, and store what’s learned outside the weights.
But what should go into the memory store?
Raw traces are long and too specific, while continuously summarized memories can be unstable & miss what was actually important.
2/ Verifiable rewards make self-improvement possible across many reasoning tasks.
But RLVR and prompt optimization can be expensive: they mostly rely on brute-force guess-and-check, rather than explicitly extracting the general principles that separate success from failure.
7/ Using gpt-oss-120b as the base model, we found that CORE improved faster than GRPO, GEPA, episodic RAG, and MemRL across 4 different reasoning tasks.
In the 10-example train setting, CORE beat every baseline’s best result (after 1000s of rollouts) within its first 350 rollouts.
Linas Nasvytis
8/ CORE achieved these gains while adding far fewer context tokens: 37x fewer than RAG, 36x fewer than MemRL, and 1.4x fewer than GEPA.
Bottom line: CORE was more efficient than the baselines on each dimension: fewer rollouts, fewer training samples, and fewer added context tokens.
4/ The central challenge is to find reusable abstractions that correctly assign credit to information that’s proven itself to be valuable.
CORE uses *contrastive self-reflection* to do so: it extracts transferable insights that distinguish successful traces from failed attempts.
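To make the contrastive setup concrete, here is a minimal sketch of how successful and failed traces on the same problem might be paired up before reflecting on them. It is an illustration under assumptions, not CORE's actual code: the function name `contrastive_pairs`, the `(problem_id, trace, solved)` tuple layout, and the choice to pair every success with every failure are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def contrastive_pairs(rollouts):
    """Pair each successful trace with each failed trace on the same problem.

    `rollouts` is an iterable of (problem_id, trace, solved) tuples, where
    `solved` is the verifiable reward as a bool. The layout is an assumption
    made for this sketch.
    """
    by_problem = defaultdict(lambda: {"ok": [], "bad": []})
    for problem_id, trace, solved in rollouts:
        by_problem[problem_id]["ok" if solved else "bad"].append(trace)

    pairs = []
    for problem_id, group in by_problem.items():
        # Problems with only successes or only failures yield no pairs,
        # since there is nothing to contrast.
        for good, bad in product(group["ok"], group["bad"]):
            pairs.append((problem_id, good, bad))
    return pairs
```

Each resulting pair could then be handed to the model with a prompt asking what the successful trace did differently, which is where a candidate insight would come from.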
6/ Not every insight enters memory.
CORE keeps only those insights that help solve previous problems, and tracks the kinds of problems they help on. This lets the model reason comparatively about its past reasoning in order to improve it.
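A rough sketch of what such a gate could look like, assuming a hypothetical `solve(problem, insight)` callable that returns whether the problem was solved (with `insight=None` meaning no insight in context). The net-gain rule, the `min_gain` threshold, and the `kind` field are assumptions for illustration, not details from the paper.

```python
def gate_insight(insight, problems, solve, min_gain=1):
    """Decide whether an insight enters memory, and record where it helps.

    problems: list of dicts with "id" and "kind" keys (hypothetical schema).
    solve:    callable (problem, insight_or_None) -> bool (hypothetical).
    Returns (keep, helped_kinds).
    """
    helped, hurt = [], []
    for problem in problems:
        baseline = solve(problem, None)
        with_insight = solve(problem, insight)
        if with_insight and not baseline:
            helped.append(problem)
        elif baseline and not with_insight:
            hurt.append(problem)

    # Assumed acceptance rule: the insight must fix more problems than it breaks.
    keep = (len(helped) - len(hurt)) >= min_gain
    helped_kinds = sorted({p["kind"] for p in helped})
    return keep, helped_kinds
```

Tracking `helped_kinds` alongside the accept/reject decision is one simple way to record the kinds of problems an insight helps on, so it can later be retrieved only where it is likely to be relevant.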
Some examples: