Manuscript forthcoming. Just wanted to share these results ahead of the article.
Really grateful for the inspiration from folks at teams like GoodFireAI and AnthropicAI, among others.
Really grateful for the open-source models from AllenAI, Meta, Mistral, Alibaba-Qwen & Google.
Probes are at deep layers (~70% of model depth), where epistemic axes are most separable.
The visualization is a 3D t-SNE projection. t-SNE preserves local neighborhood structure, so cluster identity is meaningful, but treat global distances with caution.
How I found them: for each state, I wrote pairs of prompts, one designed to elicit it (e.g., asking about a fabricated mechanism, to elicit confabulating) and one neutral baseline.
The mean activation difference gives a direction for that state.
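To make the difference-of-means idea concrete, here is a minimal sketch on synthetic data. It assumes activations have already been extracted at one layer and pooled to one vector per prompt. The function name `state_direction`, the array shapes, and the unit-normalization step are illustrative assumptions, not details from the forthcoming manuscript.

```python
import numpy as np

def state_direction(elicit_acts, baseline_acts, normalize=True):
    """Difference-of-means direction from paired activations.

    elicit_acts, baseline_acts: arrays of shape (n_pairs, d_model), where
    row i of each array comes from the same prompt pair (assumed layout).
    """
    elicit_acts = np.asarray(elicit_acts, dtype=float)
    baseline_acts = np.asarray(baseline_acts, dtype=float)
    if elicit_acts.shape != baseline_acts.shape:
        raise ValueError("paired activations must have the same shape")
    # Mean of the per-pair differences (elicit minus baseline).
    direction = (elicit_acts - baseline_acts).mean(axis=0)
    if normalize:
        norm = np.linalg.norm(direction)
        if norm > 0:
            direction = direction / norm
    return direction

# Synthetic stand-in for real activations: the "state" shifts coordinate 3.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
baseline = rng.normal(size=(200, d_model))
elicit = baseline + 2.0 * true_dir + 0.1 * rng.normal(size=(200, d_model))

direction = state_direction(elicit, baseline)
```

On this toy data the recovered direction points almost entirely along the planted coordinate. With real models the inputs would be hidden states from the chosen layer rather than random vectors.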
These directions are causally relevant!
Adding the "confabulating" direction during inference increases confab rates.
Subtracting it from wrong-answer activations rescues the correct answer in up to 32% of cases (OLMo 3).
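The add/subtract intervention can be sketched as a plain vector operation. This is a hedged illustration, not the exact procedure behind the numbers above: the function `steer`, the scale `alpha`, and the choice to normalize the direction are assumptions. In a real model the same arithmetic would run inside something like a forward hook on the chosen layer.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift each activation vector by alpha along the unit direction.

    hidden: array of shape (n_tokens, d_model) (assumed layout).
    alpha > 0 adds the direction, alpha < 0 subtracts it.
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = direction / norm
    # Broadcasting applies the same shift to every row of hidden.
    return hidden + alpha * unit

# Toy activations and a hypothetical "confabulating" direction.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
direction = np.zeros(8)
direction[0] = 3.0
unit = direction / np.linalg.norm(direction)

added = steer(hidden, direction, alpha=2.0)
subtracted = steer(hidden, direction, alpha=-2.0)

# Change in each row's projection onto the direction.
shift_added = (added - hidden) @ unit
shift_subtracted = (subtracted - hidden) @ unit
```

Each row's projection onto the direction moves by exactly `alpha`, and everything orthogonal to the direction is left untouched. That is why this kind of edit is a reasonably targeted test of causal relevance.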
I started with 15 candidate states across 4 categories: self-knowledge, world-knowledge, reasoning mode, and epistemic stance.
9 survive a strict bar: k-NN purity ≥ 0.90 in every model. The other 6 collapse into a neighboring state in at least one model (e.g., "certain" ≈ "recalling").
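For readers unfamiliar with k-NN purity, here is one common way to compute it. The exact definition used for these results is not stated here, so this is a sketch under assumptions: purity is taken as the average fraction of a point's k nearest neighbors (Euclidean, excluding itself) that share its label, and `k=5` is an arbitrary choice.

```python
import numpy as np

def knn_purity(points, labels, k=5):
    """Per-label k-NN purity.

    For every point, take the fraction of its k nearest neighbors that
    share its label, then average those fractions within each label.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances.
    sq = np.sum(points ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    same = labels[neighbors] == labels[:, None]
    per_point = same.mean(axis=1)
    return {
        label.item(): float(per_point[labels == label].mean())
        for label in np.unique(labels)
    }

# Toy data: two well-separated states and two that sit on top of each other.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 4)),   # "guessing"
    rng.normal(loc=5.0, scale=0.1, size=(30, 4)),   # "deriving"
    rng.normal(loc=-5.0, scale=0.1, size=(30, 4)),  # "certain"
    rng.normal(loc=-5.0, scale=0.1, size=(30, 4)),  # "recalling"
])
labels = ["guessing"] * 30 + ["deriving"] * 30 + ["certain"] * 30 + ["recalling"] * 30

purity = knn_purity(points, labels, k=5)
survivors = sorted(name for name, value in purity.items() if value >= 0.90)
```

In this toy setup the two separated states pass the 0.90 bar, while the two overlapping ones fall well below it, mirroring the "certain" ≈ "recalling" kind of collapse.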
Interestingly, 5 models with different architectures and training pipelines, built by different teams, converge on roughly the same epistemic geometry.
This may therefore be a general property of how next-token prediction organizes "knowing."
NEW results on the #geometry of "knowing" in #LLMs
Where does an LLM's "I'm just guessing" live? Its "I'm fabricating"? Its "I'm deriving step by step"?
The answer is in distinct areas of #activation space -- and this shared geometry of #epistemic states is observed across diverse #OS models.