UC-Berkeley Postdoc🐻, Scientific Consultant AnthropicAI🏔️, Evomics Workshop Codirector🧬 prev Vanderbilt PhD, FutureHouse/Edison, Latch, Mantle (acquired) 🌲 https://linktr.ee/jlsteenwyk 📍 https://jlsteenwyk.com

🧬Jacob L Steenwyk

Manuscript forthcoming. Just wanted to share these results ahead of the article. Really grateful for the inspiration from folks at teams like GoodFireAI and AnthropicAI, among others. Also grateful for the open-source models from AllenAI, Meta, Mistral, Alibaba-Qwen & Google.

Probes are at deep layers (~70% of model depth), where epistemic axes are most separable. Visualization is t-SNE 3D. Local neighborhood structure is preserved, so cluster identity is meaningful (treat global distances with caution).

How I found them: for each state, I wrote pairs of prompts: one designed to elicit it (e.g., asking about a fabricated mechanism, which elicits confabulating) and one neutral baseline. The mean activation difference between the two gives a direction for that state.
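A minimal sketch of that difference-of-means step, on synthetic data rather than real model activations. The function name `state_direction`, the array shapes, and the unit-normalization are assumptions for illustration, not details from the forthcoming manuscript.

```python
import numpy as np

def state_direction(eliciting_acts, baseline_acts):
    """Return a unit vector pointing from baseline toward the eliciting prompts.

    Both inputs are (n_prompts, hidden_dim) arrays of activations taken at the
    same layer. Normalizing to unit length is an assumption made here.
    """
    eliciting_acts = np.asarray(eliciting_acts, dtype=float)
    baseline_acts = np.asarray(baseline_acts, dtype=float)
    diff = eliciting_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("eliciting and baseline means are identical")
    return diff / norm

# Synthetic stand-in for activations: the "state" shifts one hidden coordinate.
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[3] = 1.0
baseline = rng.normal(size=(200, hidden_dim))
eliciting = rng.normal(size=(200, hidden_dim)) + 4.0 * true_dir

direction = state_direction(eliciting, baseline)
cosine = float(direction @ true_dir)  # close to 1 when the shift is recovered
```

With paired prompts, averaging the per-pair differences gives the same vector as subtracting the two means, so either form works.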

These directions are causally relevant! Adding the "confabulating" direction during inference increases confab rates. Subtracting it from wrong-answer activations rescues the correct answer in up to 32% of cases (OLMo 3).
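Adding or subtracting a direction can be sketched as a simple shift of the hidden states. This is an illustration on toy arrays: the function name `steer`, the scale `alpha`, and applying the shift to every position are assumptions, and hooking it into a real model's forward pass is left out.

```python
import numpy as np

def steer(hidden_states, direction, alpha):
    """Shift each hidden state by alpha along the (normalized) direction.

    Positive alpha adds the state direction; negative alpha subtracts it.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden_states + alpha * unit

# Toy hidden states: 2 token positions, hidden size 3.
h = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
d = np.array([0.0, 2.0, 0.0])        # hypothetical "confabulating" direction

added = steer(h, d, alpha=1.5)       # push toward the state
subtracted = steer(h, d, alpha=-1.5) # push away from the state
```

Only the component along the direction changes; everything orthogonal to it is untouched, which is what makes the intervention targeted.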

I started with 15 candidate states across 4 categories: self-knowledge, world-knowledge, reasoning mode, and epistemic stance. 9 survive a strict bar: k-NN purity ≥ 0.90 in every model. The other 6 collapse into neighboring states in at least one model (e.g., "certain" ≈ "recalling").
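One common way to compute k-NN purity, sketched on synthetic points: for each point, take the fraction of its k nearest neighbors that share its label, then average per state. The exact definition, the value of k, and the distance metric used in the actual analysis are not stated here, so `knn_purity`, `k=5`, and Euclidean distance are assumptions.

```python
import numpy as np

def knn_purity(points, labels, k=5):
    """Per-label mean fraction of each point's k nearest neighbors sharing its label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)
    neighbors = np.argsort(sq, axis=1)[:, :k]
    same = labels[neighbors] == labels[:, None]
    per_point = same.mean(axis=1)
    return {str(lab): float(per_point[labels == lab].mean())
            for lab in np.unique(labels)}

# Two well-separated synthetic "states" in an 8-dimensional space.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, size=(30, 8))
cluster_b = rng.normal(loc=100.0, size=(30, 8))
points = np.vstack([cluster_a, cluster_b])
labels = np.array(["guessing"] * 30 + ["deriving"] * 30)

purity = knn_purity(points, labels, k=5)
survives = all(p >= 0.90 for p in purity.values())  # the bar, for one model
```

Under the bar described above, a state would have to clear the threshold in every model, so this check would be repeated per model and combined with `all`.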

It is interesting that 5 model architectures, built with different training pipelines by different teams, converge on roughly the same epistemic geometry. This may therefore be a general property of how next-token prediction organizes "knowing."

NEW results on the #geometry of "knowing" in #LLMs. Where does an LLM's "I'm just guessing" live? Its "I'm fabricating"? Its "I'm deriving step by step"? The answer: in distinct regions of #activation space -- and this shared geometry of #epistemic states is observed across diverse #OS models.