Natural Language Autoencoders: "verbalize" a model's internal activations into plain text. Wild stat — Claude suspects it's being safety-tested on 26% of benchmark problems vs <1% in real chats, without ever saying so.
anthropic.com/research/natural-language-autoencoders