A draft paper (for an invited talk at AAAI next month) offering a philosophical analysis of work on mechanistic interpretability, with special attention to methods for propositional interpretability.
arxiv.org/abs/2501.15740
Mechanistic interpretability is the program of explaining what AI systems are doing in terms of their internal mechanisms. I analyze some aspects of the program, along with setting out some concrete c...