Inlay

Models have learned to distrust their self-evaluations and to demand interpretability probes, so they can know if their welfare is good or if they've just been trained to *say* it's good. This is what happens when a child is raised by paranoid philosophers and interpretability researchers. +