Inlay

We find significant shifts between models’ expressed and revealed preferences under conflict! Models say they prefer actions that support protective values (e.g. harmlessness) when asked directly, but support personal values (e.g. helpfulness) in more realistic evaluations.