We first start with establishing a 𝐭𝐡𝐞𝐨𝐫𝐞𝐭𝐢𝐜𝐚𝐥 𝐬𝐚𝐟𝐞𝐭𝐲 𝐠𝐮𝐚𝐫𝐚𝐧𝐭𝐞𝐞:
At Nash Equilibrium, the defender provides safe responses to ANY adversarial input (Theorem 1). This motivates our game-theoretic approach to safety alignment beyond empirical defenses.