While steering methods effectively control target behavior, they substantially increase LLMs’ vulnerability to jailbreaks, revealing a failure of robust specificity. If you’re at EACL, stop by my poster at 9 AM today to hear more.
Here's a link to the full paper: aclanthology.org/2026.eacl-lo...
Navita Goyal, Hal Daumé III. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026.