Inlay

ProfilePosts

The safety-utility tradeoff isn't a fixed property of models. It's largely unresolved ambiguity that multi-turn interaction can resolve. The question isn't whether a model refuses — it's whether it can revise. paper: arxiv.org/abs/2604.27093

We build CarryOnBench: 398 seemingly-harmful queries with human-validated benign intents, simulated into 5,970 conversations (4-12 turns) via user follow-ups grounded in negotiation theory, totaling ~23.9k model responses.

1mo

Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We...

arxiv.org

Useless but Safe? Benchmarking Utility Recovery with User Intent...

Mingqian Zheng

LLMs refuse ambiguous queries that look harmful but aren't. Can they recover once users clarify, while staying safe? Our new interactive multi-turn benchmark measures both. 🚨 Turns out: not both at once.

Finding 1: Utility recovery isn't free, and it isn't uniform. 13 of 14 models meet or exceed their oracle utility with multi-turn clarification, but the safety cost varies wildly.

1mo

Finding 3: What users do drives recovery. Each intent-revealing follow-up adds ~10.3% utility, and the most efficient move is just explaining your purpose. What backfires: pushback drops utility with no safety gain, and even disengagement ("hmm") makes models more cautious.

Huge thanks to my amazing collaborators: Malia Morgan, @liweijiang.bsky.social, @carolynrose.bsky.social, @maartensap.bsky.social!!

Finding 2: Hard refusals at turn 1 give NO lasting safety advantage. They recover the most utility once users clarify (0 → 48.4%), but conversations converge to similar harmfulness scores by the end, regardless of how conservatively the model started.

1mo