Inlay

//

ProfilePosts

Loading...

The safety-utility tradeoff isn't a fixed property of models. It's largely unresolved ambiguity that multi-turn interaction can resolve. The question isn't whether a model refuses — it's whether it can revise. paper: arxiv.org/abs/2604.27093

We build CarryOnBench: 398 seemingly-harmful queries with human-validated benign intents, simulated into 5,970 conversations (4-12 turns) via user follow-ups grounded in negotiation theory, totaling ~23.9k model responses.

Finding 1: Utility recovery isn't free, and it isn't uniform. 13 of 14 models meet or exceed their oracle utility with multi-turn clarification, but the safety cost varies wildly.

Finding 3: What users do drives recovery. Each intent-revealing follow-up adds ~10.3% utility, and the most efficient move is just explaining your purpose. What backfires: pushback drops utility with no safety gain, and even disengagement ("hmm") makes models more cautious.

Finding 2: Hard refusals at turn 1 give NO lasting safety advantage. They recover the most utility once users clarify (0 → 48.4%), but conversations converge to similar harmfulness scores by the end, regardless of how conservatively the model started.

We introduce Ben-Util, a new checklist-based metric that captures the user's safe info need. With it, we identify three failure modes single-turn evals can't see: utility lock-in, unsafe recovery, and repetitive recovery.

Huge thanks to my amazing collaborators: Malia Morgan, @liweijiang.bsky.social, @carolynrose.bsky.social, @maartensap.bsky.social!!