//
sign in
Profile
by @danabra.mov
Profile
by @dansshadow.bsky.social
Profile
by @jimpick.com
AviHandle
by @danabra.mov
AviHandle
by @dansshadow.bsky.social
AviHandle
by @katherine.computer
EventsList
by @katherine.computer
ProfileHeader
by @dansshadow.bsky.social
ProfileHeader
by @danabra.mov
ProfileMedia
by @danabra.mov
ProfilePlays
by @danabra.mov
ProfilePosts
by @danabra.mov
ProfilePosts
by @dansshadow.bsky.social
ProfileReplies
by @danabra.mov
Record
by @atsui.org
Skircle
by @danabra.mov
StreamPlacePlaylist
by @katherine.computer
+ new component
ProfilePosts








Loading...
The safety-utility tradeoff isn't a fixed property of models. It's largely unresolved ambiguity that multi-turn interaction can resolve. The question isn't whether a model refuses — it's whether it can revise. paper: arxiv.org/abs/2604.27093
We build CarryOnBench: 398 seemingly-harmful queries with human-validated benign intents, simulated into 5,970 conversations (4-12 turns) via user follow-ups grounded in negotiation theory, totaling ~23.9k model responses.
1mo
1mo
Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We...
arxiv.org
Useless but Safe? Benchmarking Utility Recovery with User Intent...
Mingqian Zheng
Mingqian Zheng
LLMs refuse ambiguous queries that look harmful but aren't. Can they recover once users clarify, while staying safe? Our new interactive multi-turn benchmark measures both. 🚨 Turns out: not both at once.
Finding 1: Utility recovery isn't free, and it isn't uniform. 13 of 14 models meet or exceed their oracle utility with multi-turn clarification, but the safety cost varies wildly.
1mo
1mo
Finding 3: What users do drives recovery. Each intent-revealing follow-up adds ~10.3% utility, and the most efficient move is just explaining your purpose. What backfires: pushback drops utility with no safety gain, and even disengagement ("hmm") makes models more cautious.
Huge thanks to my amazing collaborators: Malia Morgan, @liweijiang.bsky.social, @carolynrose.bsky.social, @maartensap.bsky.social!!
Finding 2: Hard refusals at turn 1 give NO lasting safety advantage. They recover the most utility once users clarify (0 → 48.4%), but conversations converge to similar harmfulness scores by the end, regardless of how conservatively the model started.
1mo
1mo