//
sign in
Profile
by @danabra.mov
Profile
by @dansshadow.bsky.social
Profile
by @jimpick.com
AviHandle
by @danabra.mov
AviHandle
by @dansshadow.bsky.social
AviHandle
by @katherine.computer
EventsList
by @katherine.computer
ProfileHeader
by @dansshadow.bsky.social
ProfileHeader
by @danabra.mov
ProfileMedia
by @danabra.mov
ProfilePlays
by @danabra.mov
ProfilePosts
by @danabra.mov
ProfilePosts
by @dansshadow.bsky.social
ProfileReplies
by @danabra.mov
Record
by @atsui.org
Skircle
by @danabra.mov
StreamPlacePlaylist
by @katherine.computer
+ new component
ProfileReplies









Loading...
Huge thanks to my amazing collaborators: Malia Morgan, @liweijiang.bsky.social, @carolynrose.bsky.social, @maartensap.bsky.social!!
The safety-utility tradeoff isn't a fixed property of models. It's largely unresolved ambiguity that multi-turn interaction can resolve. The question isn't whether a model refuses — it's whether it can revise. paper: arxiv.org/abs/2604.27093
Finding 3: What users do drives recovery. Each intent-revealing follow-up adds ~10.3% utility, and the most efficient move is just explaining your purpose. What backfires: pushback drops utility with no safety gain, and even disengagement ("hmm") makes models more cautious.
Finding 2: Hard refusals at turn 1 give NO lasting safety advantage. They recover the most utility once users clarify (0 → 48.4%), but conversations converge to similar harmfulness scores by the end, regardless of how conservatively the model started.
Finding 1: Utility recovery isn't free, and it isn't uniform. 13 of 14 models meet or exceed their oracle utility with multi-turn clarification, but the safety cost varies wildly.