Voice "cloning" is style transfer.
Across three widely used systems (ElevenLabs V3, Coqui-XTTS, Chatterbox), clones don't just copy speakers; they reshape them to sound warmer, more authoritative, more like native English speakers, and even more “humanlike”.
Cloning also homogenizes identity, and speakers become harder to tell apart.
For example, our speakers came from 22 language backgrounds, yet voice clones get pulled towards five Anglophone varieties: US, UK, Canadian, Australian, NZ English.
Non-native English speakers read the Grandfather Passage. With explicit consent, we cloned each recording with three TTS systems and had US-English annotators rate paired source/clone clips on a 1–5 Likert scale across 7 dimensions, without knowing which clip was human and which was synthetic.
Every dimension shifted in the same direction. Clones were rated:
+19% more authoritative
+14% warmer
+20% more customer-service-like
+14% more “humanlike”
+33% more "native English"
Listeners reported +18% higher *trust* in the cloned voices and +17% more willingness to have an intimate convo.
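For readers wondering how a figure like "+20%" can come out of paired Likert ratings, here is a minimal sketch. It assumes the shift is the relative change in mean rating from source to clone; the thread does not state the exact formula, so treat that as an assumption. The function name `percent_shift` and the ratings are made up for illustration and are not data from the study.

```python
from statistics import mean


def percent_shift(source_ratings, clone_ratings):
    """Relative change in mean rating from source to clone, in percent.

    Assumption: the thread does not say how its percentages were derived;
    this uses (mean_clone - mean_source) / mean_source.
    """
    if len(source_ratings) != len(clone_ratings):
        raise ValueError("ratings must be paired: one source and one clone per clip")
    src = mean(source_ratings)
    cln = mean(clone_ratings)
    return 100.0 * (cln - src) / src


# Hypothetical 1-5 Likert ratings for one dimension (not real data).
source = [3, 2, 4, 3, 3]  # ratings of the original recordings
clone = [4, 3, 4, 3, 4]   # ratings of the matching clones

shift = percent_shift(source, clone)
print(f"{shift:+.1f}%")  # prints "+20.0%"
```

A real analysis would more likely use a paired test or a mixed-effects model over annotators and speakers; this only shows the arithmetic behind a headline percentage.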
🔗 Paper, paired audio, code: kzhou-cloud.github.io/voice-clonin...
w/ Federico Bianchi @mbartelds.bsky.social Anna Pot @jameszou.bsky.social @togetherai.bsky.social
A person's voice is a deeply personal marker of identity. Although voice cloning has misuse risks, there are also many legitimate and consensual uses for this technology. In those uses, unfaithful cloning carries its own harm: it distorts how a person sounds and erases personal identity traits.
What happens if you clone a clone? It turns out the style transfer compounds: across iterative rounds, clones drift further from the original embedding and converge into a smaller, shared region.
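One way to picture "drift" and "convergence" in embedding space is sketched below. It assumes cosine distance between speaker embeddings, which the thread does not specify, and uses made-up 2-D vectors in place of real speaker embeddings. The helper names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_from_original(rounds):
    """Distance of each round's embedding from round 0 (the original voice)."""
    return [cosine_distance(rounds[0], emb) for emb in rounds]


def mean_pairwise_distance(embeddings):
    """Average distance over all speaker pairs; smaller means harder to tell apart."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings (hypothetical): index 0 is the original, 1 and 2 are
# successive clone-of-a-clone rounds.
speaker_a = [[1.0, 0.0], [0.9, 0.3], [0.7, 0.6]]
speaker_b = [[0.0, 1.0], [0.3, 0.9], [0.6, 0.7]]

drift_a = drift_from_original(speaker_a)
spread_per_round = [
    mean_pairwise_distance([speaker_a[r], speaker_b[r]]) for r in range(3)
]

print("drift of speaker A per round:", [round(d, 3) for d in drift_a])
print("speaker spread per round:", [round(s, 3) for s in spread_per_round])
```

In this toy setup the drift grows with each round while the spread between speakers shrinks, which is the pattern the post describes. The numbers themselves carry no empirical meaning.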