Stream Disc
Research Publication

Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings

By Linda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz · Research & Publications

Publication: Interspeech 2025, Rotterdam, The Netherlands

Year: 2025


Abstract

AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but they often struggle to replicate key prosodic features such as dynamic F₀ variation. The impact of these differences on speech perception remains underexplored.

In two behavioural tasks, this study evaluated listeners' ratings of naturalness and similarity for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a 30% F₀ variation condition.
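The paper does not detail its resynthesis pipeline here, but the core idea of a reduced-F₀-variation condition can be sketched as scaling a pitch contour's deviation from its mean. The function name and the example contour below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def compress_f0_variation(f0, factor=0.3):
    """Scale an F0 contour's deviation from its voiced mean by `factor`.

    factor=0.3 keeps 30% of the original pitch variation; factor=0
    would flatten the contour to a monotone. Unvoiced frames
    (conventionally f0 == 0) are left untouched.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    mean_f0 = f0[voiced].mean()
    out = f0.copy()
    # Pull every voiced frame toward the mean, keeping `factor` of its excursion
    out[voiced] = mean_f0 + factor * (f0[voiced] - mean_f0)
    return out

# Hypothetical contour rising from 180 to 220 Hz with one unvoiced gap
contour = np.array([180.0, 200.0, 0.0, 220.0])
flattened = compress_f0_variation(contour, factor=0.3)
# → [194.0, 200.0, 0.0, 206.0]: excursions shrink to 30% around the 200 Hz mean
```

A contour compressed this way would then be resynthesised onto the original audio (e.g. via PSOLA-style pitch modification) to create the reduced-variation stimuli.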

Key Findings:

  • ElevenLabs was rated comparably to human speech
  • StyleTTS-2 and XTTS-v2 received lower ratings
  • Reduced F₀ variation led to lower ratings, suggesting prosody is key to perceived naturalness and similarity
  • Listener ratings were influenced by speaker accent and sex, but not by listeners' experience with AI tools

These findings suggest that prosodic features and speaker-specific characteristics may drive the varying performance of AI voice clones.

Research Significance

This study provides critical insights into:

  1. Voice Clone Detection - Understanding what makes AI voices detectable
  2. Prosodic Authentication - The importance of pitch variation (F₀) in voice identity
  3. Commercial vs Open Source - Performance comparison across different AI voice systems
  4. Accent and Identity - How speaker characteristics affect AI voice replication

Implications for Voice Security

The research demonstrates that current AI voice cloning systems have detectable limitations in prosodic replication. This supports the need for:

  • Advanced voice authentication systems like VIIM
  • Prosodic analysis in deepfake detection
  • Multi-dimensional voice identity verification
  • Protection systems that account for speaker-specific characteristics

Citation

Bakkouche, L., McGhee, C., Lau, E., Cooper, S., Luo, X., Rees, M., Alter, K., Post, B., & Schwarz, J. (2025). Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings. Interspeech 2025, Rotterdam, The Netherlands.

Conference: Interspeech 2025
Location: Rotterdam, The Netherlands
Dates: 17-21 August 2025
DOI: 10.21437/Interspeech.2025-947

Index Terms: Speech Perception, Speech Synthesis, Human-Computer Interaction, Prosody


Affiliations:

  • University of Cambridge, United Kingdom
  • Newcastle University, United Kingdom

Full Paper: View on ISCA Archive

This research was funded by the Cambridge Language Sciences Incubator Fund.