Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings
Publication: Interspeech 2025, Rotterdam, The Netherlands
Year: 2025
Abstract
AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but they often struggle to replicate key prosodic features such as dynamic F₀ variation. The impact of these differences on speech perception remains underexplored.
This study used two behavioural tasks to evaluate listeners' naturalness and similarity ratings for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a reduced (30%) F₀-variation condition.
Key Findings:
- ElevenLabs was rated comparably to human speech
- StyleTTS-2 and XTTS-v2 received lower ratings
- Reduced F₀ variation led to lower ratings, suggesting prosody is key to perceived naturalness and similarity
- Listener ratings were influenced by speaker accent and sex, but not by listeners' experience with AI tools
These findings suggest that prosodic features and speaker-specific characteristics may drive the varying performance of AI voice clones.
Research Significance
This study provides critical insights into:
- Voice Clone Detection - Understanding what makes AI voices detectable
- Prosodic Authentication - The importance of pitch variation (F₀) in voice identity
- Commercial vs Open Source - Performance comparison across different AI voice systems
- Accent and Identity - How speaker characteristics affect AI voice replication
Implications for Voice Security
The research demonstrates that current AI voice cloning systems have detectable limitations in prosodic replication. This supports the need for:
- Advanced voice authentication systems like VIIM
- Prosodic analysis in deepfake detection
- Multi-dimensional voice identity verification
- Protection systems that account for speaker-specific characteristics
Citation
Bakkouche, L., McGhee, C., Lau, E., Cooper, S., Luo, X., Rees, M., Alter, K., Post, B., & Schwarz, J. (2025). Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings. Interspeech 2025, Rotterdam, The Netherlands.
Conference: Interspeech 2025
Location: Rotterdam, The Netherlands
Dates: 17-21 August 2025
DOI: 10.21437/Interspeech.2025-947
Index Terms: Speech Perception, Speech Synthesis, Human-Computer Interaction, Prosody
Affiliations:
- University of Cambridge, United Kingdom
- Newcastle University, United Kingdom
Full Paper: View on ISCA Archive
This research was funded by the Cambridge Language Sciences Incubator Fund.