Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings
Publication: Interspeech 2025, Rotterdam, The Netherlands
Year: 2025
Abstract
AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but they often struggle to replicate key prosodic features such as dynamic F₀ variation. The impact of these differences on speech perception remains underexplored.
This study used two behavioural tasks to evaluate listeners' naturalness and similarity ratings for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a reduced (30%) F₀-variation condition.
Key Findings:
- ElevenLabs was rated comparably to human speech
- StyleTTS-2 and XTTS-v2 received lower ratings
- Reduced F₀ variation led to lower ratings, suggesting prosody is key to perceived naturalness and similarity
- Listener ratings were influenced by speaker accent and sex, but not by listeners' experience with AI tools
These findings suggest that prosodic features and speaker-specific characteristics may drive the varying performance of AI voice clones.
Research Significance
This study provides critical insights into:
- Voice Clone Detection - Understanding what makes AI voices detectable
- Prosodic Authentication - The importance of pitch variation (F₀) in voice identity
- Commercial vs Open Source - Performance comparison across different AI voice systems
- Accent and Identity - How speaker characteristics affect AI voice replication
Implications for Voice Security
The research demonstrates that current AI voice cloning systems have detectable limitations in prosodic replication. This supports the need for:
- Advanced voice authentication systems like VIIM
- Prosodic analysis in deepfake detection
- Multi-dimensional voice identity verification
- Protection systems that account for speaker-specific characteristics
Citation
Bakkouche, L., McGhee, C., Lau, E., Cooper, S., Luo, X., Rees, M., Alter, K., Post, B., & Schwarz, J. (2025). Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings. Interspeech 2025, Rotterdam, The Netherlands.
Conference: Interspeech 2025
Location: Rotterdam, The Netherlands
Dates: 17-21 August 2025
DOI: 10.21437/Interspeech.2025-947
Index Terms: Speech Perception, Speech Synthesis, Human-Computer Interaction, Prosody
Affiliations:
- University of Cambridge, United Kingdom
- Newcastle University, United Kingdom
Full Paper: View on ISCA Archive
This research was funded by the Cambridge Language Sciences Incubator Fund.