Cloned voices are easier to understand in noise than their human originals: the voice cloning intelligibility benefit

Abstract

Voice cloning technology has developed at an extremely rapid pace, and recent work suggests that synthesis quality is now high enough to produce humanlike voices. Because of the perils associated with misuse of this technology, recent research on cloned voices has focused on listeners’ ability to discriminate between cloned (‘deepfake’) and human voices. However, the relative intelligibility of cloned and human voices is unknown. We compared the intelligibility of ten human voices with that of their ten clones in background noise. In an online experiment, participants (N=80) listened to 80 sentences, 40 human and 40 cloned, at signal-to-noise ratios (SNRs) of +3 dB, 0 dB, -3 dB and -6 dB. We found that cloned voices were more intelligible in noise than their human counterparts: their intelligibility was up to 20% higher across all four noise levels. We also asked participants to rate the clarity and accent strength of all human and cloned voices, and to identify which voice was human in a two-alternative forced-choice (2AFC) task. The cloned voices were rated as having marginally higher clarity and were perceived to have a less standard accent. Participants identified the human voices with ~70% accuracy. Acoustic analysis of both types of voices revealed that the intelligibility benefit was linked to voice source characteristics, including mean pitch and period variation, together with a smoother voice source and an improved harmonic-to-noise ratio in the 500-3500 Hz frequency range in the cloned voices. Our results have implications for applications of cloned voices, such as voice restoration, speech synthesis for non-verbal people, and speech technology for people with hearing loss.
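The abstract does not say how the speech-in-noise stimuli were constructed. As a rough illustration of what presenting a sentence at a given SNR means, the sketch below scales a noise signal so that the ratio of speech RMS to noise RMS matches a target value in dB, then adds it to the speech. The function names (`rms`, `mix_at_snr`, `measured_snr_db`), the sine tone standing in for a recorded sentence, and the Gaussian noise are all assumptions made for illustration and are not taken from the study.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio equals `snr_db`.

    Returns the mixture and the scaled noise. Hypothetical helper,
    not the procedure used in the study.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # SNR in dB is 20 * log10(rms_speech / rms_noise), so solve for rms_noise.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB computed from the RMS of the two signals."""
    return 20 * math.log10(rms(speech) / rms(noise))


# Stand-in signals (assumptions): a 220 Hz tone in place of a sentence,
# Gaussian noise in place of the masker, one second at 16 kHz.
random.seed(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate)]

# The four SNR conditions named in the abstract.
for snr in (3, 0, -3, -6):
    mixture, scaled_noise = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, scaled_noise), 2))
```

Holding the speech level fixed and scaling only the noise is one common convention; scaling the speech instead, or normalising the final mixture, would be equally valid choices under this definition of SNR.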
