Speaker Diarization: A Review of Objectives and Methods
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Recorded audio often contains speech from multiple people in conversation. It is useful to label such signals with speaker turns, noting when each speaker is talking and identifying each speaker. This paper discusses how to process speech signals to do such speaker diarization (SD). We examine the nature of speech signals, to identify the possible acoustical features that could assist this clustering task. Traditional speech analysis techniques are reviewed, as well as measures of spectral similarity and clustering. Speech activity detection requires separating speech from background noise in general audio signals. SD may use stochastic models (hidden Markov and Gaussian mixture) and embeddings such as x-vectors. Modern neural machine learning methods are examined in detail. Suggestions are made for future improvements.