Crosstalk Suppression in a Multi-Channel, Multi-Speaker System Using Acoustic Vector Sensors
Abstract
Automatic speech recognition in scenarios with multiple speakers in a reverberant space, such as a small courtroom, often requires multiple sensors. This leads to the problem of crosstalk, which must be removed before speech-to-text transcription is performed. The proposed method uses Acoustic Vector Sensors to acquire the audio streams. Speaker detection is performed using statistical analysis of the direction of arrival, and this information is then used to perform source separation. Next, each speaker's activity in each channel is analyzed, and signal fragments containing direct speech and crosstalk are identified. Crosstalk is then suppressed using a dynamic gain processor, and the resulting audio streams may be passed to a speech recognition system. The algorithm was evaluated on a custom set of speech recordings. An increase in SI-SDR over the unprocessed signal was achieved: 7.54 dB and 19.53 dB for the algorithm with and without the source separation stage, respectively. The algorithm is intended for multi-speaker scenarios requiring speech-to-text transcription, such as court sessions or conferences.
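For readers unfamiliar with the evaluation metric, SI-SDR (scale-invariant signal-to-distortion ratio) compares the energy of an optimally scaled reference signal to the energy of the residual distortion, so the score does not depend on the overall gain of the processed channel. The sketch below is a minimal implementation assuming the widely used definition of Le Roux et al. (2019); the function name and array interface are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB.

    Assumes the standard scale-invariant formulation: the reference is
    rescaled by its optimal projection onto the estimate before the
    target/distortion energy ratio is computed.
    """
    # Remove DC offset from both signals
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Optimal scaling factor projecting the reference onto the estimate
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference        # scaled target component
    distortion = estimate - target    # residual (e.g. crosstalk, artifacts)
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```

Under these assumptions, the improvement figures quoted in the abstract would correspond to si_sdr(processed, clean) minus si_sdr(unprocessed, clean) for each channel, averaged over the test recordings; the paper's exact evaluation protocol is described in the full text, not here.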