Fast and accurate annotation of acoustic signals with deep neural networks

Curation statements for this article:
  • Curated by eLife

Abstract

Acoustic signals serve communication within and across species throughout the animal kingdom. Studying the genetics, evolution, and neurobiology of acoustic communication requires annotating acoustic signals: segmenting and identifying individual acoustic elements like syllables or sound pulses. To be useful, annotations need to be accurate, robust to noise, and fast.

Here we introduce Deep Audio Segmenter (DAS), a method that annotates acoustic signals across species based on a deep learning-derived hierarchical representation of sound. We demonstrate the accuracy, robustness, and speed of DAS using acoustic signals with diverse characteristics from insects, birds, and mammals. DAS comes with a graphical user interface for annotating song, training the network, and generating and proofreading annotations. The method can be trained to annotate signals from new species with little manual annotation and can be combined with unsupervised methods to discover novel signal types. DAS annotates song with high throughput and low latency for experimental interventions in real time. Overall, DAS is a universal, versatile, and accessible tool for annotating acoustic communication signals.

Article activity feed

  1. Evaluation Summary:

    This paper presents and evaluates a machine learning method for segmenting and annotating animal acoustic communication signals. The paper presents results from applying the method to signals from Drosophila, mice, and songbirds, but the method should be useful for a broad range of researchers who record animal vocalizations. The method appears to be easily generalizable and has high throughput and modest training times.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their names with the authors.)

  2. Reviewer #1 (Public Review):

    Review:

    The manuscript "Fast and accurate annotation of acoustic signals with deep neural networks" by Elsa Steinfath, Adrian Palacios, Julian Rottschäfer, Deniz Yuezak, and Jan Clemens describes a new piece of software that, building on previous work, trains a deep-network classifier to segment audio signals. The main advances are speed (real-time performance), operation on standard hardware, and a user-friendly interface that allows users with little machine-learning experience to train and use a deep neural network classifier.

    I. Results

    A. How good is it?

    1. How fast is it?
    a. How long to train?
    Training time depends on the amount of data, but the range quoted (10 minutes to 5 hours) is quite reasonable. It works on reasonable hardware (I tested on a laptop with a GPU).

    b. How long to classify?
    Latency to classification is between 7 and 15 ms, which is a little long for triggered optogenetics, but not bad, and certainly reasonable for acoustic feedback.

    2. How accurate is it?
    a. In absolute terms
    Accuracy is improved relative to FlySongSegmenter, particularly in recall (Arthur et al., 2013; Coen et al., 2014).
    Pulse song:
    DeepSS: precision: 97%, recall: 96%
    FlySongSegmenter: precision: 99%, recall: 87%
    Sine song:
    DeepSS: precision: 92%, recall: 98%
    FlySongSegmenter: precision: 91%, recall: 91%
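
    To make the comparison easier to read at a glance, the precision/recall pairs above can be folded into single F1 scores (their harmonic mean). A minimal sketch; only the four precision/recall pairs come from the paper, the code itself is purely illustrative:

    ```python
    # F1 score (harmonic mean of precision and recall) for the numbers quoted above.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    scores = {
        "pulse, DeepSS": f1(0.97, 0.96),            # ~0.965
        "pulse, FlySongSegmenter": f1(0.99, 0.87),  # ~0.926
        "sine, DeepSS": f1(0.92, 0.98),             # ~0.949
        "sine, FlySongSegmenter": f1(0.91, 0.91),   # ~0.910
    }
    for name, score in scores.items():
        print(f"{name}: F1 = {score:.3f}")
    ```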

    b. One main concern I have is that all the signals described, with the exception of pulse song, are relatively simple tonally. Bengalese finch song is much less noisy than zebra finch song. Mouse vocalizations are quite tonal. How would this method work on acoustic signals with noise components, like zebra finches or some non-human primate signals? Some signals can have variable spectrotemporal structure based on the distortion due to increased intensity of the signal (see, for example, Fitch, Neubauer, & Herzel, 2002).

    W. Tecumseh Fitch, Jürgen Neubauer & Hanspeter Herzel (2002) "Calls out of chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production." Animal Behaviour, 63: 407-418. doi:10.1006/anbe.2001.1912

    B. How easy to use?

    0. "our method can be optimized for new species without requiring expert knowledge and with little manual annotation work." There isn't a lot of explanation, either in the paper or in the associated documentation, of how to select network parameters for a new vocalization type. However, it does appear that small amounts of annotation are sufficient to train a reasonable classifier.

    1. How much pre-processing of signals is necessary?
    All the claims of the paper are based on pre-processed audio data, although the authors state in the Methods section that preprocessing is not necessary. It's not clear how important this pre-processing is for achieving the kinds of accuracy observed. Certainly I would expect the speed to drop if high-frequency signals like mouse vocalizations aren't downsampled. However, I tried it on raw, un-preprocessed mouse vocalizations, without downsampling and using very few training examples, and it worked quite well, only missing low signal-to-noise vocalizations.
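
    For readers unfamiliar with this preprocessing step, a minimal sketch of what downsampling a high-sample-rate recording (such as mouse USVs) can look like; the file name and sampling rates are placeholders, not values from the paper:

    ```python
    # Illustrative downsampling of a high-sample-rate recording (e.g. mouse USVs)
    # before annotation. File name and sampling rates are placeholders.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import resample_poly

    rate_in, audio = wavfile.read("usv_recording.wav")  # e.g. recorded at 300 kHz
    audio = audio.astype(np.float32)

    rate_out = 100_000                                  # assumed target rate
    gcd = int(np.gcd(rate_in, rate_out))
    audio_ds = resample_poly(audio, up=rate_out // gcd, down=rate_in // gcd)

    wavfile.write("usv_recording_ds.wav", rate_out, audio_ds.astype(np.float32))
    ```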

    C. How different from other things out there?

    It would strengthen the paper to include some numbers for other mouse and birdsong methods, rather than simple and vague assertions such as "These performance values compare favorably to that of methods specialized to annotate USVs (Coffey et al., 2019; Tachibana et al., 2020; Van Segbroeck et al., 2017)" and "Thus, DeepSS performs as well as or better than specialized deep learning-based methods for annotating bird song (Cohen et al., 2020; Koumura and Okanoya, 2016)."

    D. Miscellaneous comments

    1. Interestingly, the song types don't appear to be mutually exclusive: one can have pulse song in the middle of sine song. That might be useful to be able to toggle. I can imagine cases where it would be nice to label things that overlap, but in general, if something is sine song, it can't be pulse song, and my assumption certainly was that song types would be mutually exclusive. Adding some explanation of that to the text/user's manual would be useful.
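
    To make the requested toggle concrete: per-sample confidences can either be thresholded independently, which allows overlapping labels, or collapsed with an argmax, which makes the song types mutually exclusive. A small sketch with made-up confidences (not output from DeepSS):

    ```python
    # Two ways to turn per-sample class confidences into labels:
    #   (a) independent thresholds -> labels may overlap (pulse inside sine)
    #   (b) argmax across classes  -> labels are mutually exclusive
    # The confidence values below are made up for illustration.
    import numpy as np

    classes = ["noise", "sine", "pulse"]
    conf = np.array([
        [0.10, 0.80, 0.70],   # ambiguous sample: sine AND pulse above threshold
        [0.90, 0.05, 0.05],
        [0.20, 0.10, 0.90],
    ])

    overlapping = conf > 0.5                      # (a) multi-label
    exclusive = np.argmax(conf, axis=1)           # (b) one label per sample

    print(overlapping)
    print([classes[i] for i in exclusive])        # ['sine', 'noise', 'pulse']
    ```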

    2. How information is combined across channels is alluded to several times but not described well in the body of the manuscript, though it is mentioned in the methods in vague terms:
    "several channel convolutions, k_γ(1, γ), combine information across channels."

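    As an illustration of the idea behind such a channel convolution (not the authors' implementation): a convolution with a kernel of length 1 that spans all input channels mixes a multi-channel recording at every time step. A minimal sketch, assuming Keras and an arbitrary 9-channel recording:

    ```python
    # A "channel convolution": a length-1 kernel spanning all input channels,
    # so each output sample is a learned mix of the channels at that time step.
    # Sketch only; not the DAS/DeepSS implementation.
    import numpy as np
    from tensorflow import keras

    n_samples, n_channels = 1024, 9          # arbitrary multi-channel recording

    inputs = keras.Input(shape=(n_samples, n_channels))
    mixed = keras.layers.Conv1D(filters=16, kernel_size=1, activation="relu")(inputs)
    model = keras.Model(inputs, mixed)

    x = np.random.randn(2, n_samples, n_channels).astype("float32")
    print(model(x).shape)                    # (2, 1024, 16)
    ```
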
    II. Usability

    A. Getting it installed
    Installing on Windows 10 was a bit involved if you were not already using Python: install Anaconda, Python, TensorFlow, and the CUDA libraries, create an account to download cuDNN, and update the NVIDIA drivers.

  3. Reviewer #2 (Public Review):

    Steinfath et al. developed a new toolbox to segment animal communication signals within acoustic recordings. Specifically, they use temporal convolutional networks with dilated convolutions -- an excellent algorithmic choice, not least because WaveNet demonstrated the power of these building blocks for acoustic signal processing in 2016, and they now run on every modern phone.
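
    For concreteness, a minimal sketch of this building block, assuming Keras: a stack of causal 1-D convolutions whose dilation rate doubles at every layer, so the receptive field grows exponentially with depth (the WaveNet idea). Layer counts and filter sizes are illustrative, not the values used in the paper:

    ```python
    # WaveNet-style stack of dilated 1-D convolutions: the dilation rate doubles
    # with every layer, so the receptive field grows exponentially with depth.
    # Layer counts and filter sizes are illustrative only.
    from tensorflow import keras

    def dilated_stack(n_samples, n_channels, n_filters=32, kernel_size=3, n_layers=6):
        inputs = keras.Input(shape=(n_samples, n_channels))
        x = inputs
        for i in range(n_layers):
            x = keras.layers.Conv1D(
                n_filters, kernel_size,
                dilation_rate=2 ** i,        # 1, 2, 4, 8, ...
                padding="causal",            # no peeking into the future
                activation="relu",
            )(x)
        # per-sample class probabilities (e.g. noise / pulse / sine)
        outputs = keras.layers.Conv1D(3, 1, activation="softmax")(x)
        return keras.Model(inputs, outputs)

    model = dilated_stack(n_samples=2048, n_channels=1)
    model.summary()
    ```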

    Steinfath et al. evaluated the performance of their toolbox on courtship songs in flies, ultrasonic vocalizations in mice, and songs of Bengalese finches. While the performance is generally convincing, the specific examples do not seem to be (very) challenging pattern recognition problems, and comparisons to other tools are lacking. In particular, for the mouse and bird examples, no other methods are considered (or quantitatively discussed), not even simple baselines such as SVMs applied to spectrograms. Another concern is that the toolbox currently only works for single animals, but often, especially for courtship songs, at least a pair of animals is involved.
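
    For reference, the kind of simple baseline mentioned here could look roughly like the following: an SVM classifying individual spectrogram frames. The sketch uses scikit-learn and random placeholder data in place of real recordings and labels:

    ```python
    # Rough sketch of an "SVM on spectrogram frames" baseline: each STFT column
    # is one training example, labelled e.g. vocalization vs. silence.
    # Uses random placeholder data in place of real recordings.
    import numpy as np
    from scipy.signal import stft
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    fs = 10_000
    audio = rng.standard_normal(fs * 10)                    # placeholder recording
    _, _, Z = stft(audio, fs=fs, nperseg=256, noverlap=128)
    frames = np.log1p(np.abs(Z)).T                          # (n_frames, n_freqs)
    labels = rng.integers(0, 2, size=len(frames))           # placeholder labels

    X_train, X_test, y_train, y_test = train_test_split(frames, labels, test_size=0.25)
    clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
    print("frame accuracy:", clf.score(X_test, y_test))
    ```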

    Overall, this is an appealing tool with good performance. The toolbox seems well written, boasts a GUI for annotation and proof-reading and a clean documentation. Thus, it might be broadly used within the auditory community.

  4. Reviewer #3 (Public Review):

    The authors have developed a software package called Deep Song Segmenter (DeepSS) that uses the deep learning framework known as temporal convolutional networks to segment and annotate communication signals. The package is intended to be a general and flexible framework for annotating communication signals in many experimental settings. The paper reports performance on recordings from Drosophila, mice, and songbirds. The classification and detection performance across these diverse data sets is quite good, generally in the mid-to-high 90 percent range. This suggests that the framework could be useful for a wide range of researchers studying communication signals.

    Strengths:

    In addition to overall classification, the authors do a good job of answering the most important additional questions about their package's performance. Namely, they show that: (i) classification/detection does not degrade dramatically at lower signal-to-noise ratios; (ii) segmentation is temporally precise, generally in the submillisecond range for temporally well-defined events; (iii) classification is fast after training and can be run on a standard PC, facilitating use in real time; and (iv) good classification performance can be achieved with a relatively modest number of hand annotations (generally in the 100s).

    Weaknesses:

    There are two main weaknesses of this paper. First, although the authors claim that the method compares favorably to other machine learning methods, they only provide head-to-head performance comparisons for drosophila songs. There are no direct comparisons against other methods for mouse ultrasonic or songbird vocalizations, nor is there a readable summary of performance numbers from the literature. This makes it difficult for the reader to assess the authors' claim that DeepSS is a significant advance over the state of the art.

    Second, the authors provide little discussion of optimizing network parameters for a given data set. If the software is to be useful for the non-expert, a broader discussion of considerations for setting parameters would be useful. How should one choose the stack number, or the size or number of kernels? Moreover, early in the paper DeepSS is touted as a method that learns directly from the raw audio data, in contrast to methods that rely on Fourier or wavelet transforms. Yet the Methods reveal that a short-time Fourier front-end was used for both the mouse and songbird data.
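
    One concrete consideration when choosing the stack number and kernel size is the resulting receptive field. For a stack of dilated convolutions with doubling dilation rates (the usual TCN convention; not a prescription from the paper), it can be computed directly:

    ```python
    # Receptive field of a stack of dilated 1-D convolutions with doubling
    # dilation rates: one practical quantity to check when choosing the number
    # of layers ("stack number") and the kernel size.

    def receptive_field(n_layers, kernel_size):
        rf = 1
        for i in range(n_layers):
            rf += (kernel_size - 1) * 2 ** i
        return rf

    for n_layers in (4, 6, 8):
        for kernel_size in (3, 17, 33):
            print(n_layers, kernel_size, receptive_field(n_layers, kernel_size), "samples")
    ```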