SAM-Whistle: Adapting Foundation Models for Automated Dolphin Whistle Detection
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Automated detection of calls is essential to bioacoustic research where it is routine to collect large data sets that preclude effective human annotation. Automated analysis remains challenging due to variations in calls, varying noise backgrounds, and limited annotation data. The recent emergence of very large foundation models has demonstrated exceptional zero-shot and few-shot generalization capabilities, often matching or surpassing specialized models that are fine-tuned for specific domains. We demonstrate that such models can be adapted to bioacoustic tasks. We adapt segment anything model (SAM), a foundational vision model designed for image segmentation, to identify energy in a time-freqeuncy representation of whistle vocalizations produced by toothed whales. Analysis of these predictions by an existing tracking algorithm can improve retrieval of fine-grained call characteristics, generating time-frequency contours that follow individual whistles with high fidelity. The model, SAM-whistle, leverages SAM’s pretrained vision transformer network as an encoder network that is adapted to generate spectrogram embeddings that capture information about whistles. The resultant embeddings are decoded using a convolutional decoder that produces time-frequency masks that indicate the presence of whistle energy. Individual whistle contours are produced from the time-frequency masks by the tracking algorithm. Comprehensive evaluations demonstrate that SAM-whistle has strong performance, achieving an F1 score of 0.897 (precision 0.920, recall 0.876) and demonstrating consistency across a large range of detection thresholds. The model is effective at retrieving nearly complete whistle contours, producing annotations that typically cover 89% of analyst-annotated contours, and producing an expected 1.15 detections for each whistle, indicating that most whistles are not fragmented into multiple detections. Qualitative analysis further demonstrates the effectiveness of SAM-whistle in capturing long-duration whistles and distinguishing unique acoustic patterns. These results highlight the potential of adapting foundation models for specialized bioacoustic analysis tasks, providing a scalable and efficient solution for the automated detection of wildlife vocalizations.