Beyond Accuracy: Reliability-Aware Cross-Farm Evaluation of Dairy Cow Vocalization Models
Abstract
Automated analysis of dairy cow vocalizations has largely relied on supervised classifiers evaluated within a single farm, a setting that inflates apparent performance and gives no measure of how far predictions can be trusted. We address this with a three-layer framework that separates acoustic structure discovery, proxy-state inference, and reliability assessment, evaluated on 569 annotated clips from three commercial dairy farms. A frozen self-supervised speech encoder, latent-space segmentation, and stability-guided clustering convert continuous recordings into discrete acoustic units without behavioral labels. Proxy-state signal is then tested under audio-only, audio-plus-context, and leave-one-farm-out (LOFO) protocols designed to separate transferable acoustic structure from farm-specific shortcuts. The results suggest that cross-farm generalizability differs substantially across biologically distinct vocalization categories. Non-vocal physiological sounds transfer across farms (LOFO macro-F1 = 0.763) and calibrate well (expected calibration error reduced from 0.087 to 0.023), whereas resource-related calls collapse to a majority-class baseline (macro-F1 = 0.500) and distress-related calls degrade under farm holdout. Selective prediction improves the retained-set score of the multiclass functional proxy (0.407 to 0.430), and an end-to-end convolutional baseline matches or exceeds the framework on raw accuracy for the easier targets yet yields a roughly two- to six-fold larger calibration error and offers no abstention. Random cross-validation consistently overstates cross-farm utility. These findings show that acoustic models for livestock monitoring require reliability-aware evaluation rather than flat classification.
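The three evaluation ideas named in the abstract (leave-one-farm-out splitting, expected calibration error, and selective prediction with abstention) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the equal-width 10-bin ECE, and the fixed confidence threshold for abstention are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def lofo_splits(farm_ids):
    """Yield (held_out_farm, train_idx, test_idx), holding out one farm at a time."""
    farm_ids = np.asarray(farm_ids)
    for farm in np.unique(farm_ids):
        test_mask = farm_ids == farm
        yield farm, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def expected_calibration_error(confidence, correct, n_bins=10):
    """Equal-width-bin ECE: coverage-weighted mean of |accuracy - mean confidence|.

    The binning scheme is an assumption; the abstract does not state one.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

def selective_accuracy(confidence, correct, threshold):
    """Abstain below `threshold`; return (coverage, accuracy on retained predictions)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    keep = confidence >= threshold
    if not keep.any():
        return 0.0, float("nan")
    return float(keep.mean()), float(correct[keep].mean())
```

In use, each LOFO fold would train on the indices from two farms and score the third, so that farm-specific shortcuts cannot leak into the test set. The calibration and abstention functions would then be applied to the held-out predictions of each fold.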