ASBAR: an Animal Skeleton-Based Action Recognition framework. Recognizing great ape behaviors in the wild using pose estimation

Curation statements for this article:
  • Curated by eLife


    eLife Assessment

    This important study presents a new framework (ASBAR) that combines open-source toolboxes for pose estimation and behavior recognition to automate the process of categorizing behaviors in wild apes from video data. The authors present compelling evidence that this pipeline can categorize simple wild ape behaviors from out-of-context video at a similar level of accuracy as previous models, while simultaneously vastly reducing the size of the model. The study's results should be of particular interest to primatologists and other behavioral biologists working with natural populations.


Abstract

The study and classification of animal behaviors have traditionally relied on direct human observation or video analysis, processes that are labor-intensive, time-consuming, and prone to human bias. Advances in machine learning for computer vision, particularly in pose estimation and action recognition, offer transformative potential to enhance the understanding of animal behaviors. However, the integration of these technologies for behavior recognition remains underexplored, particularly in natural settings.

We introduce ASBAR (Animal Skeleton-Based Action Recognition), a novel framework that integrates pose estimation and behavior recognition into a cohesive pipeline. To demonstrate its utility, we tackled the challenging task of classifying natural behaviors of great apes in the wild.

Our approach leverages the OpenMonkeyChallenge dataset, one of the largest open-source primate pose datasets, to train a robust pose estimation model using DeepLabCut. Subsequently, we extracted skeletal motion data from the PanAf500 dataset, a collection of in-the-wild videos of gorillas and chimpanzees annotated with nine behavior categories. Using PoseConv3D from MMAction2, we trained a skeleton-based action recognition model, achieving a Top-1 accuracy of 75.3%. This performance is comparable to previous video-based methods while reducing input data size by approximately 20-fold, offering significant advantages in computational efficiency and storage.

To support further research, we provide an open-source, terminal-based GUI for training and evaluation, along with a dataset of 5,440 annotated keypoints for replication and extension to other species and behaviors.

All models, code, and data are publicly available at: https://github.com/MitchFuchs/asbar
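For readers who want a concrete picture of how the two-stage pipeline described above fits together, the sketch below shows one plausible way to run a trained DeepLabCut pose model over a behavior-annotated clip and package the resulting skeleton for MMAction2. This is illustrative only and not the authors' released implementation: the file names, scorer string, keypoint count, class index, and annotation keys are assumptions that may differ from the ASBAR repository.

```python
# Illustrative sketch only (not the ASBAR code): DeepLabCut predictions -> MMAction2 skeleton sample.
import pickle
import numpy as np
import pandas as pd
import deeplabcut

DLC_CONFIG = "omc_project/config.yaml"   # hypothetical DeepLabCut project trained on OMC keypoints
VIDEO = "panaf_clip_0001.mp4"            # one behavior-annotated PanAf clip (hypothetical name)
SCORER = "DLC_resnet152_omc"             # hypothetical scorer suffix appended by DeepLabCut
N_KEYPOINTS = 17                         # OMC-style keypoint set

# 1) Run the trained pose model on the clip; DeepLabCut writes per-frame
#    (x, y, likelihood) predictions for every keypoint.
deeplabcut.analyze_videos(DLC_CONFIG, [VIDEO], save_as_csv=True)

# 2) Load the predictions. DeepLabCut CSVs use a 3-level header (scorer / bodypart / coord).
df = pd.read_csv(f"panaf_clip_0001{SCORER}.csv", header=[0, 1, 2], index_col=0)
xy = df.to_numpy().reshape(len(df), N_KEYPOINTS, 3)     # (T, V, [x, y, likelihood])

# 3) Pack a single-animal skeleton sample in the shape MMAction2's PoseDataset
#    commonly expects: keypoints (num_person, T, V, 2) plus per-keypoint scores.
sample = {
    "frame_dir": "panaf_clip_0001",
    "label": 4,                                          # e.g. the index of a "walking" class
    "img_shape": (404, 720),                             # (height, width) of the source video
    "total_frames": xy.shape[0],
    "keypoint": xy[None, :, :, :2].astype(np.float32),
    "keypoint_score": xy[None, :, :, 2].astype(np.float32),
}

# 4) Collect such samples for all clips and pickle them. Whether the file is a flat
#    list or a dict with 'split'/'annotations' keys depends on the MMAction2 version;
#    a PoseConv3D model is then trained with an MMAction2 skeleton config, e.g.
#    `python tools/train.py <posec3d_config>.py`.
with open("panaf_skeletons.pkl", "wb") as f:
    pickle.dump([sample], f)
```

In practice, the ASBAR framework wraps these steps (and model evaluation) behind its terminal-based GUI, so the snippet is only meant to make the data flow between the pose estimation and action recognition stages explicit.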

Article activity feed

  1. eLife Assessment

    This important study presents a new framework (ASBAR) that combines open-source toolboxes for pose estimation and behavior recognition to automate the process of categorizing behaviors in wild apes from video data. The authors present compelling evidence that this pipeline can categorize simple wild ape behaviors from out-of-context video at a similar level of accuracy as previous models, while simultaneously vastly reducing the size of the model. The study's results should be of particular interest to primatologists and other behavioral biologists working with natural populations.

  2. Reviewer #1 (Public review):

    Summary:

    Advances in machine vision and computer learning have meant that there are now state-of-the-art and open-source toolboxes that allow for animal pose estimation and action recognition. These technologies have the potential to revolutionize behavioral observations of wild primates but are often held back by labor-intensive model training and the need for some programming knowledge to effectively leverage such tools. The study presented here by Fuchs et al. unveils a new framework (ASBAR) that aims to automate behavioral recognition in wild apes from video data. This framework combines robustly trained and well-tested pose estimation and behavioral action recognition models. The framework performs admirably at the task of automatically identifying simple behaviors of wild apes from camera trap videos of variable quality and contexts. These results indicate that skeletal-based action recognition offers a reliable and lightweight methodology for studying ape behavior in the wild, and the presented framework and GUI offer an accessible route for other researchers to utilize such tools.

    Given that automated behavior recognition in wild primates will likely be a major future direction within many subfields of primatology, open-source frameworks, like the one presented here, will have a significant impact on the field and will provide a strong foundation for others to build future research upon.

    Strengths:

    Clearly articulated the argument as to why the framework was needed and what advantages it could convey to the wider field.

    For a very technical paper, it was very well written. For every aspect of the framework, the authors clearly explained why it was chosen and how it was trained and tested. This information was broken down in a clear and easily digestible way that will be appreciated by technical and non-technical audiences alike.

    The study demonstrates which pose estimation architectures produce the most robust models for both within-context and out-of-context pose estimates. This is invaluable knowledge for those wanting to produce their own robust models.

    The comparison of skeletal-based action recognition with other methodologies for action recognition is helpful in contextualizing the results.

    Weaknesses:

    While I note that this is a paper most likely aimed at the more technical reader, it will also be of interest to a wider primatological readership, including those who work extensively in the field. When outlining the need for future work, I felt the paper offered almost exclusively very technical directions. This may have been a missed opportunity to engage the wider readership and suggest some practical ways those in the field could collect more ASBAR-friendly video data to further improve accuracy.

    Comments on latest version:

    I think the new version is an improvement and applaud the authors on a well-written article that conveys some very technical details excellently. The authors have addressed my initial comments about reaching out to a wider, sometimes less technical, primatological audience by encouraging researchers to create large annotated datasets and make these publicly accessible. I also agree that fostering interdisciplinary collaboration is the best way to progress this field of research. These additions have certainly strengthened the paper, but I still think more practical advice could have been added on collecting the high-quality training data needed to improve pose estimates and behavioral classification in tough out-of-context environments. This doesn't detract from the quality of the paper, though.

  3. Reviewer #2 (Public review):

    Fuchs et al. propose a framework for action recognition based on pose estimation. They integrate functions from DeepLabCut and MMAction2, two popular machine learning frameworks for behavioral analysis, in a new package called ASBAR.

    They test their framework by:

    Running pose estimation experiments on the OpenMonkeyChallenge (OMC) dataset (the public train + val parts) with DeepLabCut

    Annotating pose data in around 320 images from the PanAf dataset (which contains behavioral annotations). They show that the ResNet-152 model generalizes best from the OMC data to this out-of-domain dataset.

    They then train a skeleton-based action recognition model on PanAf and show that the top-1/3 accuracy is slightly higher than that of video-based methods.

  4. Author Response:

    The following is the authors’ response to the original reviews.

    Reviewer #1 (Public Review)

    Summary:

    Advances in machine vision and computer learning have meant that there are now state-of-the-art and open-source toolboxes that allow for animal pose estimation and action recognition. These technologies have the potential to revolutionize behavioral observations of wild primates but are often held back by labor-intensive model training and the need for some programming knowledge to effectively leverage such tools. The study presented here by Fuchs et al. unveils a new framework (ASBAR) that aims to automate behavioral recognition in wild apes from video data. This framework combines robustly trained and well-tested pose estimation and behavioral action recognition models. The framework performs admirably at the task of automatically identifying simple behaviors of wild apes from camera trap videos of variable quality and contexts. These results indicate that skeletal-based action recognition offers a reliable and lightweight methodology for studying ape behavior in the wild, and the presented framework and GUI offer an accessible route for other researchers to utilize such tools.

    Given that automated behavior recognition in wild primates will likely be a major future direction within many subfields of primatology, open-source frameworks, like the one presented here, will have a significant impact on the field and will provide a strong foundation for others to build future research upon.

    Strengths:

    Clearly articulated the argument as to why the framework was needed and what advantages it could convey to the wider field.

    For a very technical paper, it was very well written. For every aspect of the framework, the authors clearly explained why it was chosen and how it was trained and tested. This information was broken down in a clear and easily digestible way that will be appreciated by technical and non-technical audiences alike.

    The study demonstrates which pose estimation architectures produce the most robust models for both within-context and out-of-context pose estimates. This is invaluable knowledge for those wanting to produce their own robust models.

    The comparison of skeletal-based action recognition with other methodologies for action recognition helps contextualize the results.

    We thank Reviewer #1 for their thoughtful and constructive review of our manuscript. We are especially grateful for your recognition of the clarity of the manuscript, the strength of the technical framework, and its accessibility to both technical and non-technical audiences. Your feedback highlights exactly the kind of interdisciplinary engagement we hope to foster with this work.

    Weaknesses

    While I note that this is a paper most likely aimed at the more technical reader, it will also be of interest to a wider primatological readership, including those who work extensively in the field. When outlining the need for future work, I felt the paper offered almost exclusively very technical directions. This may have been a missed opportunity to engage the wider readership and suggest some practical ways those in the field could collect more ASBAR-friendly video data to further improve accuracy.

    We appreciate this insightful suggestion and fully agree that emphasizing practical relevance is important for engaging a broader readership. In response, we have reformulated the opening of the Discussion section to place stronger emphasis on the value of shared, open-source resources and the real-world accessibility of the ASBAR framework. The revised text explicitly highlights the practical benefits of ASBAR for field researchers working in resource-constrained environments, and underscores the importance of community-driven data sharing to advance behavioral research in natural settings.

    This section now reads: Despite the growing availability of open-source resources, such as large-scale animal pose datasets and machine learning toolboxes for pose estimation and human skeleton-based action recognition, their integration for animal behavior recognition—particularly in natural settings—remains largely unexplored. With ASBAR, a framework combining animal pose estimation and skeleton-based action recognition, we provide a comprehensive data and model pipeline, methodology, and GUI to assist researchers in automatically classifying animal behaviors via pose estimation. We hope these resources will become valuable tools for advancing the understanding of animal behavior within the research community.

    To illustrate ASBAR’s capabilities, we applied it to the challenging task of classifying great ape behaviors in their natural habitat. Our skeleton-based approach achieved accuracy comparable to previous video-based studies for Top-K and Mean Class Accuracies. Additionally, by reducing the input size of the action recognition model by a factor of approximately 20 compared to video-based methods, our approach requires significantly less computational power, storage space, and data transfer resources. These qualities make ASBAR particularly suitable for field researchers working in resource-constrained environments.

    Our framework and results are built on the foundation of shared and open-source materials, including tools like DeepLabCut, MMAction2, and datasets such as OpenMonkeyChallenge and PanAf500. This underscores the importance of making resources publicly available, especially in primatology, where data scarcity often impedes progress in AI-assisted methodologies. We strongly encourage researchers with large annotated video datasets to make them publicly accessible to foster interdisciplinary collaboration and further advancements in animal behavior research.

    Reviewer #2 (Public Review)

    Fuchs et al. propose a framework for action recognition based on pose estimation. They integrate functions from DeepLabCut and MMAction2, two popular machine-learning frameworks for behavioral analysis, in a new package called ASBAR.

    They test their framework by

    Running pose estimation experiments on the OpenMonkeyChallenge (OMC) dataset (the public train + val parts) with DeepLabCut.

    Annotating pose data in around 320 images from the PanAf dataset (which contains behavioral annotations). They show that the ResNet-152 model generalizes best from the OMC data to this out-of-domain dataset.

    They then train a skeleton-based action recognition model on PanAf and show that the top-1/3 accuracy is slightly higher than that of video-based methods (and strong), but that the mean class accuracy is lower (33% vs. 42%), likely due to the imbalanced class frequencies. This should be clarified. For Table 1, confidence intervals would also be good (just like for the pose estimation results, where this is done very well).

    We thank Reviewer #2 for their clear and helpful summary of our work, and for the thoughtful suggestions to improve the manuscript. We appreciate this observation. In the revised manuscript, we now clarify that the lower Mean Class Accuracy (MCA) in the initial version was indeed driven by significant class imbalance in the PanAf dataset, which contains highly uneven representation across behavior categories. To address this, we made two key improvements to the action recognition model:

    (1) We replaced the standard cross-entropy loss with a class-balanced focal loss, following the approach of Sakib et al. (2021), to better account for rare behaviors during training.

    (2) We initialized the PoseConv3D model with pretrained weights from FineGym (Shao et al., 2020) rather than training from scratch, which increased performance across underrepresented classes.

    Together, these changes substantially improved model performance on tail classes, increasing the Mean Class Accuracy from 33.6% to 47%, now exceeding that of the video-based baseline.
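    For illustration, the first of these changes can be sketched generically as follows: a minimal PyTorch sketch of a class-balanced focal loss, which weights each class by the inverse of its "effective number" of samples and down-weights easy examples. The beta and gamma values and the class counts in the example are assumed, and the snippet is not our exact implementation following Sakib et al. (2021).

```python
# Generic class-balanced focal loss (sketch; hyperparameters and counts are assumptions).
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class, beta=0.999, gamma=2.0):
    """logits: (N, C) raw scores, targets: (N,) class indices, samples_per_class: (C,) counts."""
    counts = torch.as_tensor(samples_per_class, dtype=torch.float32, device=logits.device)

    # Class weights from the "effective number" of samples: rare behaviors get larger weights.
    effective_num = 1.0 - torch.pow(torch.tensor(beta), counts)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(counts)      # normalize so weights average to 1

    # Focal term: down-weight examples the model already classifies confidently.
    log_probs = F.log_softmax(logits, dim=-1)
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)   # prob. of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    focal = (1.0 - pt) ** gamma * (-log_pt)

    return (weights[targets] * focal).mean()

# Example with nine behavior classes of highly uneven frequency (hypothetical counts).
logits = torch.randn(4, 9)
targets = torch.tensor([0, 3, 3, 8])
counts = [5000, 2200, 900, 600, 400, 250, 120, 60, 30]
loss = class_balanced_focal_loss(logits, targets, counts)
```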

    Moreover, we sincerely thank Reviewer #2 for the thorough and constructive private feedback. Your comments have greatly helped us improve both the structure and clarity of the manuscript, and we have implemented several key revisions based on your recommendations to streamline the text and sharpen its focus on the core contributions. In particular, we have revised the tone of both the Introduction and Discussion sections to more modestly and accurately reflect the scope of our findings. We removed unnecessary implementation details—such as the description of graph-based models that were not part of the final pipeline—to avoid distracting tangents. The Methods section has been clarified and consolidated to include all evaluation metrics, a description of the data augmentation, and other methodological elements that were previously scattered across the Results section. Additionally, the Discussion now explicitly addresses the limitations of our EfficientNet results, including a dedicated paragraph that acknowledges the use of suboptimal hyperparameters and highlights the need for architecture-specific tuning, particularly with respect to learning rate schedules.

  5. eLife assessment

    This valuable study presents a new framework (ASBAR) that combines open-source toolboxes for pose estimation and behavior recognition to automate the process of categorizing behaviors in wild apes from video data. The authors present compelling evidence that this pipeline can categorize simple wild ape behaviors from out-of-context video at a similar level of accuracy as previous models, while simultaneously vastly reducing the size of the model. The study's results should be of particular interest to primatologists and other behavioral biologists working with natural populations.

  6. Reviewer #1 (Public Review):

    Summary:

    Advances in machine vision and computer learning have meant that there are now state-of-the-art and open-source toolboxes that allow for animal pose estimation and action recognition. These technologies have the potential to revolutionize behavioral observations of wild primates but are often held back by labor-intensive model training and the need for some programming knowledge to effectively leverage such tools. The study presented here by Fuchs et al. unveils a new framework (ASBAR) that aims to automate behavioral recognition in wild apes from video data. This framework combines robustly trained and well-tested pose estimation and behavioral action recognition models. The framework performs admirably at the task of automatically identifying simple behaviors of wild apes from camera trap videos of variable quality and contexts. These results indicate that skeletal-based action recognition offers a reliable and lightweight methodology for studying ape behavior in the wild, and the presented framework and GUI offer an accessible route for other researchers to utilize such tools.

    Given that automated behavior recognition in wild primates will likely be a major future direction within many subfields of primatology, open-source frameworks, like the one presented here, will have a significant impact on the field and will provide a strong foundation for others to build future research upon.

    Strengths:

    - Clearly articulated the argument as to why the framework was needed and what advantages it could convey to the wider field.

    - For a very technical paper, it was very well written. For every aspect of the framework, the authors clearly explained why it was chosen and how it was trained and tested. This information was broken down in a clear and easily digestible way that will be appreciated by technical and non-technical audiences alike.

    - The study demonstrates which pose estimation architectures produce the most robust models for both within-context and out-of-context pose estimates. This is invaluable knowledge for those wanting to produce their own robust models.

    - The comparison of skeletal-based action recognition with other methodologies for action recognition helps contextualize the results.

    Weaknesses

    While I note that this is a paper most likely aimed at the more technical reader, it will also be of interest to a wider primatological readership, including those who work extensively in the field. When outlining the need for future work, I felt the paper offered almost exclusively very technical directions. This may have been a missed opportunity to engage the wider readership and suggest some practical ways those in the field could collect more ASBAR-friendly video data to further improve accuracy.

  7. Reviewer #2 (Public Review):

    Fuchs et al. propose a framework for action recognition based on pose estimation. They integrate functions from DeepLabCut and MMAction2, two popular machine-learning frameworks for behavioral analysis, in a new package called ASBAR.

    They test their framework by

    - Running pose estimation experiments on the OpenMonkeyChallenge (OMC) dataset (the public train + val parts) with DeepLabCut.

    - Annotating pose data in around 320 images from the PanAf dataset (which contains behavioral annotations). They show that the ResNet-152 model generalizes best from the OMC data to this out-of-domain dataset.

    - They then train a skeleton-based action recognition model on PanAf and show that the top-1/3 accuracy is slightly higher than that of video-based methods (and strong), but that the mean class accuracy is lower (33% vs. 42%), likely due to the imbalanced class frequencies. This should be clarified. For Table 1, confidence intervals would also be good (just like for the pose estimation results, where this is done very well).