BehaviorDEPOT is a simple, flexible tool for automated behavioral detection based on markerless pose tracking

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This paper is of potential interest to researchers performing animal behavioral quantification with computer vision tools. The manuscript introduces 'BehaviorDEPOT', a MATLAB application and GUI intended to facilitate quantification and analysis of freezing behavior from behavior movies, along with several other classifiers based on movement statistics calculated from animal pose data. The paper describes how the tool can be applied to several specific types of experiments, and emphasizes the ease of use - particularly for groups without experience in coding or behavioral quantification. While these aims are laudable, and the software is relatively easy to use, further improvements to make the tool more automated would substantially broaden the likely user base.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

Quantitative descriptions of animal behavior are essential to study the neural substrates of cognitive and emotional processes. Analyses of naturalistic behaviors are often performed by hand or with expensive, inflexible commercial software. Recently, machine learning methods for markerless pose estimation enabled automated tracking of freely moving animals, including in labs with limited coding expertise. However, classifying specific behaviors based on pose data requires additional computational analyses and remains a significant challenge for many groups. We developed BehaviorDEPOT (DEcoding behavior based on POsitional Tracking), a simple, flexible software program that can detect behavior from video timeseries and can analyze the results of experimental assays. BehaviorDEPOT calculates kinematic and postural statistics from keypoint tracking data and creates heuristics that reliably detect behaviors. It requires no programming experience and is applicable to a wide range of behaviors and experimental designs. We provide several hard-coded heuristics. Our freezing detection heuristic achieves above 90% accuracy in videos of mice and rats, including those wearing tethered head-mounts. BehaviorDEPOT also helps researchers develop their own heuristics and incorporate them into the software’s graphical interface. Behavioral data is stored framewise for easy alignment with neural data. We demonstrate the immediate utility and flexibility of BehaviorDEPOT using popular assays including fear conditioning, decision-making in a T-maze, open field, elevated plus maze, and novel object exploration.

Article activity feed

  1. Author Response:

    Reviewer #1 (Public Review):

    This paper is of potential interest to researchers performing animal behavioral quantification with computer vision tools. The manuscript introduces 'BehaviorDEPOT', a MATLAB application and GUI intended to facilitate quantification and analysis of freezing behavior from behavior movies, along with several other classifiers based on movement statistics calculated from animal pose data. The paper describes how the tool can be applied to several specific types of experiments, and emphasizes the ease of use - particularly for groups without experience in coding or behavioral quantification. While these aims are laudable, and the software is relatively easy to use, further improvements to make the tool more automated would substantially broaden the likely user base.

    In this manuscript, the authors introduce a new piece of software, BehaviorDEPOT, that aims to serve as an open source classifier in service of standard lab-based behavioral assays. The key arguments the authors make are that 1) the open source code allows for freely available access, 2) the code doesn't require any coding knowledge to build new classifiers, 3) it is generalizable to other behaviors than freezing and other species (although this latter point is not shown), 4) that it uses posture-based tracking that allows for higher resolution than centroid-based methods, and 5) that it is possible to isolate features used in the classifiers. While these aims are laudable, and the software is indeed relatively easy to use, I am not convinced that the method represents a large conceptual advance or would be highly used outside the rodent freezing community.

    Major points:

    1. I'm not convinced over one of the key arguments the authors make - that the limb tracking produces qualitatively/quantitatively better results than centroid/orientation tracking alone for the tasks they measure. For example, angular velocities could be used to identify head movements. It would be good to test this with their data (could you build a classifier using only the position/velocity/angular velocities of the main axis of the body?)
    2. This brings me to the point that the previous state-of-the-art open-source methodology, JAABA, is barely mentioned, and I think that a more direct comparison is warranted, especially since this method has been widely used/cited and is also aimed at a not-coding audience.

    Here we address points 1 and 2 together. JAABA has been widely adopted by the Drosophila community with great success. However, we noticed that fewer studies use JAABA to study rodents. The ones that did typically examined social behaviors or gross locomotion, usually in an empty arena such as an open field or a standard homecage. In a study of mice performing reaching/grasping tasks against complex backgrounds, investigators modified the inner workings of JAABA to classify behavior (Sauerbrei et al., 2020), an approach that is largely inaccessible to inexperienced coders. This suggested to us that it may be challenging to implement JAABA for many rodent behavioral assays.

    We directly compared BehaviorDEPOT to JAABA and determined that BehaviorDEPOT outperforms JAABA in several ways. First, we used MoTr and Ctrax (the open-source centroid tracking software packages that are typically used with JAABA) to track animals in videos we had recorded previously. Both MoTr and Ctrax could fit ellipses to mice in an open field, in which the mouse is small relative to the environment and runs against a clean white background. However, consistent with previous reports (Geuther et al., Comm. Bio, 2019), MoTr and Ctrax performed poorly when rodents were in fear conditioning chambers, which have high-contrast bars on the floor (Fig. 10A–C). These tracking-related hurdles may explain, at least in part, why relatively few rodent studies have employed JAABA.

    We next tried to import our DeepLabCut (DLC) tracking data into JAABA. The JAABA website instructs users to employ Animal Part Tracker (https://kristinbranson.github.io/APT/) to convert DLC outputs into a format that is compatible with JAABA. We discovered that APT was not compatible with the current version of DLC, an insurmountable hurdle for labs with limited coding expertise. We wrote our own code to estimate a centroid from DLC keypoints and fed the data into JAABA to train a freezing classifier. Even when we gave JAABA more training data than we used to develop BehaviorDEPOT classifiers (6 videos vs. 3 videos), BehaviorDEPOT achieved higher Recall and F1 scores (Fig. 10D).
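
    For readers who want to reproduce this kind of conversion, the sketch below shows one minimal way to collapse DLC keypoints into a per-frame centroid in MATLAB. The file name, likelihood cutoff, and column layout (a standard DLC CSV with three header rows and x/y/likelihood triplets per body part) are assumptions for illustration, not the exact code we used.

    ```matlab
    % Minimal sketch: estimate a per-frame centroid from DLC keypoints.
    % Assumes a standard DLC output CSV with three header rows and
    % x, y, likelihood columns for each tracked body part.
    raw = readmatrix('example_DLC_output.csv', 'NumHeaderLines', 3);

    coords = raw(:, 2:end);              % drop the frame-index column
    nParts = size(coords, 2) / 3;        % x, y, likelihood per body part

    X = coords(:, 1:3:end);              % nFrames x nParts
    Y = coords(:, 2:3:end);
    P = coords(:, 3:3:end);              % DLC likelihood scores

    % Ignore low-confidence points when averaging (cutoff is arbitrary here).
    X(P < 0.9) = NaN;
    Y(P < 0.9) = NaN;

    centroid = [mean(X, 2, 'omitnan'), mean(Y, 2, 'omitnan')];  % nFrames x 2
    ```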

    In response to point 1, we also trained a VTE classifier with JAABA. When we tested its performance on a separate set of test videos, JAABA could not distinguish VTE vs. non-VTE trials. It labeled every trial as containing VTE (Fig. 10E), indicating that a fitted ellipse is not sufficient to detect fine angular head movements. JAABA has additional limitations as well. For instance, JAABA reports the occurrence of behavior in a video timeseries but does not allow researchers to analyze the results of experiments. BehaviorDEPOT shares features of programs like Ethovision or ANYmaze in that it can classify behaviors and also report their occurrence with reference to spatial and temporal cues. These direct comparisons address some of the key concerns centered around the advances BehaviorDEPOT offers beyond JAABA. They also highlight the need for new behavioral analysis software targeted towards a noncoding audience, particularly in the rodent domain.

    3. Remaining on JAABA: while the authors' classification approach appeared to depend mostly on a relatively small number of features, JAABA uses boosting to build a very good classifier out of many not-so-good classifiers. This approach is well-worn in machine learning and has been used to good effect in high-throughput behavioral data. I would like the authors to comment on why they decided on the classification strategy they have.

    We built algorithmic classifiers around keypoint tracking because of the accuracy, flexibility, and speed it affords. Like many behavior classification programs, JAABA relies on tracking algorithms that use background subtraction (MoTr) or pattern classifiers (Ctrax) to segment animals from the environment and then abstract their position to an ellipse. These methods are highly sensitive to changes in the experimental arena and cannot resolve fine movements of individual body parts (Geuther et al., Comm. Bio, 2019; Pennington et al., Sci. Rep. 2019; Fig. 10A). Keypoint tracking is more accurate and less sensitive to environmental changes. Models can be trained to detect animals in any environment, so researchers can analyze videos they have already collected. Any set of body parts can be tracked and fine movements such as head turns can be easily resolved (Fig. 10E).

    Keypoint tracking can be used to simultaneously track the location of animals and classify a wide range of behaviors. Integrated spatial-behavioral analysis is relevant to many assays including fear conditioning, avoidance, T-mazes (decision making), Y-mazes (working memory), open field (anxiety, locomotion), elevated plus maze (anxiety), novel object exploration, and social memory. Quantifying behaviors in these assays requires analysis of fine movements (we now show Novel Object Exploration, Fig. 5, and VTE, Fig. 6, as examples). These behaviors have been carefully defined by expert researchers. Algorithmic classifiers can be created quickly and intuitively based on small amounts of video data (Table 4) and easily tweaked for out-of-sample data (Fig. 9). Additional rounds of machine learning are time-consuming, computationally intensive, and unnecessary, and we show in Figure 10 that JAABA classifiers have higher error rates than BehaviorDEPOT classifiers, even when provided with a larger set of training data. Moreover, while JAABA reports behaviors in video timeseries, BehaviorDEPOT has integrated features that report behavior occurring at the intersection of spatial and temporal cues (e.g. ROIs, optogenetics, conditioned cues), so it can also analyze the results of experiments. The automated, intuitive, and flexible way in which BehaviorDEPOT classifies and quantifies behavior will propel new discoveries by allowing even inexperienced coders to capitalize on the richness of their data.

    Thank you for raising these questions. We did an extensive rewrite of the intro and discussion to ensure these important points are clear.

    4. I would also like more details on the classifiers the authors used. There is some detail in the main text, but a specific section in the Methods section is warranted, I believe, for transparency. The same goes for all of the DLC post-processing steps.

    Apologies for the lack of detail. We included much more detail in both the results and methods sections that describe how each classifier works, how they were developed and validated, and how the DLC post-processing steps work.

    5. It would be good for the authors to compare the Inter-Rater Module to the methods described in the MARS paper (reference 12 here).

    We included some discussion of how the BehaviorDEPOT Inter-Rater Module compares to MARS.

    6. More quantitative discussion about the effect of tracking errors on the classifier would be ideal. No tracking is perfect, so an end-user will need to know "how good" they need to get the tracking to get the results presented here.

    We included a table detailing the specs of our DLC models and the videos that we used for validating our classifiers (Table 4). We also added a paragraph about designing video ‘training’ and test sets to the methods.

    Reviewer #2 (Public Review):

    BehaviorDEPOT is a Matlab-based user interface aimed at helping users interact with animal pose data without significant coding experience. It is composed of several tools for analysis of animal tracking data, as well as a data collection module that can interface via Arduino to control experimental hardware. The data analysis tools are designed for post-processing of DeepLabCut pose estimates and manual pose annotations, and includes four modules: 1) a Data Exploration module for visualizing spatiotemporal features computed from animal pose (such as velocity and acceleration), 2) a Classifier Optimization module for creating hand-fit classifiers to detect behaviors by applying windowing to spatiotemporal features, 3) a Validation module for evaluating performance of classifiers, and 4) an Inter-Rater Agreement module for comparing annotations by different individuals.

    A strength of BehaviorDEPOT is its combination of many broadly useful data visualization and evaluation modules within a single interface. The four experimental use cases in the paper nicely showcase various features of the tool, working the user from the simplest example (detecting optogenetically induced freezing) to a more sophisticated decision-making example in which BehaviorDEPOT is used to segment behavioral recordings into trials, and within trials to count head turns per trial to detect deliberative behavior (vicarious trial and error, or VTE.) The authors also demonstrate the application of their software using several different animal pose formats (including from 4 to 9 tracked body parts) from multiple camera types and framerates.

    1. One point that confused me when reading the paper was whether BehaviorDEPOT was using a single, fixed freezing classifier, or whether the freezing classifier was being tuned to each new setting (the latter is the case.) The abstract, introduction, and "Development of the BehaviorDEPOT Freezing Classifier" sections all make the freezing classifier sound like a fixed object that can be run "out-of-the-box" on any dataset. However, the subsequent "Analysis Module" section says it implements "hard-coded classifiers with adjustable parameters", which makes it clear that the freezing classifier is not a fixed object, but rather it has a set of parameters that can (must?) be tuned by the user to achieve desired performance. It is important to note that the freezing classifier performances reported in the paper should therefore be read with the understanding that these values are specific to the particular parameter configuration found (rather than reflecting performance a user could get out of the box.)

    Our classifier does work quite well “out of the box”. We developed our freezing classifier based on a small number of videos recorded with a FLIR Chameleon3 camera at 50 fps (Fig. 2F). We then demonstrated its high accuracy in three separately acquired data sets (webcam, FLIR+optogenetics, and Minicam+Miniscope, Fig. 2–4, Table 4). The same classifier also had excellent performance in mice and rats from external labs. With minor tweaks to the threshold values, we were able to classify freezing with F1>0.9 (Fig. 9). This means that the predictive value of the metrics we chose (head angular velocity and back velocity) generalizes across experimental setups.
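
    To make the shape of this heuristic concrete, the sketch below thresholds back-point velocity and head angular velocity to produce a framewise freezing vector. It is an illustrative sketch only: the placeholder data, variable names, and threshold values are hypothetical and would be tuned in the Classifier Optimization Module rather than taken from this example.

    ```matlab
    % Illustrative sketch of a velocity / angular-velocity freezing heuristic.
    % Placeholder data: replace with real keypoint coordinates (nFrames x 2, pixels).
    nFrames = 1000;  fps = 50;
    backXY = cumsum(randn(nFrames, 2));
    headXY = cumsum(randn(nFrames, 2));
    noseXY = headXY + 10 + randn(nFrames, 2);

    velThresh = 20;   % back velocity threshold (pixels/s); placeholder, tuned per setup
    angThresh = 90;   % head angular velocity threshold (deg/s); placeholder, tuned per setup

    backVel = [0; hypot(diff(backXY(:,1)), diff(backXY(:,2)))] * fps;   % pixels/s

    headVec = noseXY - headXY;                                 % head direction vector
    headAng = unwrap(atan2(headVec(:,2), headVec(:,1)));       % head angle, radians
    angVel  = [0; abs(diff(headAng))] * (180/pi) * fps;        % deg/s

    isFreezing = backVel < velThresh & angVel < angThresh;     % framewise logical output
    ```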

    Popular freezing detection software, including FreezeFrame and VideoFreeze as well as the newly created ezTrack, also allows users to adjust freezing classifier thresholds. Allowing users to adjust thresholds ensures that the BehaviorDEPOT freezing classifier can be applied to videos that have already been recorded with different resolutions, lighting conditions, rodent species, etc. Indeed, the ability to easily adjust classifier thresholds for out-of-sample data represents one of the main advantages of hand-fitting classifiers. Yet BehaviorDEPOT offers additional advantages over FreezeFrame, VideoFreeze, and ezTrack. For one, it adds a level of rigor to the optimization step by quantifying classifier performance over a range of threshold values, helping users select the best ones. Also, it is free, it can quantify behavior with reference to user-defined spatiotemporal filters, and it can classify and analyze behaviors beyond freezing. We updated the results and discussion sections to make these points clear.

    2. This points to a central component of BehaviorDEPOT's design that makes its classifiers different from those produced by previously published behavior detection software such as JAABA or SimBA. So far as I can tell, BehaviorDEPOT includes no automated classifier fitting, instead relying on the users to come up with which features to use and which thresholds to assign to those features. Given that the classifier optimization module still requires manual annotations (to calculate classifier performance, Fig 7A), I'm unsure whether hand selection of features offers any kind of advantage over a standard supervised classifier training approach. That doesn't mean an advantage doesn't exist- maybe the hand-fit classifiers require less annotation data than a supervised classifier, or maybe humans are better at picking "appropriate" features based on their understanding of the behavior they want to study.

    See response to reviewer 1, point 3 above for an extensive discussion of the rationale for our classification method. See response to reviewer 2 point 3 below for an extensive discussion of the capabilities of the data exploration module, including new features we have added in response to Reviewer 2’s comments.

    3. There is something to be said for helping users hand-create behavior classifiers: it's easier to interpret the output of those classifiers, and they could prove easier to fine-tune to fix performance when given out-of-sample data. Still, I think it's a major shortcoming that BehaviorDEPOT only allows users to use up to two parameters to create behavior classifiers, and cannot create thresholds that depend on linear or nonlinear combinations of parameters (eg, Figure 6D indicates that the best classifier would take a weighted sum of head velocity and change in head angle.) Because of these limitations on classifier complexity, I worry that it will be difficult to use BehaviorDEPOT to detect many more complex behaviors.

    To clarify, users can combine as many parameters as they like to create behavior classifiers. However, the reviewer raises a good point and we have now expanded the functions of the Data Exploration Module. Now, users can choose ‘focused mode’ or ‘broad mode’ to explore their data. In focused mode, researchers use their intuition about behaviors to select the metrics to examine. The user chooses two metrics at a time and the Data Exploration Module compares values between frames where behavior is present or absent and provides summary data and visual representations in the form of boxplots and histograms. A generalized linear model (GLM) also estimates the likelihood that the behavior is present in a frame across a range of threshold values for both selected metrics (Fig. 8A), allowing users to optimize parameters in combination. This process can be repeated for as many metrics as desired.

    In broad mode, the module uses all available keypoint metrics to generate a GLM that can predict behavior. It also rank-orders metrics based on their predictive weights. Poorly predictive metrics are removed from the model if their weight is sufficiently small. Users also have the option to manually remove individual metrics from the model. Once suitable metrics and thresholds have been identified using either mode, users can plug any number and combination of metrics into a classifier template script that we provide and incorporate their new classifier into the Analysis Module. Detailed instructions for integrating new classifiers are available in our GitHub repository (https://github.com/DeNardoLab/BehaviorDEPOT/wiki/Customizing-BehaviorDEPOT).
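
    As an illustration of the broad-mode idea (a sketch under assumed inputs, not the module's actual code), the snippet below fits a binomial GLM to framewise annotations, ranks metrics by the magnitude of their standardized weights, and returns a framewise probability that the behavior is present. The metric names and placeholder data are hypothetical.

    ```matlab
    % Sketch: rank keypoint metrics by predictive weight with a binomial GLM.
    % Placeholder inputs: replace with real framewise metrics and annotations.
    nFrames = 1000;
    metrics = randn(nFrames, 4);                      % e.g., velocities, angular velocities
    labels  = rand(nFrames, 1) < 0.3;                 % framewise behavior annotations
    metricNames = {'backVel', 'headAngVel', 'noseVel', 'bodyLength'};   % example names

    Z   = zscore(metrics);                            % common scale so weights are comparable
    mdl = fitglm(Z, double(labels), 'linear', 'Distribution', 'binomial');  % logistic regression

    weights = mdl.Coefficients.Estimate(2:end);       % drop the intercept
    [~, order] = sort(abs(weights), 'descend');
    rankedMetrics = metricNames(order);               % most predictive metrics first

    pBehavior = predict(mdl, Z);                      % framewise probability of the behavior
    ```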

    MoSeq, JAABA, MARS, SimBA, B-SOiD, DANNCE, and DeepEthogram are among a group of excellent open-source software packages that already do a great job detecting complex behaviors. They use supervised or unsupervised machine learning to detect behaviors that are difficult to see by eye including social interactions and fine-scale grooming behaviors. Instead of trying to improve upon these packages, BehaviorDEPOT is targeting unmet needs of a large group of researchers that study human-defined behaviors and need a fast and easy way to automate their analysis. As examples, we created a classifier to detect vicarious trial and error (VTE), defined by sweeps of the head (Fig. 9). Our revised manuscript also describes our new novel object exploration classifier (Fig. 5). Both behaviors are defined based on animal location and the presence of fine movements that may not be accurately detected by algorithms like MoTr and Ctrax (Fig. 10). As discussed in response to reviewer 1, point 3, additional rounds of machine learning are laborious (humans must label frames as input), computationally intensive, harder to adjust for out-of-sample videos, and are not necessary to quantify these kinds of behaviors.
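
    For readers unfamiliar with how a head-sweep measure of this kind can be hand-fit, the sketch below integrates the absolute change in head angle over one choice-point pass (an IdPhi-style measure commonly used for VTE). The placeholder data, frame window, and threshold are hypothetical; this illustrates the general approach rather than the exact definition used by our classifier.

    ```matlab
    % Illustrative IdPhi-style head-sweep metric for one choice-point pass.
    headAng     = cumsum(0.05 * randn(5000, 1));   % placeholder framewise head angle (radians)
    trialFrames = 1000:1250;                       % frames of one choice-point pass (example)

    dPhi  = diff(unwrap(headAng(trialFrames)));    % framewise angular change (radians)
    IdPhi = sum(abs(dPhi));                        % total head sweep over the pass

    vteThresh = 2.5;                               % radians; placeholder threshold
    isVTE = IdPhi > vteThresh;                     % flag the trial as deliberative (VTE)
    ```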

    4. Finally, I have some concerns about how performance of classifiers is reported. For example, the authors describe a "validation" set of videos used to assess freezing classifier performance, but they are very vague about how the detector was trained in the first place, stating "we empirically determined that thresholding the velocity of a weighted average of 3-6 body parts ... and the angle of head movements produced the best-performing freezing classifier." What videos were used to come to this conclusion? It is imperative that when performance values are reported in the paper, they are calculated on a separate set of validation videos, ideally from different animals, that were never referenced while setting the parameters of the classifier. Otherwise, there is a substantial risk of overfitting, leading to overestimation of classifier performance. Similarly, Figure 7 shows the manual fitting of classifiers to rat and mouse data; the fitting process in 7A is shown to include updating parameters and recalculating performance iteratively. This approach is fine, however I want to confirm that the classifier performances in panels 7F-G were computed on videos not used during fitting.

    Thank you for pointing this out. We have included detailed descriptions of the classifier development and validation in the results (149–204) and methods (789–820) sections and added a table that describes videos used to validate each classifier (Table 4).

    To develop the freezing classifier, we explored linear and angular velocity metrics for various keypoints, finding that angular velocity of the head and linear velocity of a back point tracked best with freezing. Common errors in our classifiers were short sequences of frames at the beginning or end of a behavior bout. This may reflect failures in human detection. Other common errors were sequences of false positive or false negative frames that were shorter than a typical behavior bout. We included the convolution algorithm to correct these short error sequences.

    When developing classifiers (including adjusting the parameters for the external videos), videos were randomly assigned to classifier development (i.e., 'training') and test sets. Dividing up the dataset by video rather than by frame ensures that highly correlated, temporally adjacent frames are not sorted into training and test sets, which can cause overestimation of classifier accuracy. Since the videos in the test set were separate from those used to develop the algorithms, our validation data reflects the accuracy levels users can expect from BehaviorDEPOT.
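
    A minimal sketch of this video-level split (the file names and split fraction are arbitrary examples, not our actual dataset):

    ```matlab
    % Sketch: split by video (not by frame), so temporally adjacent, highly
    % correlated frames never straddle the development and test sets.
    videoList = {'vid01.mp4','vid02.mp4','vid03.mp4','vid04.mp4','vid05.mp4','vid06.mp4'};

    rng(1);                                    % reproducible split
    nVids  = numel(videoList);
    idx    = randperm(nVids);
    nTrain = round(0.5 * nVids);               % e.g., half for development

    devVideos  = videoList(idx(1:nTrain));     % used to hand-fit thresholds
    testVideos = videoList(idx(nTrain+1:end)); % held out for reported performance
    ```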

    5. Overall, I like the user-friendly interface of this software, its interaction with experimental hardware, and its support for hand-crafted behavior classification. However, I feel that more work could be done to support incorporation of additional features and feature combinations as classifier input- it would be great if BehaviorDEPOT could at least partially automate the classifier fitting process, eg by automatically fitting thresholds to user-selected features, or by suggesting features that are most correlated with a user's provided annotations. Finally, the validation of classifier performance should be addressed.

    Thank you for the positive feedback on the interface. We addressed these comments in response to points 3 and 4. To recap, we updated the Data Exploration Module to include Generalized Linear Models that can suggest features with the highest predictive value. We also generated template scripts that simplify the process of creating new classifiers and incorporating them into the Analysis Module. We also included all the details of the videos we used to validate classifier performance, which were separate from the videos that we used to determine the parameters (Table 4).

    Reviewer #3 (Public Review):

    There is a need for standardized pipelines that allow for repeatable robust analysis of behavioral data, and this toolkit provides several helpful modules that researchers will find useful. There are, however, several weaknesses in the current presentation of this work.

    1. It is unclear what the major advance is that sets BehaviorDEPOT apart from other tools mentioned (ezTrack, JAABA, SimBA, MARS, DeepEthogram, etc). A comparison against other commonly used classifiers would speak to the motivation for BehaviorDEPOT - especially if this software is simpler to use and equally efficient at classification.

    We also address this in response to reviewer 1, points 1–3. To summarize, we added direct comparisons with JAABA to the revised manuscript. In Fig. 10, we show that BehaviorDEPOT outperforms JAABA in several ways. First, DLC is better at tracking rodents in complex environments than MoTr and Ctrax, which are the most commonly used JAABA companion software packages for centroid tracking. Second, we show that even when we use DLC to approximate centroids and use this data to train classifiers with JAABA, the BehaviorDEPOT classifiers perform better than JAABA's.

    In a revised manuscript, we included more discussion of what sets BehaviorDEPOT apart from other software, focusing on these main points:

    BehaviorDEPOT vs. commercially available packages (Ethovision, ANYmaze, FreezeFrame, VideoFreeze)

    1. Ethovision, ANYmaze, FreezeFrame, and VideoFreeze cost thousands of dollars per license, while BehaviorDEPOT is free.

    2. The BehaviorDEPOT freezing classifier performs robustly even when animals are wearing a tethered patch cord, while VideoFreeze and FreezeFrame often fail under these conditions.

    3. Keypoint tracking is more accurate and flexible, and it resolves finer detail than methods that use background subtraction or pixel-change detection algorithms combined with center-of-mass or fitted-ellipse tracking.

    BehaviorDEPOT vs. packages targeted at non-coding audiences (JAABA, ezTrack)

    1. DLC keypoint tracking performs better than MoTr and Ctrax in complex environments; the limitations of those trackers likely explain, at least in part, why JAABA has not been widely used in the rodent community. Built around keypoint tracking, BehaviorDEPOT will enable researchers to analyze videos in any type of arena, including videos they have already collected. Keypoint tracking also allows for detection of finer movements, which is essential for behaviors like VTE and object exploration.

    2. Hand-fit classifiers can be created quickly and intuitively for well-defined laboratory behaviors. Compared to machine learning-derived classifiers, they are easier to interpret and easier to fine-tune to optimize performance when given out-of-sample data.

    3. Even when using DLC as the input to JAABA, BehaviorDEPOT classifiers perform better (Figure 10).

    4. BehaviorDEPOT integrates behavioral classification, spatial tracking, and quantitative analysis of behavior and position with reference to spatial ROIs and temporal cues of interest. It is flexible and can accommodate varied experimental designs. In ezTrack, spatial tracking is decoupled from behavioral classification. In JAABA, spatial ROIs can be incorporated into machine learning algorithms, but users cannot quantify behavior with reference to spatial ROIs after classification has occurred. Neither JAABA nor ezTrack provide a way to quantify behavior with reference to temporal events (e.g. optogenetic stimuli, conditioned cues).

    5. BehaviorDEPOT includes analysis and visualization tools, providing many features of the costly commercial software packages for free.

    BehaviorDEPOT vs. packages based on keypoint tracking (SimBA, MARS, B-SOiD)

    Other software packages based on keypoint tracking use supervised or unsupervised methods to classify behavior from animal poses. These software packages target researchers studying complex behaviors that are difficult to see by eye, including social interactions and fine-scale grooming behaviors, whereas BehaviorDEPOT targets a large group of researchers that study human-defined behaviors and need a fast and easy way to automate their analysis. Many behaviors of interest will require spatial tracking in combination with detection of specific movements (e.g. VTE, NOE). Additional rounds of machine learning are laborious (humans must label frames as input), computationally intensive, and are not necessary to quantify these kinds of behaviors.

    2. While the idea might be that joint-level tracking should simplify the classification process, the number of markers used in some of the examples is limited to small regions on the body and might not justify using these markers as input data. The functionality of the tool seems to rely on a single type of input data (a small number of keypoints labeled using DeepLabCut) and throws away a large amount of information in the keypoint labeling step. If the main goal is to build a robust freezing detector then why not incorporate image data (particularly when the best set of key points does not include any limb markers)?

    While one main goal was to build a robust freezing detector, BehaviorDEPOT is general-purpose software. BehaviorDEPOT can classify behaviors from video timeseries and can analyze the results of experiments, similar to Ethovision or FreezeFrame. BehaviorDEPOT is particularly useful for assays in which behavioral classification is integrated with spatial location, including avoidance, decision making (T maze), and novel object memory/recognition. While image data is useful for classifying behavior, it cannot combine spatial tracking with behavioral classification. However, DLC keypoint tracking is well-suited for this purpose. We find that tracking 4–8 points is sufficient to hand-fit high-performing classifiers for freezing, avoidance, reward choice in a T-maze, VTE, and novel object recognition. Of course, users always have the option to track more points because BehaviorDEPOT simply imports the X-Y coordinates and likelihood scores of any keypoints of interest.
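
    As a simple illustration of what integrating behavioral classification with spatial location means in practice, the sketch below restricts a framewise freezing vector to frames inside a rectangular ROI. All inputs and the ROI coordinates are placeholders; in BehaviorDEPOT the equivalent analysis is performed through the GUI rather than code like this.

    ```matlab
    % Sketch: restrict a framewise behavior vector to frames inside a spatial ROI.
    nFrames = 3000;  fps = 30;
    centroid   = 300 * rand(nFrames, 2);       % placeholder tracking data (pixels)
    isFreezing = rand(nFrames, 1) < 0.2;       % placeholder framewise classifier output
    roi = [100 100 250 200];                   % [x y width height] in pixels (example ROI)

    inROI = centroid(:,1) >= roi(1) & centroid(:,1) <= roi(1) + roi(3) & ...
            centroid(:,2) >= roi(2) & centroid(:,2) <= roi(2) + roi(4);

    timeInROI      = sum(inROI) / fps;                                    % seconds spent in ROI
    pctFreezeInROI = 100 * sum(isFreezing & inROI) / max(sum(inROI), 1);  % freezing while in ROI
    ```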

    3. Need a better justification of this classification method

    See response to reviewer 1, points 1–3 above.

    4. Are the thresholds chosen for smoothing and convolution adjusted based on agreement to a user-defined behavior?

    Yes. We added more details in the text. Briefly, users can change the thresholds used in both smoothing and convolution in the GUI and can optimize the values using the Classifier Optimization Module. Smoothing is performed once at the beginning of a session and has an adjustable span for the smoothing window. The convolution is a feature of each classifier, and thus can be adjusted when adjusting the classifier. When developing the freezing classifier, we started with a smoothing window that had the largest value that did not exceed the rate of motion of the animal and then fine-tuned the value to optimize smoothing. In the classifiers we have developed, window widths that are the length of the smallest bout of ‘real’ behavior and count thresholds approximately 1/3 the window width yielded the best results.
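
    A minimal sketch of that windowed-convolution cleanup, with placeholder predictions and illustrative parameter values (window width set to the shortest 'real' bout and the count threshold to roughly one third of the window), is shown below. The exact values would be tuned in the Classifier Optimization Module.

    ```matlab
    % Sketch: remove error sequences shorter than a typical bout by sliding a
    % counting window over the raw framewise predictions.
    rawLabels   = double(rand(3000, 1) < 0.2);   % placeholder framewise 0/1 predictions
    minBout     = 15;                            % frames; shortest 'real' bout for this behavior
    countThresh = round(minBout / 3);            % ~1/3 of the window width

    counts  = conv(rawLabels, ones(minBout, 1), 'same');   % labeled frames within each window
    cleaned = counts >= countThresh;                       % drops isolated false positives and
                                                           % fills brief gaps inside real bouts
    ```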

    5. Jitter is mentioned as a limiting factor in freezing classifier performance - does this affect human scoring as well?

    We were referring to jitter in terms of point location estimates by DeepLabCut. In other words, networks that are tailored to the specific recording conditions have lower error rates in the estimates of keypoint positions. Human scoring is an independent process that is not affected by this jitter. We changed the wording in the text to avoid any confusion.

    6. The use of a weighted average of body part velocities again throws away information - if one had a very high-quality video setup with more markers would optimal classification be done differently? What if the input instead consisted of 3D data, whether from multi-camera triangulation or other 3D pose estimation? Multi-animal data?

    From reviewer 2, point 3: MARS, SimBA, and B-SOiD are excellent open-source software packages that are also based on keypoint tracking. They use supervised or unsupervised methods to classify complex behaviors that are difficult to see by eye, including social interactions and fine-scale grooming behaviors. Instead of trying to improve upon these packages, which are already great, BehaviorDEPOT is targeting unmet needs of a large group of researchers that study human-defined behaviors and need a fast and easy way to automate their analysis. Additional rounds of machine learning are laborious (humans must label frames as input), computationally intensive, and are not necessary to quantify these kinds of behaviors. However, keypoint tracking offers accuracy, precision, and flexibility that are superior to behavioral classification programs that estimate movement based on background subtraction, center of mass, ellipse fitting, etc.

    7. It is unclear where the manual annotation of behavior is used in the tool as currently stands. Is the validation module used to simply say that the freezing detector is as good as a human annotator? One might expect that algorithms which use optic flow or pixel-based metrics might be superior to a human annotator, is it possible to benchmark against one of these? For behaviors other than freezing, a tool to compare human labels seems useful. The procedure described for converging on a behavioral definition is interesting and an example of this in a behavior other than freezing, especially where users may disagree, would be informative. It appears that manual annotation doesn't actually happen in the GUI and a user must create this themselves - this seems unnecessarily complicated.

    Manual annotation of behavior is used in the four classifier development modules: inter-rater, data exploration, optimization, and validation. The Inter-Rater Module can be used as a tool to refine ground-truth behavioral definitions. It imports annotations from any number of raters and generates graphical and text-based statistical reports about overlap, disagreement, etc. Users can use this tool to iteratively refine annotations until they converge maximally. The Inter-Rater Module can be used to compare human labels (or any reference set of annotations) for any behavior. To ensure this is clear to the readers, we added more details to the text and a second demonstration of the Inter-Rater Module using novel object exploration annotations (Fig. 7). The Validation Module imports reference annotations, which can be produced by a human or another program, and benchmarks classifier performance against the reference. We added more details to this section as well.
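
    For orientation, the sketch below shows the kind of framewise comparison involved (raw percent agreement, a chance-corrected kappa, and the disagreement frames a user would revisit). The annotation vectors are placeholders, and the statistics reported by the Inter-Rater Module are more extensive than this.

    ```matlab
    % Sketch: framewise agreement between two raters' logical annotation vectors.
    nFrames = 2000;
    rater1 = rand(nFrames, 1) < 0.25;                 % placeholder annotations, rater 1
    rater2 = xor(rater1, rand(nFrames, 1) < 0.05);    % rater 2 mostly agrees with rater 1

    agree   = mean(rater1 == rater2);                 % raw framewise agreement
    p1 = mean(rater1);  p2 = mean(rater2);
    pChance = p1 * p2 + (1 - p1) * (1 - p2);          % agreement expected by chance
    kappa   = (agree - pChance) / (1 - pChance);      % Cohen's kappa

    disagreeFrames = find(rater1 ~= rater2);          % frames to revisit when refining the definition
    ```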

    Freezing is a straightforward behavior that is easy to detect by eye. Rather than benchmark against an optic flow algorithm, we benchmarked against JAABA, another user-friendly behavioral classification software that uses machine learning algorithms. We find that BehaviorDEPOT is easier to use and labels freezing more accurately than JAABA. We also made a second freezing classifier that uses a changepoint algorithm to identify transitions from movement to freezing that may accommodate a wider range of video framerates and resolutions.
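
    As a rough illustration of the changepoint idea (using MATLAB's findchangepts from the Signal Processing Toolbox; a generic sketch with placeholder data, not necessarily the algorithm our second classifier implements), transitions between movement and immobility can be located as shifts in the mean of a velocity trace:

    ```matlab
    % Sketch: locate movement/immobility transitions as mean shifts in a velocity trace.
    fps = 30;
    vel = [abs(randn(300,1)) + 5; abs(randn(150,1)) * 0.2; abs(randn(300,1)) + 5];  % placeholder

    changeFrames = findchangepts(vel, 'Statistic', 'mean', 'MinThreshold', 200);
    changeTimes  = changeFrames / fps;        % candidate bout boundaries, in seconds
    ```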

    We plan to incorporate an annotation feature into the GUI, but in the interest of disseminating our work soon, we argue that this is not necessary for inclusion now. There are many free or cheap programs that allow framewise annotation of behavior including FIJI, Quicktime, VLC, and MATLAB. In fact, users may already have manual annotations or annotations produced by a different software and BehaviorDEPOT can import these directly. While machine learning classifiers like JAABA require human annotations to be entered into their GUI, allowing people to import annotations they collected previously saves time and effort.

    8. A major benefit of BehaviorDEPOT seems to be the ability to run experiments, but the ease of programming specific experiments is not readily apparent. The examples provided use different recording methods and networks for each experimental context as well as different presentations of data - it is not clear which analyses are done automatically in BehaviorDEPOT and which require customizing code or depend on the MiniCAM platform and hardware. For example - how does synchronization with neural or stimulus data occur? Overall it is difficult to judge how these examples would be implemented without some visual documentation.

    We added visual documentation of the Experiment Module graphical interface to Figure 1 and added more detail to the results, methods, and our GitHub repository (https://github.com/DeNardoLab/Fear-Conditioning-Experiment-Designer). Synchronization with stimulus data can occur within the Experiment Module (designed for fear conditioning experiments), or stimulus timestamps can be easily imported into the Analysis Module. Synchronization with neural data occurs post hoc using the data structures produced by the BehaviorDEPOT Analysis Module. We include our code for aligning behavior to Miniscope data in our GitHub repository (https://github.com/DeNardoLab/caAnalyze).
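
    To sketch what this post hoc alignment looks like, the snippet below converts stimulus timestamps into frame indices and computes percent freezing in a window around each cue. All inputs (framewise freezing vector, frame rate, cue times, window lengths) are placeholder examples rather than values from our experiments.

    ```matlab
    % Sketch: align a framewise behavior vector to stimulus timestamps.
    fps        = 30;
    isFreezing = rand(9000, 1) < 0.2;           % placeholder framewise classifier output
    cueOnsets  = [120 180 240];                 % cue onset times in seconds (example)
    preSec     = 10;  postSec = 30;             % analysis window around each cue

    nFrames   = numel(isFreezing);
    pctFreeze = zeros(1, numel(cueOnsets));
    for i = 1:numel(cueOnsets)
        onFrame = round(cueOnsets(i) * fps);
        win = max(1, round(onFrame - preSec*fps)) : min(nFrames, round(onFrame + postSec*fps));
        pctFreeze(i) = 100 * mean(isFreezing(win));    % percent freezing around each cue
    end
    ```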

  2. Evaluation Summary:

    This paper is of potential interest to researchers performing animal behavioral quantification with computer vision tools. The manuscript introduces 'BehaviorDEPOT', a MATLAB application and GUI intended to facilitate quantification and analysis of freezing behavior from behavior movies, along with several other classifiers based on movement statistics calculated from animal pose data. The paper describes how the tool can be applied to several specific types of experiments, and emphasizes the ease of use - particularly for groups without experience in coding or behavioral quantification. While these aims are laudable, and the software is relatively easy to use, further improvements to make the tool more automated would substantially broaden the likely user base.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

  3. Reviewer #1 (Public Review):

    In this manuscript, the authors introduce a new piece of software, BehaviorDEPOT, that aims to serve as an open source classifier in service of standard lab-based behavioral assays. The key arguments the authors make are that 1) the open source code allows for freely available access, 2) the code doesn't require any coding knowledge to build new classifiers, 3) it is generalizable to other behaviors than freezing and other species (although this latter point is not shown), 4) that it uses posture-based tracking that allows for higher resolution than centroid-based methods, and 5) that it is possible to isolate features used in the classifiers. While these aims are laudable, and the software is indeed relatively easy to use, I am not convinced that the method represents a large conceptual advance or would be highly used outside the rodent freezing community.

    Major points:

    1. I'm not convinced over one of the key arguments the authors make - that the limb tracking produces qualitatively/quantitatively better results than centroid/orientation tracking alone for the tasks they measure. For example, angular velocities could be used to identify head movements. It would be good to test this with their data (could you build a classifier using only the position/velocity/angular velocities of the main axis of the body?)

    2. This brings me to the point that the previous state-of-the-art open-source methodology, JAABA, is barely mentioned, and I think that a more direct comparison is warranted, especially since this method has been widely used/cited and is also aimed at a not-coding audience.

    3. Remaining on JAABA: while the authors' classification approach appeared to depend mostly on a relatively small number of features, JAABA uses boosting to build a very good classifier out of many not-so-good classifiers. This approach is well-worn in machine learning and has been used to good effect in high-throughput behavioral data. I would like the authors to comment on why they decided on the classification strategy they have.

    4. I would also like more details on the classifiers the authors used. There is some detail in the main text, but a specific section in the Methods section is warranted, I believe, for transparency. The same goes for all of the DLC post-processing steps.

    5. It would be good for the authors to compare the Inter-Rater Module to the methods described in the MARS paper (reference 12 here).

    6. More quantitative discussion about the effect of tracking errors on the classifier would be ideal. No tracking is perfect, so an end-user will need to know "how good" they need to get the tracking to get the results presented here.

  4. Reviewer #2 (Public Review):

    BehaviorDEPOT is a Matlab-based user interface aimed at helping users interact with animal pose data without significant coding experience. It is composed of several tools for analysis of animal tracking data, as well as a data collection module that can interface via Arduino to control experimental hardware. The data analysis tools are designed for post-processing of DeepLabCut pose estimates and manual pose annotations, and includes four modules: 1) a Data Exploration module for visualizing spatiotemporal features computed from animal pose (such as velocity and acceleration), 2) a Classifier Optimization module for creating hand-fit classifiers to detect behaviors by applying windowing to spatiotemporal features, 3) a Validation module for evaluating performance of classifiers, and 4) an Inter-Rater Agreement module for comparing annotations by different individuals.

    A strength of BehaviorDEPOT is its combination of many broadly useful data visualization and evaluation modules within a single interface. The four experimental use cases in the paper nicely showcase various features of the tool, working the user from the simplest example (detecting optogenetically induced freezing) to a more sophisticated decision-making example in which BehaviorDEPOT is used to segment behavioral recordings into trials, and within trials to count head turns per trial to detect deliberative behavior (vicarious trial and error, or VTE.) The authors also demonstrate the application of their software using several different animal pose formats (including from 4 to 9 tracked body parts) from multiple camera types and framerates.

    One point that confused me when reading the paper was whether BehaviorDEPOT was using a single, fixed freezing classifier, or whether the freezing classifier was being tuned to each new setting (the latter is the case.) The abstract, introduction, and "Development of the BehaviorDEPOT Freezing Classifier" sections all make the freezing classifier sound like a fixed object that can be run "out-of-the-box" on any dataset. However, the subsequent "Analysis Module" section says it implements "hard-coded classifiers with adjustable parameters", which makes it clear that the freezing classifier is not a fixed object, but rather it has a set of parameters that can (must?) be tuned by the user to achieve desired performance. It is important to note that the freezing classifier performances reported in the paper should therefore be read with the understanding that these values are specific to the particular parameter configuration found (rather than reflecting performance a user could get out of the box.)

    This points to a central component of BehaviorDEPOT's design that makes its classifiers different from those produced by previously published behavior detection software such as JAABA or SimBA. So far as I can tell, BehaviorDEPOT includes no automated classifier fitting, instead relying on the users to come up with which features to use and which thresholds to assign to those features. Given that the classifier optimization module still requires manual annotations (to calculate classifier performance, Fig 7A), I'm unsure whether hand selection of features offers any kind of advantage over a standard supervised classifier training approach. That doesn't mean an advantage doesn't exist- maybe the hand-fit classifiers require less annotation data than a supervised classifier, or maybe humans are better at picking "appropriate" features based on their understanding of the behavior they want to study.

    There is something to be said for helping users hand-create behavior classifiers: it's easier to interpret the output of those classifiers, and they could prove easier to fine-tune to fix performance when given out-of-sample data. Still, I think it's a major shortcoming that BehaviorDEPOT only allows users to use up to two parameters to create behavior classifiers, and cannot create thresholds that depend on linear or nonlinear combinations of parameters (eg, Figure 6D indicates that the best classifier would take a weighted sum of head velocity and change in head angle.) Because of these limitations on classifier complexity, I worry that it will be difficult to use BehaviorDEPOT to detect many more complex behaviors.

    Finally, I have some concerns about how performance of classifiers is reported. For example, the authors describe a "validation" set of videos used to assess freezing classifier performance, but they are very vague about how the detector was trained in the first place, stating "we empirically determined that thresholding the velocity of a weighted average of 3-6 body parts ... and the angle of head movements produced the best-performing freezing classifier." What videos were used to come to this conclusion? It is imperative that when performance values are reported in the paper, they are calculated on a separate set of validation videos, ideally from different animals, that were *never referenced* while setting the parameters of the classifier. Otherwise, there is a substantial risk of overfitting, leading to overestimation of classifier performance. Similarly, Figure 7 shows the manual fitting of classifiers to rat and mouse data; the fitting process in 7A is shown to include updating parameters and recalculating performance iteratively. This approach is fine, however I want to confirm that the classifier performances in panels 7F-G were computed on videos not used during fitting.

    Overall, I like the user-friendly interface of this software, its interaction with experimental hardware, and its support for hand-crafted behavior classification. However, I feel that more work could be done to support incorporation of additional features and feature combinations as classifier input- it would be great if BehaviorDEPOT could at least partially automate the classifier fitting process, eg by automatically fitting thresholds to user-selected features, or by suggesting features that are most correlated with a user's provided annotations. Finally, the validation of classifier performance should be addressed.

  5. Reviewer #3 (Public Review):

    There is a need for standardized pipelines that allow for repeatable robust analysis of behavioral data, and this toolkit provides several helpful modules that researchers will find useful. There are, however, several weaknesses in the current presentation of this work.

    It is unclear what the major advance is that sets BehaviorDEPOT apart from other tools mentioned (ezTrack, JAABA, SimBA, MARS, DeepEthogram, etc). A comparison against other commonly used classifiers would speak to the motivation for BehaviorDEPOT - especially if this software is simpler to use and equally efficient at classification. While the idea might be that joint-level tracking should simplify the classification process, the number of markers used in some of the examples is limited to small regions on the body and might not justify using these markers as input data. The functionality of the tool seems to rely on a single type of input data (a small number of keypoints labeled using DeepLabCut) and throws away a large amount of information in the keypoint labeling step. If the main goal is to build a robust freezing detector then why not incorporate image data (particularly when the best set of key points does not include any limb markers)? Are the thresholds chosen for smoothing and convolution adjusted based on agreement to a user-defined behavior? Jitter is mentioned as a limiting factor in freezing classifier performance - does this affect human scoring as well? The use of a weighted average of body part velocities again throws away information - if one had a very high-quality video setup with more markers would optimal classification be done differently? What if the input instead consisted of 3D data, whether from multi-camera triangulation or other 3D pose estimation? Multi-animal data?

    It is unclear where the manual annotation of behavior is used in the tool as currently stands. Is the validation module used to simply say that the freezing detector is as good as a human annotator? One might expect that algorithms which use optic flow or pixel-based metrics might be superior to a human annotator, is it possible to benchmark against one of these? For behaviors other than freezing, a tool to compare human labels seems useful. The procedure described for converging on a behavioral definition is interesting and an example of this in a behavior other than freezing, especially where users may disagree, would be informative. It appears that manual annotation doesn't actually happen in the GUI and a user must create this themselves - this seems unnecessarily complicated.

    A major benefit of BehaviorDEPOT seems to be the ability to run experiments, but the ease of programming specific experiments is not readily apparent. The examples provided use different recording methods and networks for each experimental context as well as different presentations of data - it is not clear which analyses are done automatically in BehaviorDEPOT and which require customizing code or depend on the MiniCAM platform and hardware. For example - how does synchronization with neural or stimulus data occur? Overall it is difficult to judge how these examples would be implemented without some visual documentation.