Selfee, self-supervised features extraction of animal behaviors

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Jia et al. present an exciting machine learning framework named "Selfee" for unsupervised and objective analysis of animal behavior that should draw broad interest from researchers studying quantitative animal behavior. However, there are some unresolved issues for establishing credibility of the method that need to be addressed.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their name with the authors.)

Abstract

Fast and accurate characterization of animal behaviors is crucial for neuroscience research. Deep learning models are efficiently used in laboratories for behavior analysis. However, it has not yet been achieved to use an end-to-end unsupervised neural network to extract comprehensive and discriminative features directly from social behavior video frames for annotation and analysis purposes. Here, we report a self-supervised feature extraction (Selfee) convolutional neural network with multiple downstream applications to process video frames of animal behavior in an end-to-end way. Visualization and classification of the extracted features (Meta-representations) validate that Selfee processes animal behaviors in a way similar to human perception. We demonstrate that Meta-representations can be efficiently used to detect anomalous behaviors that are indiscernible to human observation and hint at in-depth analysis. Furthermore, time-series analyses of Meta-representations reveal the temporal dynamics of animal behaviors. In conclusion, we present a self-supervised learning approach to extract comprehensive and discriminative features directly from raw video recordings of animal behaviors and demonstrate its potential usage for various downstream applications.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    The authors sought to create a machine learning framework for analyzing video recordings of animal behavior, which is both efficient and runs in an unsupervised fashion. The authors construct Selfee from recent computational neural network codes. As the paper is methods-focused, the key metrics for success would be (1) whether Selfee performs similarly to or more accurately than existing methods, and more importantly (2) whether Selfee uncovers new behavioral features or dynamics otherwise missed by those existing methods.

    Weaknesses:

    Although the basic schematics of Selfee are laid out, and the code itself is available, I feel that material in between these two levels of description is somewhat lacking. Details of what other previously published machine learning code makes up Selfee, and how those parts work would be helpful. Some of this is in the methods section, but an expanded version aimed at a more general readership would be helpful.

    Thanks for the suggestions. We expanded the paragraphs describing the training objectives and the AR-HMM analysis. We also revised Figure 2C for clarity and added a new figure, Figure 6, to describe how our pipeline works in detail. We also added detailed instructions for Selfee usage on our GitHub page.

    * The paper highlights efficiency as an important aspect of machine learning analysis techniques in the introduction, but there is little follow-up on this aspect.

    Our model only has a more efficient training process compared with other self-supervised learning methods. We also found that our model can perform zero-shot domain transfer, so training may not even be necessary. However, we did not mean that our model is superior in terms of data efficiency or inference speed. We have revised some of the claims in the Discussion.

    * In comparing Selfee to other approaches, the paper uses DeepLabCut, but perhaps running other recent methods for more comprehensive comparison would be helpful as well.

    We compared Selfee feature extraction with features from FlyTracker and JAABA, two widely used software packages. We also visualized the tracking results of SLEAP and FlyTracker to complement the DeepLabCut experiment.

    * Using Selfee to investigate courtship behavior and other interactions was nicely demonstrated. Running it on simpler data (say, videos of individual animals walking around or exploring a confined space) might more broadly establish the method's usefulness.

    We applied Selfee to the open field test (OFT) of mice after chronic immobilization stress (CIS) treatment. With this experiment, we demonstrated our whole pipeline, from data preprocessing to all the data-mining algorithms, and the results were added to the last section of the Results.

    Reviewer #2 (Public Review):

    Jia et al. present a CNN-based tool named "Selfee" for unsupervised quantification of animal behavior that could be used for objectively analyzing animal behavior recorded in relatively simple setups commonly used by various neurobiology/ethology laboratories. This work is very relevant but has some serious unresolved issues that need to be addressed to establish the credibility of the method.

    Overall Strengths: Jia et al have leveraged a recent development "Simple Siamese CNNs" to work for behavioral segmentation. This is a terrific effort and theoretically very attractive.

    Overall Weakness: Unfortunately, the data supporting the method are not as promising. The manuscript is also riddled with incomplete information and a lack of rationale behind the experiments.

    Specific points of concern:

    1. No formal comparison with pre-existing methods like JAABA, which would work on the same kind of videos as Selfee.

    We added comparisons with JAABA- and FlyTracker-extracted features, and also visualized FlyTracker and SLEAP tracking results in addition to DeepLabCut. This result is now in the new Table 1. To avoid tracking inaccuracy during intensive interactions and potentially inappropriately tuned parameters, we used a peer-reviewed dataset focused on wing extension behavior only. Our results show that Selfee performs competitively with the other methods.

    2. For all Drosophila behavior experiments, I'm concerned about the control and test genetic background. Several studies have reported that social behaviors like courtship and aggression are highly visual and sensitive to genetic background and the presence of the "white" gene. The authors use Canton S (CS) flies as control data, whereas it is unclear if any or all of the test genotypes have been crossed into this background. It would be helpful if the authors provided genotype information for the test flies.

    We have added a detailed sheet about their genotypes in this version. The genetic information of all animals can also be found at the Bloomington Drosophila Stock Center using the IDs provided. In brief, five fly lines used in this work are in the CS background: CCHa2-R-RAGal4, CCHa2-R-RBGal4, Dop2RKO, DopEcRGal4 and Tdc2RO54. We did not backcross other flies into the CS background for three reasons. First, most mutant lines are compared with their appropriate control lines. For example, in the original Figure 3B (the new Figure 4B), the CCHa2-R-RBGal4 > Kir2.1 flies contained a wildtype white gene, so the comparison with CS flies would not cause any problem. The TrhGal4 flies were in a white mutant background, and so were the other lines that had no phenotype. At the same time, in the original Figure 3G to J (the new Figure 4G to J), we used w1118 as controls for TrhGal4 flies, which were all in a mutated white background. Second, in the original Figure 4F and G (the new Figure 5F and G), we admit that the comparison between NorpA36 flies, in a mutated white background, and CS flies was not very convincing. Nevertheless, the delayed courtship dynamics of NorpA mutants have been reported before, and our experiment was just a demonstration of the DTW algorithm. Lastly, our work focuses on the methodology of animal behavior analysis, and the original videos are provided for replication. Therefore, even if the behavioral difference were due to genetic background, it would not affect the conclusion that our method can detect the difference.

    3. Utility of the "anomaly score" rests on Fig 3 data. The authors write that they screened "neurotransmitter-related mutants or neuron silenced lines" (lines 251-252). Yet Figure 3B lacks some of the most commonly occurring neurotransmitter mutant/neuron labeling lines (e.g. acetylcholine, GABA, dopamine; instead there are some neurotransmitter receptor lines, but then again prominent ones are missing). This reduces the credibility of this data.

    First of all, this paper did not intend to conduct new screening assays; rather, we used pre-existing data in the lab to demonstrate the application of Selfee. Previous work in our lab focused on the homeostatic control of fly behaviors, so most of the lines listed here were originally used to test the roles of neuropeptides or neurons in nutrient and metabolism regulation, such as CCHa-related lines, a CNMa mutant, and Taotie-neuron-silenced flies. Some other important genes were not involved in this dataset. Some of the most common transmitters are not included for two reasons. First, common neurotransmitters usually have very global and broad effects on animal behaviors, and even if there were any new discovery, it could be difficult to interpret the phenomenon due to the large number of disturbed neurons. Second, most mutants of those common neurotransmitters are not viable, for example, paleGal4 as a mutant for dopamine, Gad1A30 for GABA, and ChATl3 for acetylcholine. However, we did perform experiments on serotonin-related genes (SerT and Trh), octopamine-related genes (Tdc and Oamb), and some other viable dopamine receptor mutants.

    4. The utility of AR-HMM following "Selfee" analysis rests on the IR76b mutant experiment (Fig 4). This is the most perplexing experiment! There are so many receptors implicated in courtship, and IR76b is definitely not among the most well-known. None of the citations for IR76b in this manuscript have anything to do with detection of female pheromones. IR76b is implicated in salt and amino acid sensation. The authors still call this "an extensively studies (co)receptor that is known to detect female pheromones" (lines 310-311). Unsurprisingly the AR-HMM analysis doesn't find any difference in modules related to courtship. Unless I'm mistaken, the premise for this experiment is wrong and hence not much weight should be given to its results.

    We have removed the Ir76b results from the Results section. The demonstration of AR-HMM is now done with a mouse open field assay.

    Reviewer #3 (Public Review):

    This paper is describing a machine learning method applied to videos of animals. The method requires very little pre-processing (end-to-end) such as image segmentation or background subtraction. The input images have three channels, mapping temporal information (live-frames). The architecture is based on twin deep neural networks (Siamese network) and does not require human-annotated labels (unsupervised learning). However, labels can still be used if they are produced, as in this case, by the algorithm itself - self-supervised learning. This flavor of machine learning is reflected in the name of the method: "Selfee." The authors convincingly apply Selfee to several challenging animal behavior tasks, which results in biologically relevant discoveries.

    A significant advantage of unsupervised and self-supervised learning is twofold: 1) it allows for discovering new behaviors, and 2) it doesn't require human-produced labels.

    In this case of self-supervised learning the features (meta-representations) are learned from two views of the same original image (live-frame), where one of the views is augmented in several different ways, with the hope of letting the deep neural network (ResNet-50 architecture in this case) learn to ignore such augmentations, i.e. learn meta-representations invariant to natural changes in the data similar to the augmentations. This is accomplished by utilizing a Siamese Convolutional Neural Network (CNN) with the ResNet-50 version as a backbone. Siamese networks are composed of twin deep nets, where each member of the pair is trying to predict the output of the other. In applications such as face recognition they normally work in the supervised learning setting, by utilizing "triplets" containing "negative samples." These are the labels.

    However, in the self-supervised setting, which "Selfee" is implementing, the negative samples are not required. Instead the same image (a positive sample) is viewed twice, as described above. Here the authors use the SimSiam core architecture described by Chen, X. & He, K (reference 29 in the paper). They add Cross-Level Discrimination (CLD) to the SimSiam core. Together these two components provide two Loss functions (Loss 1 and Loss 2). Both are critical for the extraction of useful features. In fact, removing the CLD causes major deterioration of the classification performance (Figure 2-figure supplement 5).
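
    As a concrete illustration of the first loss, a minimal PyTorch-style sketch of the SimSiam objective is given below: the symmetrized negative cosine similarity between one branch's predictor output and the stop-gradient of the other branch's projection. The encoder/projector/predictor names in the usage comment are placeholders, not the actual Selfee code.

    ```python
    import torch
    import torch.nn.functional as F

    def simsiam_loss(p1, p2, z1, z2):
        """Symmetrized negative cosine similarity (SimSiam-style Loss 1).

        p1, p2: predictor outputs for the two augmented views.
        z1, z2: projector outputs for the same views; gradients are stopped
                on z so each branch predicts the other without collapsing.
        """
        def d(p, z):
            z = z.detach()                      # stop-gradient
            p = F.normalize(p, dim=1)
            z = F.normalize(z, dim=1)
            return -(p * z).sum(dim=1).mean()   # negative cosine similarity
        return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)

    # Usage (module names are placeholders):
    # z1, z2 = projector(encoder(view1)), projector(encoder(view2))
    # p1, p2 = predictor(z1), predictor(z2)
    # loss = simsiam_loss(p1, p2, z1, z2) + lambda_cld * cld_loss(...)
    ```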

    The authors demonstrate the utility of Selfee by using the learned features (meta-representations) for classification (supervised learning; with human annotation), discovering short-lasting new behaviors in flies by anomaly detection, long time-scale dynamics by AR-HMM, and Dynamic Time Warping (DTW).

    For the classification the authors use k-NN (flies) and LightGBM (mice) classifiers and they infer the labels from the Selfee embedding (for each frame), and the temporal context, using the time-windows of 21 frames and 81 frames, for k-NN classification and LightGBM classification, respectively. Accounting for the temporal context is especially important in mice (LightGBM classification) so the authors add additional windowed features, including frequency information. This is a neat approach. They quantify the classification performance by confusion matrices and compute the F1 for each.
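
    To make the windowed classification concrete, the sketch below averages per-frame features over a centred 21-frame window and feeds them to a k-NN classifier; the exact aggregation used in the paper may differ, so treat this as an illustrative assumption.

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def window_average(features, win=21):
        """Average a (n_frames, dim) feature array over a centred window."""
        half = win // 2
        padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
        return np.stack([padded[i:i + win].mean(axis=0)
                         for i in range(len(features))])

    def knn_annotate(feats_train, labels_train, feats_test, k=5, win=21):
        """Fit a k-NN classifier on windowed training features and label new frames."""
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(window_average(feats_train, win), labels_train)
        return clf.predict(window_average(feats_test, win))
    ```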

    Overall, I find these classification results compelling, but one general concern is the criticality of the CLD component for achieving any meaningful classification. I would suggest that the authors discuss in more depth why this component is so critical for the extraction of features (used in supervised classification) and compare their SimSiam architecture to other methods where the CLD component is implemented. In other words, to what degree is the SimSiam implementation an overkill? Could a simpler (and thus faster) method be used - with the CLD component - instead to achieve similar end-to-end classification? The answer would help illuminate the importance of the SimSiam architecture in Selfee.

    We added more about the contribution of the CLD loss to the last paragraph of "Siamese convolutional neural networks capture discriminative representations of animal posture", the second section of the Results. Further optimization of neural network architectures is discussed in the Discussion section. As for why CLD is so important, there are two main reasons. First, all behavior frames are so similar that it is not easy to distinguish them from each other. In the field of so-called self-supervised learning without negative samples, researchers use either batch normalization or similar operations to implicitly utilize negative samples within a mini-batch. However, when all samples are quite similar, that might not be enough. CLD uses explicit clusters to utilize negative samples within a mini-batch; in the words of its authors, “Our key insight is that grouping could result from not just attraction, but also common repulsion”, which provides more powerful discrimination. The second reason, as argued in the CLD paper, is that CLD is very powerful for long-tailed datasets. As shown in the original Figure 2—figure supplement 5 (the new Figure 3—figure supplement 5), behavior data are highly unbalanced. As explained in the CLD paper, CLD fights the long-tailed distribution in two ways. One is that it scales up the importance of negative samples within a mini-batch from 1/B to 1/K by k-means; the other is that the clustering operation can relieve the imbalance between the tail and head classes within a mini-batch. Here we quote: “While the distribution of instances in a random mini-batch is long-tailed, it would be more flattened across classes after clustering.” This is also visualized in Fig. 5 of the CLD paper.
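
    To convey the “common repulsion” idea in a toy form, the sketch below clusters one view's features within a mini-batch by k-means and contrasts the other view's instances against the resulting centroids. This is a heavily simplified stand-in for the actual CLD loss (which has its own branch, normalization, and temperature details), offered only to illustrate the cross-level intuition.

    ```python
    import torch
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    def cld_like_loss(feat_a, feat_b, n_clusters=8, temperature=0.07):
        """Toy cross-level discrimination: pull each instance in view A toward
        the centroid of the cluster its counterpart in view B falls into, and
        push it away from the other centroids ("common repulsion").
        n_clusters must not exceed the mini-batch size."""
        feat_a = F.normalize(feat_a, dim=1)
        feat_b = F.normalize(feat_b, dim=1)

        # Cluster view-B features within the mini-batch (k-means on CPU here).
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(
            feat_b.detach().cpu().numpy())
        centroids = F.normalize(
            torch.as_tensor(km.cluster_centers_, dtype=feat_a.dtype,
                            device=feat_a.device), dim=1)
        targets = torch.as_tensor(km.labels_, dtype=torch.long,
                                  device=feat_a.device)

        # Instance-vs-centroid logits; cross-entropy against the assigned cluster.
        logits = feat_a @ centroids.t() / temperature
        return F.cross_entropy(logits, targets)
    ```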

    To the best of our knowledge, SimSiam is the simplest method that works with CLD. In the original CLD paper, the authors combined CLD with other popular frameworks, including BYOL and MoCo v2, but those frameworks are more complicated than SimSiam. We attempted to combine CLD with Barlow Twins but failed. As the author of CLD suggested on GitHub: “Hi, good to know that you are trying to combine CLD with BarLowTwins! My concern is also on the high feature dimension, which may cause the low clustering quality. Maybe it is necessary to have a projection layer to project the high-dimensional feature space to a low-dimensional one.” In terms of speed, there are two major parts. For inference, only one branch is used, so the main contribution to efficiency comes from the CNN backbone. In theory, light backbones like MobileNet would work, but ResNet-50 is already fast enough on a modern GPU. As for training, the major computational cost aside from the CNN backbone comes from the Siamese branches: two branches mean twice the computation. Nevertheless, CLD relies on this kind of structure, so even a learning framework simpler than SimSiam is unlikely to achieve a faster training speed. As for other structures, we think this new instance learning framework (https://arxiv.org/abs/2201.10728) might achieve a similar result with less data and in a shorter time, and it could possibly be used with CLD; we might try it in the future.

    One potential issue with unsupervised/self-supervised learning is that it "discovers" new classes based, not on behavioral features but rather on some other, irrelevant, properties of the video, e.g. proximity to the edges, a particular camera angle, or a distortion. In supervised learning the algorithm learns the features that are invariant to such properties, because human-made labels are used and humans are great at finding these invariant features. The authors do mention a potential limitation, related to this issue, in the Discussion ("mode splitting"). One way of getting around this issue, other than providing negative samples, is to use a very homogeneous environment (so that only invariance to orientation, translation, etc, needs to be accomplished). This has worked nicely, for example, with posture embedding (Berman, G. J., et al; reference 19 in the manuscript). Looking at the t-SNE plots in Figure 2 one must wonder how many of the "clusters" present there are the result of such learning of irrelevant (for behavior) features, i.e. how good is the generalization of the meta-representations. The authors should explore the behaviors found in different parts of the t-SNE maps and evaluate the effect of the irrelevant features on their distributions. For example, they may ask: to what extent does the distance of an animal from the nearest wall affect the position in the t-SNE map? It would be nice to see how various simple pre-processing steps might affect the t-SNE maps, as well as the classification performance. Some form of segmentation, even very crude, or simply background subtraction, could go a very long way towards improving the features learned by Selfee.

    In the new Figure 3—figure supplement 1, the visualization demonstrates that our features contain a great deal of physical information, including wing angles, inter-animal distance, and positions in the chamber. "Mode splitting" can be partially explained by those features. We actually performed background subtraction and image cropping for mouse behaviors, where we found them useful.
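
    A minimal example of this kind of preprocessing (not the authors' exact pipeline) is to estimate a static background as the per-pixel median of sampled frames and subtract it from each frame before feature extraction:

    ```python
    import cv2
    import numpy as np

    def median_background(video_path, n_samples=50):
        """Estimate a static background as the per-pixel median of sampled frames."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in np.linspace(0, total - 1, n_samples, dtype=int):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        cap.release()
        return np.median(np.stack(frames), axis=0).astype(np.uint8)

    def subtract_background(frame_gray, background):
        """Absolute difference between a grayscale frame and the background."""
        return cv2.absdiff(frame_gray, background)
    ```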

    The anomaly detection is used to find unusual short-lasting events during male-male interaction behavior (Figure 3). The method is explained clearly. The results show how Selfee discovered a mutant line with a particularly high anomaly score. The authors managed to identify this behavior as "brief tussle behavior mixed with copulation attempts." The anomaly detection analyses were also applied to discover another unusual phenotype (close body contact) in another mutant line. Both results are significant when compared to the control groups.
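
    For readers wanting a concrete picture of a nearest-neighbour anomaly score of the kind discussed here, a generic formulation is sketched below: each test frame is scored by its mean distance to its k nearest neighbours in a pool of control features, so frames unlike anything seen in controls receive high scores. This is an assumed, generic formulation for illustration, not necessarily the exact score defined in the paper.

    ```python
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def anomaly_scores(control_feats, test_feats, k=10):
        """Mean distance from each test frame to its k nearest control frames."""
        nn = NearestNeighbors(n_neighbors=k).fit(control_feats)
        dist, _ = nn.kneighbors(test_feats)
        return dist.mean(axis=1)

    # Frames with the highest scores are candidate rare events, e.g. brief
    # tussle bouts mixed with copulation attempts.
    ```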

    The authors then apply AR-HMM and DTW to study the time dynamics of courtship behavior. Here too, they discover two phenotypes with unusual courtship dynamics, one in an olfactory mutant, and another in flies where the mutation affects visual transduction. Both results are compelling.
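
    As background for the DTW analysis mentioned above, a minimal dynamic-programming DTW between two feature time series might look like the sketch below; the Euclidean frame cost and the absence of a warping-window constraint are simplifying assumptions.

    ```python
    import numpy as np

    def dtw_distance(seq_a, seq_b):
        """Classic DTW between two (time, dim) sequences with Euclidean cost."""
        n, m = len(seq_a), len(seq_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]
    ```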

    The authors explain their usage of DTW clearly, but they should expand the description of the AR-HMM so that the reader doesn't have to study the original sources.

    We expanded the section that talks about AR-HMM mechanisms.
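
    For readers who do not wish to consult the original AR-HMM sources, the generative model can be summarised briefly: a hidden behavioural module z_t follows a Markov chain, and within each module the observed features evolve by module-specific linear autoregression plus noise. The toy simulation below illustrates only this structure; the dimensions, parameters, and number of modules are arbitrary.

    ```python
    import numpy as np

    def simulate_arhmm(trans, A, b, sigma, T, dim, seed=None):
        """Sample from a toy AR(1)-HMM.

        trans: (K, K) state transition matrix
        A, b:  per-state autoregressive matrices (K, dim, dim) and offsets (K, dim)
        sigma: observation noise scale
        """
        rng = np.random.default_rng(seed)
        K = trans.shape[0]
        z = np.zeros(T, dtype=int)
        x = np.zeros((T, dim))
        for t in range(1, T):
            z[t] = rng.choice(K, p=trans[z[t - 1]])            # Markov module switch
            x[t] = A[z[t]] @ x[t - 1] + b[z[t]] \
                   + sigma * rng.standard_normal(dim)          # AR(1) emission
        return z, x
    ```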

  2. Evaluation Summary:

    Jia et al. present an exciting machine learning framework named "Selfee" for unsupervised and objective analysis of animal behavior that should draw broad interest from researchers studying quantitative animal behavior. However, there are some unresolved issues for establishing credibility of the method that need to be addressed.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    The authors sought to create a machine learning framework for analyzing video recordings of animal behavior, which is both efficient and runs in an unsupervised fashion. The authors construct Selfee from recent computational neural network codes. As the paper is methods-focused, the key metrics for success would be (1) whether Selfee performs similarly to or more accurately than existing methods, and more importantly (2) whether Selfee uncovers new behavioral features or dynamics otherwise missed by those existing methods.

    Strengths:
    * The authors put their work in context very well, discussing machine learning approaches to behavior extraction generally, and clearly stating the unique aspects of their own approach. The schematic framework of Selfee is nicely described.
    * The authors use their new methods on existing data sets, mostly in adult Drosophila but also in rodents, with the resulting outputs confirming and accurately classifying known behaviors, in agreement with manual annotation.
    * The analysis focuses on behavior video that depicts interactions between animals, typically more difficult than either individual animal video or video with noninteracting animals. This adds to the strength of the method.
    * Experiments with mutants and Kir-silenced lines were nicely designed, and highlighted Selfee's anomaly detection methods by finding a short-time-scale behavior unlikely to be noticed by manual human observation.
    * Similarly, experiments investigating Trh in flies were very thorough and detailed, and illustrate the effectiveness of the machine learning analysis when combined with follow-up experiments to investigate Selfee's initial findings.

    Weaknesses:
    * Although the basic schematics of Selfee are laid out, and the code itself is available, I feel that material in between these two levels of description is somewhat lacking. Details of what other previously published machine learning code makes up Selfee, and how those parts work would be helpful. Some of this is in the methods section, but an expanded version aimed at a more general readership would be helpful.
    * The paper highlights efficiency as an important aspect of machine learning analysis techniques in the introduction, but there is little follow-up on this aspect.
    * In comparing Selfee to other approaches, the paper uses DeepLabCut, but perhaps running other recent methods for more comprehensive comparison would be helpful as well.
    * Using Selfee to investigate courtship behavior and other interactions was nicely demonstrated. Running it on simpler data (say, videos of individual animals walking around or exploring a confined space) might more broadly establish the method's usefulness.

    Overall, the results of the paper seem to clearly achieve what was set out in the introduction, which was to use an unsupervised machine learning video analysis method to uncover new features of behavior. The experiments establishing the effectiveness seem very sound and reasonable.

    For a reader who does not work directly with implementing machine learning, the paper is highly readable and interesting and should generate interest to a wider audience of researchers who would wish to try Selfee on their own data. The paper could have made it a bit more clear how an inexperienced user might deploy the Selfee software, whether with one of the model systems used here or a different one. But there is certainly something very appealing about an unsupervised method, which has the potential to be more accessible to a wider audience of researchers, allowing more people to take advantage of sophisticated behavior analysis.

  4. Reviewer #2 (Public Review):

    Jia et al. present a CNN-based tool named "Selfee" for unsupervised quantification of animal behavior that could be used for objectively analyzing animal behavior recorded in relatively simple setups commonly used by various neurobiology/ethology laboratories. This work is very relevant but has some serious unresolved issues that need to be addressed to establish the credibility of the method.

    Overall Strengths: Jia et al have leveraged a recent development "Simple Siamese CNNs" to work for behavioral segmentation. This is a terrific effort and theoretically very attractive.

    Overall Weakness: Unfortunately, the data supporting the method are not as promising. The manuscript is also riddled with incomplete information and a lack of rationale behind the experiments.

    Specific points of concern:

    1. No formal comparison with pre-existing methods like JAABA, which would work on the same kind of videos as Selfee.
    2. For all Drosophila behavior experiments, I'm concerned about the control and test genetic background. Several studies have reported that social behaviors like courtship and aggression are highly visual and sensitive to genetic background and the presence of the "white" gene. The authors use Canton S (CS) flies as control data, whereas it is unclear if any or all of the test genotypes have been crossed into this background. It would be helpful if the authors provided genotype information for the test flies.
    3. Utility of the "anomaly score" rests on Fig 3 data. The authors write that they screened "neurotransmitter-related mutants or neuron silenced lines" (lines 251-252). Yet Figure 3B lacks some of the most commonly occurring neurotransmitter mutant/neuron labeling lines (e.g. acetylcholine, GABA, dopamine; instead there are some neurotransmitter receptor lines, but then again prominent ones are missing). This reduces the credibility of this data.
    4. The utility of AR-HMM following "Selfee" analysis rests on the IR76b mutant experiment (Fig 4). This is the most perplexing experiment! There are so many receptors implicated in courtship, and IR76b is definitely not among the most well-known. None of the citations for IR76b in this manuscript have anything to do with detection of female pheromones. IR76b is implicated in salt and amino acid sensation. The authors still call this "an extensively studies (co)receptor that is known to detect female pheromones" (lines 310-311). Unsurprisingly the AR-HMM analysis doesn't find any difference in modules related to courtship. Unless I'm mistaken, the premise for this experiment is wrong and hence not much weight should be given to its results.

    Concluding remarks: The method has some promise but the authors have not presented a proper rationale for the parameters they have chosen and the experiments they performed for testing this tool. Furthermore, not all information is easily accessible (e.g. lack of genotype info) and hence there is little reason why a new user would turn to this method over existing alternatives.

  5. Reviewer #3 (Public Review):

    This paper is describing a machine learning method applied to videos of animals. The method requires very little pre-processing (end-to-end) such as image segmentation or background subtraction. The input images have three channels, mapping temporal information (live-frames). The architecture is based on twin deep neural networks (Siamese network) and does not require human-annotated labels (unsupervised learning). However, labels can still be used if they are produced, as in this case, by the algorithm itself - self-supervised learning. This flavor of machine learning is reflected in the name of the method: "Selfee." The authors convincingly apply Selfee to several challenging animal behavior tasks, which results in biologically relevant discoveries.

    A significant advantage of unsupervised and self-supervised learning is twofold: 1) it allows for discovering new behaviors, and 2) it doesn't require human-produced labels.

    In this case of self-supervised learning the features (meta-representations) are learned from two views of the same original image (live-frame), where one of the views is augmented in several different ways, with the hope of letting the deep neural network (ResNet-50 architecture in this case) learn to ignore such augmentations, i.e. learn meta-representations invariant to natural changes in the data similar to the augmentations. This is accomplished by utilizing a Siamese Convolutional Neural Network (CNN) with the ResNet-50 version as a backbone. Siamese networks are composed of twin deep nets, where each member of the pair is trying to predict the output of the other. In applications such as face recognition they normally work in the supervised learning setting, by utilizing "triplets" containing "negative samples." These are the labels.

    However, in the self-supervised setting, which "Selfee" is implementing, the negative samples are not required. Instead the same image (a positive sample) is viewed twice, as described above. Here the authors use the SimSiam core architecture described by Chen, X. & He, K (reference 29 in the paper). They add Cross-Level Discrimination (CLD) to the SimSiam core. Together these two components provide two Loss functions (Loss 1 and Loss 2). Both are critical for the extraction of useful features. In fact, removing the CLD causes major deterioration of the classification performance (Figure 2-figure supplement 5).

    The authors demonstrate the utility of the Selfee by using the learned features (meta-representations) for classification (supervised learning; with human annotation), discovering short-lasting new behaviors in flies by anomaly detection, long time-scale dynamics by AR-HMM, and Dynamic Time Warping (DTW).

    For the classification the authors use k-NN (flies) and LightGBM (mice) classifiers and they infer the labels from the Selfee embedding (for each frame), and the temporal context, using the time-windows of 21 frames and 81 frames, for k-NN classification and LightGBM classification, respectively. Accounting for the temporal context is especially important in mice (LightGBM classification) so the authors add additional windowed features, including frequency information. This is a neat approach. They quantify the classification performance by confusion matrices and compute the F1 for each.

    Overall, I find these classification results compelling, but one general concern is the criticality of the CLD component for achieving any meaningful classification. I would suggest that the authors discuss in more depth why this component is so critical for the extraction of features (used in supervised classification) and compare their SimSiam architecture to other methods where the CLD component is implemented. In other words, to what degree is the SimSiam implementation an overkill? Could a simpler (and thus faster) method be used - with the CLD component - instead to achieve similar end-to-end classification? The answer would help illuminate the importance of the SimSiam architecture in Selfee.

    One potential issue with unsupervised/self-supervised learning is that it "discovers" new classes based, not on behavioral features but rather on some other, irrelevant, properties of the video, e.g. proximity to the edges, a particular camera angle, or a distortion. In supervised learning the algorithm learns the features that are invariant to such properties, because human-made labels are used and humans are great at finding these invariant features. The authors do mention a potential limitation, related to this issue, in the Discussion ("mode splitting"). One way of getting around this issue, other than providing negative samples, is to use a very homogeneous environment (so that only invariance to orientation, translation, etc, needs to be accomplished). This has worked nicely, for example, with posture embedding (Berman, G. J., et al; reference 19 in the manuscript). Looking at the t-SNE plots in Figure 2 one must wonder how many of the "clusters" present there are the result of such learning of irrelevant (for behavior) features, i.e. how good is the generalization of the meta-representations. The authors should explore the behaviors found in different parts of the t-SNE maps and evaluate the effect of the irrelevant features on their distributions. For example, they may ask: to what extent does the distance of an animal from the nearest wall affect the position in the t-SNE map? It would be nice to see how various simple pre-processing steps might affect the t-SNE maps, as well as the classification performance. Some form of segmentation, even very crude, or simply background subtraction, could go a very long way towards improving the features learned by Selfee.

    The anomaly detection is used to find unusual short-lasting events during male-male interaction behavior (Figure 3). The method is explained clearly. The results show how Selfee discovered a mutant line with a particularly high anomaly score. The authors managed to identify this behavior as "brief tussle behavior mixed with copulation attempts." The anomaly detection analyses were also applied to discover another unusual phenotype (close body contact) in another mutant line. Both results are significant when compared to the control groups.

    The authors then apply AR-HMM and DTW to study the time dynamics of courtship behavior. Here too, they discover two phenotypes with unusual courtship dynamics, one in an olfactory mutant, and another in flies where the mutation affects visual transduction. Both results are compelling.

    The authors explain their usage of DTW clearly, but they should expand the description of the AR-HMM so that the reader doesn't have to study the original sources.

    Overall this paper introduces a potentially useful tool as well as several interesting biological results obtained by applying it to videos with very little pre-processing. Both, the method and the results are convincing.