Social-affective features drive human representations of observed actions

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This study investigates and characterizes the representations of actions in naturalistic movie stimuli. The combination of the analytical techniques and stimulus domain makes the paper likely to be of broad interest to scientists interested in action representation amidst complex sequences. This paper will potentially broaden our understanding of visual action representation and the extraction of such information in natural settings, but clarification of the analyses and aspects of the data is required to strengthen the claims.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)


Abstract

Humans observe actions performed by others in many different visual and social settings. What features do we extract and attend when we view such complex scenes, and how are they processed in the brain? To answer these questions, we curated two large-scale sets of naturalistic videos of everyday actions and estimated their perceived similarity in two behavioral experiments. We normed and quantified a large range of visual, action-related and social-affective features across the stimulus sets. Using a cross-validated variance partitioning analysis, we found that social-affective features predicted similarity judgments better than, and independently of, visual and action features in both behavioral experiments. Next, we conducted an electroencephalography (EEG) experiment, which revealed a sustained correlation between neural responses to videos and their behavioral similarity. Visual, action, and social-affective features predicted neural patterns at early, intermediate and late stages respectively during this behaviorally relevant time window. Together, these findings show that social-affective features are important for perceiving naturalistic actions, and are extracted at the final stage of a temporal gradient in the brain.

Article activity feed

  1. Authors Response:

    Reviewer #2 (Public Review):

    The authors use representational similarity analysis on a combination of behavioral similarity ratings and EEG responses to investigate the representation of actions. They specifically explore the role of visual, action-related, and social-affective features in explaining the similarity ratings and brain responses. They find that social-affective features best explain the similarity ratings, and that visual, action-related, and social-affective features each explain some of the variance in the EEG responses in a temporal progression (from visual to action-related to social-affective).

    The stimulus set is nicely constructed, broadly sampled from a large set of naturalistic stimuli to minimize correlations between features of interest. I'd like to acknowledge and appreciate the work that went into this in particular.

    The analyses of the behavioral similarity judgments are well executed and interesting. The subject exclusion criteria and catch trials for online workers are smart choices, and the authors have tested a good range of models drawn from different categories. I find the case that the authors make for social features as determinants of behavioral similarity ratings to be compelling.

    I have a few questions and requests for additional detail about the EEG analyses. I appreciate that the authors have provided the code they used for all the analyses, and I'm sure that the answers to many if not all of my questions are there, but I don't have access to a Matlab license to run the code. Also, since understanding the code requires familiarity not just with Matlab but also with specific libraries, I think that more description of the analysis in the paper would be appropriate.

    Some more detail is needed in the description of the multivariate classifier analysis. The authors write (line 597-599): "The two pseudotrials were used to train and test the classifier separately at each timepoint, and multivariate noise normalization was performed using the covariance matrix of the training data (Guggenmos et al., 2018). "

    I suspect I'm missing something here, because as written this sounds as if there was only one trial on which to train the classifier, which does not seem compatible with SVM classification. If only one trial was used to train the classifier, that sounds more like nearest-neighbor classification (or something else). Alternatively, if all different pseudo-trial averages - each incorporating a different subset of trials - were used for training, then that would seem to mean that some of the training pseudo-trials contained information from trials that were also averaged into the pseudo-trials used for testing. I don't know if this was done (probably not) but if it was it would constitute contamination of the test set. I think this part of the methods needs more detail so we can evaluate it. How many trials were used to train and to test for each iteration?

    Thank you for raising this issue; we agree that our Methods section was unclear on this point. We used split-half cross-validation. There was one pseudotrial for training per condition (which was obtained by averaging trials). There was no contamination between the training and test sets, because the data was first divided into separate training and test sets, and only afterwards averaged into pseudotrials for classification. This procedure was repeated 10 times with different data splits to obtain more reliable estimates of the classification performance. We rewrote the corresponding section to make this clearer:

    “Split-half cross-validation was used to classify each pair of videos in each participant’s data. To do this, the single-trial data was divided into two halves for training and testing, whilst ensuring that each condition was represented equally. To improve SNR, we combined multiple trials corresponding to the same video into pseudotrials via averaging. The creation of pseudotrials was performed separately within the training and test sets. As each video was shown 10 times, this resulted in a maximum of 5 trials being averaged to create a pseudotrial. Multivariate noise normalization was performed using the covariance matrix of the training data (Guggenmos et al., 2018). Classification between all pairs of videos was performed separately for each time-point. […] The entire procedure, from dataset splitting to classification, was repeated 10 times with different data splits.”

    We also performed the decoding procedure with a higher number of cross-validation folds and found very similar results.
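
    For readers without access to the authors' Matlab code, here is a minimal Python sketch of the pseudotrial-based split-half decoding described above. It is illustrative only: the array layout (trials × channels × timepoints per video), the linear SVM settings, and the Ledoit-Wolf whitening step standing in for the multivariate noise normalization of Guggenmos et al. (2018) are assumptions, not the authors' implementation.

    ```python
    # Minimal sketch of the pseudotrial-based split-half decoding described above.
    # Assumed (not the authors' code): each video's data is an array of shape
    # (n_trials=10, n_channels, n_timepoints); a linear SVM; and Ledoit-Wolf
    # whitening as a stand-in for multivariate noise normalization.
    import numpy as np
    from numpy.random import default_rng
    from scipy.linalg import fractional_matrix_power
    from sklearn.covariance import LedoitWolf
    from sklearn.svm import SVC

    def pairwise_decoding(trials_a, trials_b, n_repeats=10, seed=0):
        """Return a (n_timepoints,) array of decoding accuracies for one video pair."""
        rng = default_rng(seed)
        n_trials, n_channels, n_time = trials_a.shape
        acc = np.zeros(n_time)
        for _ in range(n_repeats):
            # 1) Split single trials into training and test halves *before* any
            #    averaging, so no trial contributes to both sets.
            order = rng.permutation(n_trials)
            train_idx, test_idx = order[:n_trials // 2], order[n_trials // 2:]
            # 2) Average each half into one pseudotrial per condition.
            train = np.stack([trials_a[train_idx].mean(0), trials_b[train_idx].mean(0)])
            test = np.stack([trials_a[test_idx].mean(0), trials_b[test_idx].mean(0)])
            # 3) Whiten channels using a covariance estimated from training data only.
            flat = np.concatenate([trials_a[train_idx], trials_b[train_idx]])
            flat = flat.transpose(0, 2, 1).reshape(-1, n_channels)
            cov = LedoitWolf().fit(flat).covariance_
            whitener = np.real(fractional_matrix_power(cov, -0.5))
            train = np.einsum('ij,cjt->cit', whitener, train)
            test = np.einsum('ij,cjt->cit', whitener, test)
            # 4) Train and test a classifier separately at every timepoint,
            #    with all channels entered as features.
            labels = np.array([0, 1])
            for t in range(n_time):
                clf = SVC(kernel='linear').fit(train[:, :, t], labels)
                acc[t] += clf.score(test[:, :, t], labels)
        return acc / n_repeats
    ```

    Splitting the single trials into halves before averaging them into pseudotrials is what guarantees that no trial contributes to both the training and the test set.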

    I think a bit more detail is also necessary to clarify the features used for the classification. My understanding is that each timepoint was classified as one action vs each other action on the basis of all the electrodes in the EEG for a given temporal window. Is this correct? (I'm guessing / inferring more than a little here.)

    This is correct, and we agree that further clarification was needed in text. We have added this:

    “Classification between all pairs of videos was performed separately for each time-point. Data were sampled at 500 Hz, so each time point corresponded to a non-overlapping 2 ms window of data. Voltage values from all EEG channels were entered as features to the classification model.

    The entire procedure, from dataset splitting to classification, was repeated 10 times with different data splits. The average decoding accuracies between all pairs of videos were then used to generate a neural RDM at each time point for each participant. To generate the RDM, the dissimilarity between each pair of videos was determined by their decoding accuracy (increased accuracy representing increased dissimilarity at that time point).”
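
    As a companion sketch, the pairwise accuracies from such a procedure can be assembled into a time-resolved neural RDM in which higher decoding accuracy means higher dissimilarity. The `pairwise_decoding` helper and the data layout below are the hypothetical names from the previous sketch, not the authors' code.

    ```python
    # Assemble time-resolved neural RDMs from pairwise decoding accuracies.
    # `pairwise_decoding` is the hypothetical helper sketched earlier.
    import numpy as np
    from itertools import combinations

    def build_neural_rdms(data, n_timepoints):
        """data: dict mapping video id -> (n_trials, n_channels, n_timepoints) array."""
        videos = sorted(data)
        n = len(videos)
        rdms = np.zeros((n_timepoints, n, n))
        for i, j in combinations(range(n), 2):
            acc = pairwise_decoding(data[videos[i]], data[videos[j]])  # (n_timepoints,)
            rdms[:, i, j] = rdms[:, j, i] = acc  # higher accuracy = more dissimilar
        return rdms
    ```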

    It would be useful to know how many features constituted each feature space. For example, was motion energy reduced to one summary feature (total optic flow for the whole sequence)? For "pixel value", is that luminance? (I suspect so, since hue is quantified separately, but I don't think this was specified).

    For motion energy, we used the magnitude of the optic flow, and calculated Euclidean distances between the vectorized magnitude maps rather than reducing it to summary features. We have included the dimensionality of each feature in Supplementary File 1b and we now refer to it in text:

    “These features were vectorized prior to computing Euclidean distances between them (see Supplementary File 1b for the dimensionality of each feature).”

    Pixel value was indeed the luminance, and we have clarified this in text.
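
    A minimal sketch of the corresponding model-RDM construction, assuming each video's feature map (e.g., the optic-flow magnitude map or the luminance map) is available as an array; the function name and data layout are illustrative, not the authors' code.

    ```python
    # Build a model RDM by vectorizing each video's feature map and computing
    # pairwise Euclidean distances, as described in the quoted passage.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def feature_rdm(feature_maps):
        """feature_maps: (n_videos, ...) array; returns an (n_videos, n_videos) RDM."""
        vectors = feature_maps.reshape(len(feature_maps), -1)  # vectorize each map
        return squareform(pdist(vectors, metric='euclidean'))
    ```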

    More broadly, I would appreciate a bit more discussion of the role of time in these analyses. Each clip unfolds over half a second, so what should we make of the temporal progression of RDM correlations? Are the social and affective features correlated with later responses because they take more time to compute (neurally speaking), or because they depend on longer temporal integration of information? These two are not even exactly mutually exclusive, and I realize that it may be difficult to say with certainty based on this data, but I think some discussion of this issue would be appropriate.

    This is a great point, although it is difficult to speculate based on this data. One way to get at this would be to examine how much social-affective processing relies on previously extracted features. Future work could look at the causality between early and later-stage EEG features (unfortunately our post-hoc attempts to address this via Granger-causal analysis were unsuccessful, likely due to insufficient SNR with our specific experimental design). Alternatively, this could be investigated in a follow-up experiment that varies how social information unfolds over time (e.g., images vs. videos or varying video duration). We now discuss this possibility in the manuscript:

    “Given the short duration of our videos and the relatively long timescale of neural feature processing, it is possible that social-affective features are the result of ongoing processing relying on temporal integration of the previously extracted features. However, more research is needed to understand how these temporal dynamics change with continuous visual input (e.g. a natural movie), and whether social-affective features rely on previously extracted information.”
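
    For illustration only (the authors note that their own Granger-causal attempt was unsuccessful), a generic way to ask whether an early feature's RSA time course helps predict a later one is a Granger causality test on the two correlation time courses; the function and variable names below are assumptions, not the authors' analysis.

    ```python
    # Generic Granger causality test between two 1-D time courses sampled at the
    # same timepoints (e.g., RSA correlations for an early and a late feature).
    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def granger_p_value(early_timecourse, late_timecourse, maxlag=5):
        """p-value for 'early_timecourse helps predict late_timecourse'."""
        # Column order: the variable to be predicted first, the candidate cause second.
        data = np.column_stack([late_timecourse, early_timecourse])
        results = grangercausalitytests(data, maxlag=maxlag, verbose=False)
        # F-test p-value at the largest lag considered.
        return results[maxlag][0]['ssr_ftest'][1]
    ```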

  2. Evaluation Summary:

    This study investigates and characterizes the representations of actions in naturalistic movie stimuli. The combination of the analytical techniques and stimulus domain makes the paper likely to be of broad interest to scientists interested in action representation amidst complex sequences. This paper will potentially broaden our understanding of visual action representation and the extraction of such information in natural settings, but clarification of the analyses and aspects of the data is required to strengthen the claims.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)

  3. Reviewer #1 (Public Review):

    This study examined the features and the corresponding temporal dynamics underlying the ability to understand actions from naturalistic visual stimuli. To this aim, the authors applied a combination of a priori feature labels and behavioural and neural measures of representational spaces to a large-scale dataset. Features were selected on the basis of previous studies and were divided into visual, action-related and socio-affective features. The authors determined the features that best explained behavioural data (measured via a multi-arrangement task of the same actions) and the corresponding neural dynamics (measured with EEG). The authors observed that socio-affective features predicted behavioural data better than visual and action-related features, with a temporal progression from visual to action-related to socio-affective features.

    Strengths:

    The manuscript is well written and addresses an important gap in the literature, namely, the features that contribute to the understanding of visual actions and the underlying neural dynamics. In contrast to most previous studies that used a relatively small range of actions, the current study used a large-scale dataset of visual actions and a range of a priori feature labels, as well as behavioural multi-arrangement data and measures of the neural representational space of these actions using EEG. The authors should be commended for replicating their behavioural data in two separate data sets, and for trying to minimize the correlations between features.

    Weaknesses:

    The authors wished to distinguish between visual, action-related and socio-affective features, but the assignment of features to these different domains was not always straightforward (as an example, should the number of agents be considered a visual or a socio-affective feature?). Moreover, whereas the authors tried to minimize the correlations between features, some of the correlations were still significant, which may have biased the estimates of the beta weights and, in turn, the variance partitioning analysis.

  4. Reviewer #2 (Public Review):

    The authors use representational similarity analysis on a combination of behavioral similarity ratings and EEG responses to investigate the representation of actions. They specifically explore the role of visual, action-related, and social-affective features in explaining the similarity ratings and brain responses. They find that social-affective features best explain the similarity ratings, and that visual, action-related, and social-affective features each explain some of the variance in the EEG responses in a temporal progression (from visual to action-related to social-affective).

    The stimulus set is nicely constructed, broadly sampled from a large set of naturalistic stimuli to minimize correlations between features of interest. I'd like to acknowledge and appreciate the work that went into this in particular.

    The analyses of the behavioral similarity judgments are well executed and interesting. The subject exclusion criteria and catch trials for online workers are smart choices, and the authors have tested a good range of models drawn from different categories. I find the case that the authors make for social features as determinants of behavioral similarity ratings to be compelling.

    I have a few questions and requests for additional detail about the EEG analyses. I appreciate that the authors have provided the code they used for all the analyses, and I'm sure that the answers to many if not all of my questions are there, but I don't have access to a Matlab license to run the code. Also, since understanding the code requires familiarity not just with Matlab but also with specific libraries, I think that more description of the analysis in the paper would be appropriate.

    Some more detail is needed in the description of the multivariate classifier analysis. The authors write (line 597-599): "The two pseudotrials were used to train and test the classifier separately at each timepoint, and multivariate noise normalization was performed using the covariance matrix of the training data (Guggenmos et al., 2018). "

    I suspect I'm missing something here, because as written this sounds as if there was only one trial on which to train the classifier, which does not seem compatible with SVM classification. If only one trial was used to train the classifier, that sounds more like nearest-neighbor classification (or something else). Alternatively, if all different pseudo-trial averages - each incorporating a different subset of trials - were used for training, then that would seem to mean that some of the training pseudo-trials contained information from trials that were also averaged into the pseudo-trials used for testing. I don't know if this was done (probably not) but if it was it would constitute contamination of the test set. I think this part of the methods needs more detail so we can evaluate it. How many trials were used to train and to test for each iteration?

    I think a bit more detail is also necessary to clarify the features used for the classification. My understanding is that each timepoint was classified as one action vs each other action on the basis of all the electrodes in the EEG for a given temporal window. Is this correct? (I'm guessing / inferring more than a little here.)

    It would be useful to know how many features constituted each feature space. For example, was motion energy reduced to one summary feature (total optic flow for the whole sequence)? For "pixel value", is that luminance? (I suspect so, since hue is quantified separately, but I don't think this was specified).

    More broadly, I would appreciate a bit more discussion of the role of time in these analyses. Each clip unfolds over half a second, so what should we make of the temporal progression of RDM correlations? Are the social and affective features correlated with later responses because they take more time to compute (neurally speaking), or because they depend on longer temporal integration of information? These two are not even exactly mutually exclusive, and I realize that it may be difficult to say with certainty based on this data, but I think some discussion of this issue would be appropriate.