Quantifying dynamic facial expressions under naturalistic conditions

Curation statements for this article:
  • Curated by eLife

Abstract

Facial affect is expressed dynamically – a giggle, grimace, or an agitated frown. However, the characterisation of human affect has relied almost exclusively on static images. This approach cannot capture the nuances of human communication or support the naturalistic assessment of affective disorders. Using the latest in machine vision and systems modelling, we studied dynamic facial expressions of people viewing emotionally salient film clips. We found that the apparent complexity of dynamic facial expressions can be captured by a small number of simple spatiotemporal states – composites of distinct facial actions, each expressed with a unique spectral fingerprint. Sequential expression of these states is common across individuals viewing the same film stimuli but varies in those with the melancholic subtype of major depressive disorder. This approach provides a platform for translational research, capturing dynamic facial expressions under naturalistic conditions and enabling new quantitative tools for the study of affective disorders and related mental illnesses.

Article activity feed

  1. Evaluation Summary

    This paper describes the development and validation of an automatic approach that leverages machine vision and learning techniques to quantify dynamic facial expressions of emotion. The potential clinical and translational significance of this automated approach is then examined in a "proof-of-concept" follow-on study, which leveraged video recordings of depressed individuals watching humorous and sad video clips.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their name with the authors.)

  2. Reviewer #1 (Public Review):

    This manuscript aims to compare automatic measurements of the facial behavior of participants with and without melancholic depression while they watch humorous and sad video stimuli. A data-driven method for representing each participant's automatically measured facial behavior dynamics was developed and demonstrated on a publicly available dataset of 27 healthy participants' facial reactions to video stimuli (DISFA) and then applied to a private dataset of 38 healthy controls and 30 participants with melancholic depression watching different video stimuli. Parameters of the model were statistically compared and visualized between groups, and machine learning models trained on these parameters were compared to models trained on baseline features.

    1. The introduction, in my opinion, overstates the accuracy/reliability with which facial expressions can be automatically recognized. It is important to consider (and mention) the difference between posed and spontaneous expressions and the challenge of domain transfer (Cohn et al., 2019).

    2. Given how much these methods rely on the AU estimates and how much of the interpretation is given in terms of AUs, providing direct validity evidence for these estimates is quite important. Please report the per-AU accuracy of OpenFace on DISFA (as compared to the human coding). Note explicitly that OpenFace was trained on DISFA, so this reported accuracy is likely an overestimate of how it would perform on truly new data (e.g., your depression dataset). The absence of validity evidence in the depression dataset itself should be listed as a limitation (a per-AU agreement check is sketched after this list).

    3. The small sample size should be noted as a limitation. I know the difficulties of collecting this type of data firsthand, but it is an important limitation on generalizability nonetheless.

    4. Are 100 bootstrap resamples enough for stable uncertainty estimation? Please provide a rationale for the selection of this number; a simple stability check is sketched after this list.

    5. The decision to drop one of the two positive stimulus videos from the melancholia analysis needs justification. Given that the differences between groups appeared smaller in this video (at least in what was shown in the visualizations), dropping it may make the difference between groups appear larger or more consistent than the full data would support.

    6. For the SVM described on page 24, please clarify whether the observations were assigned to folds by cluster (i.e., by participant) or whether observations of the same participant could appear in both the training and testing sets on any given iteration. (The former is more rigorous.) Please also clarify whether the folds were stratified by class (as this has implications for the interpretation of the accuracy metric); a grouped, stratified cross-validation setup is sketched after this list.

    7. The performance of the competing SVM models should be statistically compared using a mixed effects model (e.g., Corani et al., 2017); a fold-corrected comparison is sketched after this list.
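
    Regarding point 2, a minimal Python sketch of the kind of per-AU validity check meant there, assuming OpenFace's frame-level AU intensity columns (AU01_r, AU02_r, ...) have been merged with the DISFA manual intensity codes into a single table; the file name and the "_manual" column names are illustrative placeholders, not the authors' actual data layout:

        import pandas as pd
        from scipy.stats import pearsonr

        # Hypothetical merged table: one row per frame, OpenFace estimates
        # ("AU01_r", ...) alongside human-coded DISFA intensities ("AU01_manual", ...).
        df = pd.read_csv("disfa_frames_with_openface.csv")

        for au in ["AU01", "AU02", "AU04", "AU06", "AU12", "AU15", "AU17", "AU25"]:
            pred = df[f"{au}_r"]         # OpenFace intensity estimate (0-5 scale)
            manual = df[f"{au}_manual"]  # DISFA human coding (0-5 scale)
            r, _ = pearsonr(pred, manual)
            print(f"{au}: Pearson r vs. manual coding = {r:.2f}")

    Intra-class correlation or per-AU F1 on binarised occurrence would serve equally well; the essential point is that agreement should be reported per AU rather than in aggregate.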
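
    Regarding point 4, one way to justify (or revise) the choice of 100 resamples is to show that the interval estimates have stabilised; a rough sketch, with a generic statistic and synthetic values standing in for whatever quantity is actually bootstrapped:

        import numpy as np

        rng = np.random.default_rng(0)
        values = rng.normal(size=68)   # placeholder for the per-subject values

        def bootstrap_ci(x, n_boot, rng):
            means = [np.mean(rng.choice(x, size=len(x), replace=True))
                     for _ in range(n_boot)]
            return np.percentile(means, [2.5, 97.5])

        # If the interval endpoints barely move between 100 and, say, 2000
        # resamples, 100 was arguably sufficient; if they still drift, it was not.
        for n_boot in (100, 500, 1000, 2000):
            lo, hi = bootstrap_ci(values, n_boot, rng)
            print(f"B = {n_boot:4d}: 95% CI = [{lo:.3f}, {hi:.3f}]")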
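
    Regarding point 6, scikit-learn makes the distinction concrete; a sketch assuming one feature vector per observation and a participant identifier per row (the arrays below are synthetic placeholders, not the authors' data):

        import numpy as np
        from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10))                 # features (placeholder)
        y = np.repeat(rng.integers(0, 2, size=50), 4)  # one label per participant
        groups = np.repeat(np.arange(50), 4)           # participant ID per row

        # StratifiedGroupKFold keeps every observation from a given participant in
        # the same fold (no train/test leakage) while keeping each fold's class
        # balance roughly equal; plain GroupKFold does the former without the latter.
        cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
        scores = cross_val_score(SVC(kernel="linear"), X, y, groups=groups, cv=cv)
        print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

    If the original analysis did not group folds by participant, the reported accuracy likely benefits from identity leakage.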
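
    Regarding point 7, even short of the full hierarchical model of Corani et al. (2017), the comparison should at least correct for the dependence between cross-validation folds; a minimal sketch of a Nadeau-Bengio-style corrected t-test on per-fold accuracy differences (the numbers are placeholders):

        import numpy as np
        from scipy import stats

        # Per-fold accuracy differences between the two competing SVMs
        # across one run of k-fold cross-validation (placeholder values).
        diff = np.array([0.05, 0.02, -0.01, 0.04, 0.03, 0.06, 0.00, 0.02, 0.05, 0.01])
        k = len(diff)
        rho = 1.0 / k          # test-set fraction for k-fold cross-validation

        mean = diff.mean()
        var = diff.var(ddof=1)
        # The naive 1/k term is augmented by rho / (1 - rho) to account for the
        # correlation between folds, which share most of their training data.
        se = np.sqrt((1.0 / k + rho / (1.0 - rho)) * var)
        t_stat = mean / se
        p = 2 * stats.t.sf(abs(t_stat), df=k - 1)
        print(f"corrected t = {t_stat:.2f}, p = {p:.3f}")

    The same per-fold differences could equally feed the Bayesian correlated t-test or the hierarchical model of Corani et al. (2017); the essential point is not to treat fold-level scores as independent.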

  3. Reviewer #2 (Public Review):

    The authors use machine learning to relate videos of facial expressions to clinically relevant features.

    This is a well-written, very clear paper that outlines a novel procedure to assess a set of features that is very easy and cheap to collect within a clinical context. The methods are relatively straightforward (which is a good thing), and they are technically applied without flaws as far as I can tell.

    I would just wonder about the actual path to clinical translation, if that's the aim. How could this pipeline actually be applied in practice? Would a doctor be able to make effective use of it? Is it intended as a first (cheap and automated) step in a diagnostic procedure?