The Face Module Emerged in a Deep Convolutional Neural Network Selectively Deprived of Face Experience
This article has been reviewed by the following groups:
- Evaluated articles (eLife)
Abstract
Can we recognize faces with no prior experience of faces? This question is critical because it examines the role of experience in the formation of domain-specific modules in the brain. Investigations with humans and non-human animals cannot easily dissociate the effect of visual experience from that of hardwired domain-specificity. Therefore, the present study built a model of selective deprivation of face experience using a representative deep convolutional neural network, AlexNet, by removing all images containing faces from its training stimuli. This model showed no significant deficits in face categorization and discrimination, and face-selective modules emerged automatically. However, the deprivation reduced the domain-specificity of the face module. In sum, our study provides empirical evidence on the role of nature versus nurture in developing domain-specific modules: domain-specificity may evolve from non-specific experience without genetic predisposition and is further fine-tuned by domain-specific experience.
Article activity feed
###Reviewer #2:
General Assessment:
The role of visual experience with faces in the formation of face-specific neural "modules" is tested in a deep convolutional neural network model of object recognition, AlexNet. A modified version of the ILSVRC-2012 training dataset was constructed by removing all images with primate faces, removing remaining categories with fewer than 640 images, and re-training the deprived network: d-Alexnet. d-Alexnet was compared to pre-trained Alexnet on classification performance, quality of fit to fMRI data, strength of face-selectivity, representational similarity, and learned receptive field properties. The authors argue that face-selectivity is significantly reduced, but not eliminated, with the deprivation, and that this reduction is consistent with an interpretation that d-Alexnet represents faces more similarly to objects than Alexnet. While this work is well-motivated and timely, there are substantial issues in the conceptual approach, the methods used, clarity of the results, and most importantly, the strength of the conclusions.
Major Concerns:
- The validity of these results is uncertain due to a) insufficient reproducibility within this work and b) fragile definitions of face-selectivity.
a) Given that small changes in weight initialization or training procedure can have a large effect on learned representations (see Mehrer et al. 2020, https://www.biorxiv.org/content/10.1101/2020.01.08.898288v1.abstract ), the authors must demonstrate that their results hold across multiple initializations of each network type. Several key results hinge on the number and identity of "face-selective" channels (Figure 2, 3c-e) and only a single instance of each model type is used. In particular, the result that 2/256 channels are "selective" in d-Alexnet compared to 4/256 in Alexnet is likely sensitive to small variations in the methods, including the choice of evaluation stimuli and the initialization of the weights. If the models were re-trained, could the ratio be 4 channels to 4 channels, 0 channels to 2 channels, or some other result? With only a single instance of each model and such a small (and potentially unstable) number of face-selective channels in each model, I am not convinced that these results support the claims made.
SUGGESTION: Report results averaged across multiple initializations of each model to demonstrate robustness. Statistical tests should be conducted across models (as if they were individual subjects) to demonstrate the significance of any effects found.
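To make this concrete, a minimal sketch of such an across-instance test (the values below are placeholders, not results from the paper; each entry would be, e.g., the number of face-selective conv5 channels in one re-trained instance):

```python
"""Sketch: treat each re-trained network instance as a "subject" and test
whether face deprivation changes face-selectivity across random seeds."""
import numpy as np
from scipy import stats

# One selectivity summary per re-trained instance (placeholder values).
sel_full = np.array([4, 5, 3, 4, 6, 4, 5, 3, 4, 5], dtype=float)
sel_deprived = np.array([2, 1, 2, 3, 1, 2, 2, 0, 2, 1], dtype=float)

# Paired test across seed-matched instances of the two model types.
t, p = stats.ttest_rel(sel_full, sel_deprived)
print(f"full: {sel_full.mean():.2f} +/- {sel_full.std(ddof=1):.2f}, "
      f"deprived: {sel_deprived.mean():.2f} +/- {sel_deprived.std(ddof=1):.2f}")
print(f"paired t({len(sel_full) - 1}) = {t:.2f}, p = {p:.4f}")
```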
b) The definition of "selectivity" is potentially fragile and may not hold when tested with more standard evaluation sets. In the primate face-selectivity literature, functional localizers are used to compare face responses to non-face responses. These localizers have much stronger controls over low-level features than the stimuli used to evaluate selectivity in this work. I am especially concerned that the faces (from FITW) differ from non-face objects (from Caltech-256) in low-level properties such as image resolution, pose, background, contrast, luminance, and more. Furthermore, selectivity is typically defined in the field as a continuous quantity (e.g., t-contrast, d-prime, face-selectivity index) and is not often assessed in a binary fashion by the number of units significantly more responsive to faces than to the second-best category. Many of these continuous metrics also incorporate the variance of responses as well as their mean. Thus, the designation of channels as "selective" or "not-selective" in this work based on mean responses to only 2 of the 205 categories (L101) prevents the reader from understanding how the distribution of face-selectivity shifted under the deprivation, which is one of the primary claims. Instead, we only see the number of selective channels after a binary cutoff, which may be sensitive to initialization and to the stimulus set used to evaluate selectivity.
SUGGESTION: Compute selectivity using evaluation sets in which faces are better matched to non-face objects. Report the distribution of selectivity for each channel before and after deprivation.
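For instance, a d-prime-style index that incorporates both the mean and the variance of responses could be computed per channel along these lines (a sketch; the array names, shapes, and placeholder activations are illustrative assumptions, not the authors' pipeline):

```python
import numpy as np

def dprime_selectivity(face_resp, object_resp):
    """d'-style face-selectivity index per channel.

    face_resp:   (n_face_images, n_channels) channel activations to faces
    object_resp: (n_object_images, n_channels) activations to non-face objects
    Returns one continuous value per channel (larger = more face-selective),
    rather than a binary selective/not-selective label.
    """
    mu_f, mu_o = face_resp.mean(axis=0), object_resp.mean(axis=0)
    var_f = face_resp.var(axis=0, ddof=1)
    var_o = object_resp.var(axis=0, ddof=1)
    return (mu_f - mu_o) / np.sqrt((var_f + var_o) / 2 + 1e-12)

# Usage: compare the full distribution across all 256 channels for AlexNet
# vs d-AlexNet, instead of counting channels above a significance cutoff.
rng = np.random.default_rng(0)
faces = rng.normal(1.0, 0.5, size=(200, 256))    # placeholder activations
objects = rng.normal(0.8, 0.5, size=(800, 256))  # placeholder activations
print(dprime_selectivity(faces, objects)[:5])
```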
- Because one model in the comparison is pre-trained and the other is trained from scratch, there is the possibility that all of the differences between the models are due to differences in the training that are independent from the content of the training images.
a) In the regression analysis, is it the case that non-selective channels also show differences in R2? For example, if d-Alexnet is worse on the training task (d-ImageNet) than Alexnet, we would expect a general reduction in its ability to explain neural responses (see, e.g., Yamins et al., 2014). The claim that face-selectivity is specifically impaired in d-Alexnet needs to be supported by a demonstration that non-selective channels are equally good (or poor) fits to vertices in face-selective regions. Furthermore, the authors do not demonstrate that face-selective channels are better than non-selective channels in either model type, which is useful context for understanding whether the correspondence between face-selective channels and face-selective brain regions is meaningful.
SUGGESTION: report non-selective channel fits to the same vertices for each model type and compare to face-selective channel fits.
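A minimal sketch of that comparison, assuming cross-validated linear encoding fits (all variable names and the placeholder data are illustrative, not the authors' data):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def channel_fit_r2(X, y):
    """Cross-validated R^2 of a channel-activation design matrix X
    (n_timepoints, n_channels) predicting one vertex time course y."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Placeholder data: activations of face-selective vs matched non-selective
# channels to the same movie, and one vertex in a face-selective region.
rng = np.random.default_rng(1)
X_sel = rng.normal(size=(500, 4))
X_non = rng.normal(size=(500, 4))
y = X_sel @ rng.normal(size=4) + rng.normal(scale=2.0, size=500)

print("face-selective channels: R^2 =", round(channel_fit_r2(X_sel, y), 3))
print("non-selective channels:  R^2 =", round(channel_fit_r2(X_non, y), 3))
```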
b) L366: the authors write that "the d-Alexnet was initialized with values drawn from a uniform distribution". This is not standard practice; in fact, the kernel weights in the original AlexNet model were initialized from a Gaussian distribution. To make comparisons to the non-deprived model, the authors need to also retrain the non-deprived model to account for the potential confounds between their training/initialization procedure and that used in the pre-training.
SUGGESTION: re-train the non-deprived AlexNet in-house, then compare that model to d-AlexNet.
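To make the confound concrete, a minimal PyTorch sketch of the two initialization schemes (the uniform bounds are an assumption, since the manuscript does not state them; the original AlexNet used zero-mean Gaussians with std 0.01, with bias handling simplified here):

```python
import torch.nn as nn
from torchvision.models import alexnet

def init_uniform(m):
    # Scheme described in the manuscript (L366); bounds assumed here.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.uniform_(m.weight, a=-0.01, b=0.01)
        nn.init.zeros_(m.bias)

def init_gaussian(m):
    # Original AlexNet scheme (Krizhevsky et al., 2012): zero-mean Gaussian.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

# Control for the confound: train both models in-house from the same
# initialization scheme, so they differ only in the training images.
model_full = alexnet(weights=None).apply(init_gaussian)
model_deprived = alexnet(weights=None).apply(init_gaussian)
```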
- A major conceptual issue is in the definition of a "face module". Despite "face module" in the title, a working definition of "face module" is not clearly provided in the manuscript. Context clues suggest that the authors may consider any face-specific process evidence of a "face module", but the experiments performed indicate that a specific set of criteria were explored: selectivity for faces, different representations for faces and non-face objects, holistic processing, etc. Especially given that the results of this work indicate some residual face-selectivity, a clear definition of "face module" - grounded in the existing literature - is needed to evaluate the claims provided.
SUGGESTION: clearly define what the "face module" is in the brain, then explain what the corresponding evidence for a "face module" would be in the DCNN.
- A number of analyses are not well-motivated or are lacking in detail.
a) The analysis of the "empirical receptive field" is lacking in detail and motivation, and the color scale is both nonlinear and missing a label. Specific questions (one possible gradient-based definition is sketched after the list):
i) How should this result be compared to data in primate face-selective regions?
ii) Is this result a trivial consequence of the difference in number of activated units (panel D)?
iii) What are the units of the colormap?
iv) Why are only two channels shown for AlexNet if 4 channels are face-selective?
v) Is the extent of the empirical receptive field quantified?
vi) How should the reader think about empirical receptive fields in a weight-shared convolutional architecture?
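For reference, one possible gradient-based definition of an empirical receptive field, sketched in PyTorch (this is one common choice, not necessarily the authors' procedure; the network weights and stimuli below are placeholders):

```python
import torch
from torchvision.models import alexnet

def empirical_rf(model, layer, channel, images):
    """Empirical receptive field of one conv channel: the average over
    images of |d(mean channel activation)/d(input pixel)|. One of several
    possible definitions; its units are mean absolute gradient."""
    feats = {}
    hook = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    rf = torch.zeros(images.shape[-2:])
    for img in images:
        x = img.unsqueeze(0).requires_grad_(True)
        model(x)
        feats["out"][0, channel].mean().backward()
        rf += x.grad.abs().sum(dim=1).squeeze(0)
    hook.remove()
    return rf / len(images)

model = alexnet(weights=None).eval()   # placeholder weights
stimuli = torch.randn(8, 3, 224, 224)  # placeholder stimuli
# model.features[10] is the final conv layer (256 channels) in torchvision's AlexNet.
rf_map = empirical_rf(model, model.features[10], channel=0, images=stimuli)
print(rf_map.shape)  # (224, 224) map over input pixels
```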
b) The evaluation of the face-inversion test is poorly motivated. The face-inversion effect indicates that human subjects are better at remembering upright faces than inverted faces. However, the analysis performed here evaluates the magnitude of the response of face-selective channels. If anything, a classification task is needed to compare to the human task, because the "face inversion effect" cited is not simply that face-selective units respond more strongly to upright than inverted faces, but that the activation of the units supports differences in classification between upright and inverted faces.
SUGGESTION: At minimum, 1) justify why the magnitude of channel response is a good measure of the face-inversion effect, or 2) remove the claim that the models do/don't exhibit the behavioral effect. A classification-based alternative is sketched below.
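A sketch of what such a classification-based inversion test could look like (the features and labels are placeholders standing in for face-selective channel activations and face identities):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inversion_effect(feat_upright, feat_inverted, labels):
    """Train an identity readout on upright faces, then test it on held-out
    upright vs inverted probes; an inversion effect predicts higher
    accuracy for upright than for inverted faces."""
    idx = np.arange(len(labels))
    train = idx % 2 == 0  # simple even/odd split
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feat_upright[train], labels[train])
    acc_up = clf.score(feat_upright[~train], labels[~train])
    acc_inv = clf.score(feat_inverted[~train], labels[~train])
    return acc_up, acc_inv

# Placeholder activations: 10 identities x 8 images each, 256 channels.
rng = np.random.default_rng(2)
labels = np.repeat(np.arange(10), 8)
feat_up = rng.normal(size=(80, 256)) + labels[:, None] * 0.1
feat_inv = rng.normal(size=(80, 256))
print(inversion_effect(feat_up, feat_inv, labels))
```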
###Nancy Kanwisher (Reviewer #1):
Xu et al use deep nets to ask whether face selectivity, and face discrimination performance, can arise in a network that has never seen faces. By painstakingly removing all faces from the training set, and comparing Alexnet trained with and without faces, they claim to find, first, that the face-deprived network does not have deficits in face categorization or discrimination (relative to the same network trained with faces), second that the face-deprived network showed some face-selectivity, and third that face deprivation reduced face selectivity. They conclude that "domain-specificity may evolve from non-specific experience without genetic predisposition, and is further fine-tuned by domain-specific experience."
I love the question and the general strategy behind this study; indeed, we have long discussed doing something much like this in my lab, and we presented a preliminary result of this kind at VSS years ago (https://jov.arvojournals.org/article.aspx?articleid=2433862). It is a great use of deep nets to ask what kinds of structures can in principle arise with different kinds of training diets. Xu et al are also to be congratulated for the huge effort they went to in curating a data set of stimuli with no faces, for which, as they correctly note, no current algorithm is adequate, requiring a huge amount of labor-intensive human effort.
Nonetheless, despite my enthusiasm for the question, the general logic of the study, and the major effort to create the training set, I do have a few significant concerns about the paper:
The biggest problem in the paper, in my view, is that although regular Alexnet saw faces in the training set, it was not trained on face discrimination, and its performance on this task is very low (66%). That is above chance but much lower than a network that is actually trained on face discrimination. In our studies, which are typical of this literature, we find that when Alexnet is trained on the VGG-Face dataset, identification of novel faces is around 85% correct (top-1). So to say that the face-deprived network performed no differently from the face-experienced network on a face discrimination task, while true, is misleading, because really this reflects the fact that neither was trained on face discrimination and both do pretty badly. Perhaps more importantly, for familiar faces, typical human recognition accuracy is far higher than 66% correct. So the face-deprived network does very badly compared to a real face-trained network, or to humans, and does not represent a strong case of preserved face discrimination despite lack of face experience. Instead, it reflects the kind of face recognition performance one would expect from an object recognition system or a prosopagnosic patient: above chance but not very accurate. Thus, I think the behavioral data show not preservation of face perception abilities in a network trained without faces, but low performance at face discrimination, much like a network that has seen faces but not been trained to discriminate them.
The claim that "face-selective channels already emerged in the d-AlexNet" is similarly overstated in my view, given that only two such units were found and the selectivity of the one we are shown (on the right in Figure 2a) is weak. Although the authors concede that the selectivity of these two units is lower than found in Alexnet trained with faces, that understates the case, as Figure 2a shows. The analysis in Figure 2b, correlating responses of face-selective channels from Alexnet to natural movies, with brain responses to the same movies, is interesting but doesn't tell us what we most need to know. Several public data sets include the magnitude of response of FFA and OFA to a set of 50-100 images, and I would find it more useful to compare those to the response of Alexnet face units to the same images.
A small point: only human and non-human primate faces were removed from the dataset, but I would think other animal faces (e.g., cats and dogs) should provide some relevant training. Certainly face-selective regions in the human brain respond strongly to animal faces, as several studies have shown. This might be worth considering in the discussion when potential reasons for the emergence of face-selective channels are discussed (lines 229-236).
For the reasons above, I don't think the results of this study strongly support the conclusion that "the visual experience of faces was not necessary for an intelligent system to develop a face-selective module". At least the "face-specific module" so claimed is a far cry from the human face processing system in both neurally measured selectivity and behavioral performance.
##Preprint Review
This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript. Thomas Serre served as the Reviewing Editor.
###Summary:
In general, the reviewers and I agreed that the study had strengths, including the question being asked and the general strategy used. We also thought it was a great use of deep nets to ask what kinds of structures can in principle arise from different kinds of visual training diets. The authors should also be commended for the huge effort that went into curating ImageNet to remove images containing faces, which required a great deal of labor-intensive human effort.
At the same time, as you will see, the reviewers found a number of shortcomings in your study. Most of them could be addressed with (a lot of) additional work but, unfortunately, one issue raised seems impossible to convincingly address. Specifically, the accuracy of both the face-deprived network and the control network for face discrimination is far below that of both comparable networks specifically trained for face discrimination and most likely human observers (although this was not tested). Hence, the study does not represent a strong case of preserved face discrimination despite lack of face experience. To paraphrase the reviewer: "Instead, it reflects the kind of face recognition performance one would expect from an object recognition system or a prosopagnosic patient: above chance but not very accurate. Thus, I think the behavioral data show not preservation of face perception abilities in a network trained without faces, but low performance at face discrimination, much like a network that has seen faces but not been trained to discriminate them."