Emergence of brain-like mirror-symmetric viewpoint tuning in convolutional neural networks
Curation statements for this article:
Curated by eLife
eLife assessment
This computational study is a valuable empirical investigation into a trait shared by neurons in brains and in artificial neural networks: responding similarly to objects and to their mirror images. It focuses on uncovering the conditions that lead to mirror symmetry in visual networks, and the evidence convincingly demonstrates that learning expands mirror-symmetry tuning when such symmetry is present in the data. The paper also addresses the transformation observed across face patches in the primate visual hierarchy, from view specificity to mirror symmetry to view invariance, and it empirically analyzes the factors behind similar effects in two network architectures. Its key claims are that these invariances emerge in architectures with spatial pooling, driven by learning to discriminate bilaterally symmetric categories, and, importantly, that the effects extend beyond faces, suggesting broader relevance. Despite strong experiments, some interpretations lack explicit support, and the paper overlooks the pre-training emergence of mirror symmetry.
This article has been reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Primates can recognize objects despite 3D geometric variations such as rotations in depth. The computational mechanisms that give rise to such invariances are yet to be fully understood. A curious case of partial invariance occurs in the macaque face-patch AL and in fully connected layers of deep convolutional networks, in which neurons respond similarly to mirror-symmetric views (e.g., left and right profiles). Why does this tuning develop? Here, we propose a simple learning-driven explanation for mirror-symmetric viewpoint tuning. We show that mirror-symmetric viewpoint tuning for faces emerges in the fully connected layers of convolutional deep neural networks trained on object recognition tasks, even when the training dataset does not include faces. First, using 3D objects rendered from multiple views as test stimuli, we demonstrate that mirror-symmetric viewpoint tuning in convolutional neural network models is not unique to faces: it emerges for multiple object categories with bilateral symmetry. Second, we show why this invariance emerges in the models. Learning to discriminate among bilaterally symmetric object categories induces reflection-equivariant intermediate representations. AL-like mirror-symmetric tuning is achieved when such equivariant responses are spatially pooled by downstream units with sufficiently large receptive fields. These results explain how mirror-symmetric viewpoint tuning can emerge in neural networks, providing a theory of how it might emerge in the primate brain. Our theory predicts that mirror-symmetric viewpoint tuning can emerge as a consequence of exposure to bilaterally symmetric objects beyond the category of faces, and that it can generalize beyond previously experienced object categories.
Article activity feed
Author Response
eLife assessment
This computational study is a valuable empirical investigation into a trait shared by neurons in brains and in artificial neural networks: responding similarly to objects and to their mirror images. It focuses on uncovering the conditions that lead to mirror symmetry in visual networks, and the evidence convincingly demonstrates that learning expands mirror-symmetry tuning when such symmetry is present in the data. The paper also addresses the transformation observed across face patches in the primate visual hierarchy, from view specificity to mirror symmetry to view invariance, and it empirically analyzes the factors behind similar effects in two network architectures. Its key claims are that these invariances emerge in architectures with spatial pooling, driven by learning to discriminate bilaterally symmetric categories, and, importantly, that the effects extend beyond faces, suggesting broader relevance. Despite strong experiments, some interpretations lack explicit support, and the paper overlooks the pre-training emergence of mirror symmetry.
As detailed above, we have now analyzed several convolutional architectures and made a direct link between the artificial neural networks and neuronal data to further support our claims (refer to Figures 6 and S10-S13).
To address the concern about pre-training emergence of mirror symmetry, we conducted a new analysis inspecting unit-level response profiles, following Baek and colleagues (2021). This analysis is described in detail below (response to R3). In brief, we found that the first fully connected layer in trained networks exhibits twice the number of mirror-symmetric units found before training. In addition to our population-level observations (Fig. S2) and explicit training-dataset manipulations (Fig. 4), this finding supports the interpretation of training to discriminate among mirror-symmetric object categories as a major factor behind the emergence of mirror-symmetric viewpoint tuning.
Reviewer 1 (Public Review):
By using deep convolutional neural networks (CNNs) as models of the visual system, this study aims to understand and explain the emergence of mirror-symmetric viewpoint tuning in the brain.
Major strengths of the methods and results:
- The paper presents comprehensive, insightful, and detailed analyses investigating how mirror-symmetric viewpoint tuning emerges in artificial neural networks, providing significant and novel insights into this complex process.
- The authors analyze reflection equivariance and invariance in both trained and untrained CNNs’ convolutional layers. This elucidates how object categorization training gives rise to mirror-symmetric invariance in the fully-connected layers.
- By training CNNs on small datasets of numbers and a small object set excluding faces, the authors demonstrate mirror-symmetric tuning’s potential to generalize to untrained categories and the necessity of view-invariant category training for its emergence.
- A further analysis probes the contribution of local versus global features to mirror-symmetric units in the first fully-connected layer of a network. This innovative analysis convincingly shows that local features alone suffice for the emergence of mirror-symmetric tuning in networks.
- The results make a clear prediction that mirror-symmetric tuning should also emerge for other bilaterally symmetric categories, opening avenues for future neural studies.
We are grateful for your insightful feedback and the positive evaluation of our study on mirror-symmetric viewpoint tuning in neural networks. Your constructive comments considerably improved the manuscript. We eagerly look forward to exploring the future research avenues you have highlighted.
Major weaknesses of the methods and results:
Point 1.1) The authors propose a mirror-symmetric viewpoint tuning index which, although innovative, complicates comparison with previous work; this choice is not well motivated. This index is based on correlating representational dissimilarity matrices (RDMs) with their flipped versions, a method differing from previous approaches.
We have revised the Methods section to clarify the motivation for the mirror-symmetric viewpoint tuning index we introduced.
Manuscript changes:
Previous work quantified mirror symmetry in RDMs by comparing neural RDMs to an idealized mirror-symmetric RDM (see Fig. 3c-iii in [14]). Although highly interpretable, such an idealized RDM encompasses implicit assumptions about representational geometry that are unrelated to mirror symmetry. For example, consider a neural RDM that reflects perfect mirror-symmetric viewpoint tuning and in which, for each view, the distances among all of the exemplars are equal. Such a neural RDM would fit an idealized mirror-symmetric RDM better than a neural RDM reflecting perfect mirror-symmetric viewpoint tuning but with non-equidistant exemplars. In contrast, the measure proposed in Eq. 2 equals 1.0 in both cases.
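For concreteness, a minimal sketch of the index's core idea follows, assuming a single views-by-views RDM with views ordered from one profile to the other; the authoritative definition, including the handling of exemplars and diagonal entries, is Eq. 2 of the manuscript, and the function name here is ours.

```python
import numpy as np

def mirror_symmetry_index(rdm):
    """Correlate a views-by-views RDM with a copy whose view order is
    reversed along one axis. Under perfect mirror-symmetric viewpoint
    tuning, rdm[i, j] == rdm[n-1-i, j], so the index equals 1.0."""
    flipped = rdm[::-1, :]  # reverse the view ordering along the rows
    return np.corrcoef(rdm.ravel(), flipped.ravel())[0, 1]
```

Unlike a comparison to an idealized RDM, this measure makes no assumption about the distances among exemplars within a view.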
Point 1.2) Faces exhibit unique behavior in terms of the progression of mirror-symmetric viewpoint tuning and their training task and dataset dependency. Given that mirror-symmetric tuning has been identified in the brain for faces, it would be beneficial to discuss this observation and provide potential explanations.
We revised the caption of Figure S1 to explicitly address this point:
Manuscript changes:
For face stimuli, there is a unique progression in mirror-symmetric viewpoint tuning: the index is negative for the convolutional layers and abruptly becomes highly positive at the transition to the first fully connected layer. The negative indices in the convolutional layers can be attributed to the image-space asymmetry of non-frontal faces; compared to other categories, faces demonstrate pronounced front-back asymmetry, which translates to asymmetric images for all but frontal views (Fig. S8). The features that drive the highly positive mirror-symmetric viewpoint tuning for faces in the fully connected layers are training-dependent (Fig. S2) and hence may reflect asymmetric image features that do not elicit equivariant maps in low-level representations; for example, consider a profile view of a nose. Note that cars and boats elicit high mirror-symmetric viewpoint tuning indices already in early processing layers. This early mirror-symmetric tuning is independent of training (Fig. S2) and hence may be driven by low-level features. Both of these object categories show pronounced quadrilateral symmetry, which translates to symmetric images for both frontal and side views (Fig. S8).
Point 1.3) Previous work reported critical differences between CNNs and neural representations in area AL, indicating that mirror-symmetric viewpoint tuning is less present than view invariance in CNNs compared to area AL. While such findings could potentially limit the usefulness of CNNs as models for mirror-symmetric viewpoint tuning in the brain, they are not addressed in the study.
This point is now addressed explicitly in the caption of Figure S9:
Manuscript changes:
Yildirim and colleagues [14] reported that CNNs trained on faces, notably VGGFace, exhibited lower mirror-symmetric viewpoint tuning compared to neural representations in area AL. Consistent with their findings, our results demonstrate that VGGFace, trained on face identification, has a low mirror-symmetric viewpoint tuning index. This is especially notable in comparison to ImageNet-trained models such as VGG16. This difference between VGG16 and VGGFace can be attributed to the distinct characteristics of their training datasets and objective functions. The VGGFace training task consists of mapping frontal face images to identities; this task may exclusively emphasize higher-level physiognomic information. In contrast, training to recognize objects in natural images may result in a more detailed, view-dependent representation. To test this potential explanation, we measured the average correlation distance between the fc6 representations of different views of the same face exemplar in VGGFace and in VGG16 trained on ImageNet. The average correlation distance between views is 0.70±0.04 in VGGFace and 0.93±0.04 in VGG16 trained on ImageNet. Conversely, the correlation distance between different exemplars depicted from the same view is 0.84±0.14 in VGGFace and 0.58±0.06 in VGG16 trained on ImageNet. Therefore, as suggested by Yildirim and colleagues, training on face identification alone may result in representations that cannot explain intermediate levels of face processing.
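As an illustration of the measurement described above, the following sketch computes the mean pairwise correlation distance between layer activations; we assume correlation distance means 1 − Pearson r, and the function name is ours.

```python
import numpy as np

def mean_correlation_distance(acts):
    """Mean pairwise correlation distance (1 - Pearson r) between the
    rows of `acts`, e.g., fc6 activation vectors for several views of
    one face exemplar, shape (n_views, n_units)."""
    d = 1.0 - np.corrcoef(acts)          # pairwise correlation distances
    i, j = np.triu_indices_from(d, k=1)  # unique pairs, excluding self-pairs
    return d[i, j].mean()
```

Passing several views of one exemplar gives the across-view distance; passing different exemplars rendered from one view gives the converse across-exemplar distance.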
Point 1.4) The study’s results, while informative, are qualitative rather than quantitative, and lack direct comparison with neural data. This obscures the implications for neural mechanisms and their relevance to the broader field.
We addressed this point by conducting a quantitative comparison between the architectures of various networks and neural response patterns in monkey face patches (see Figures 6 and S10-S13 above).
Point 1.5) The study provides compelling evidence that learning to discriminate bilaterally symmetric objects (beyond faces) induces mirror-symmetric viewpoint tuning in the networks, qualitatively similar to the brain. Moreover, the results suggest that this tuning can, in principle, generalize beyond previously trained object categories. Overall, the study provides important conclusions regarding the emergence of mirror-symmetric viewpoint tuning in networks, and potentially the brain. However, the conducted analyses and results do not entirely address the question why mirror-symmetric viewpoint tuning emerges in networks or the brain. Specifically, the results leave open whether mirror-symmetric viewpoint tuning is indeed necessary to achieve view invariance for bilaterally symmetric objects.
We believe that mirror-symmetric viewpoint tuning is not strictly necessary for achieving view-invariance. However, it is a plausible path from view-dependence to view invariance. We addressed this point in the updated limitations subsection of the discussion.
Manuscript changes:
A second consequence of the simulation-based nature of this study is that our findings only establish that mirror-symmetric viewpoint tuning is a viable computational means for achieving view invariance; they do not prove it to be a necessary condition. In fact, previous modeling studies [10, 19, 61] have demonstrated that a direct transition from view-specific processing to view invariance is possible. However, in practice, we observe that both CNNs and the face-patch network adopt solutions that include intermediate representations with mirror-symmetric viewpoint tuning.
Taken together, this study moves us a step closer to uncovering the origins of mirror-symmetric tuning in networks, and has implications for more comprehensive investigations into this neural phenomenon in the brain. The methods of probing CNNs are innovative and could be applied to other questions in the field. This work will be of broad interest to cognitive neuroscientists, psychologists, and computer scientists.
We appreciate your acknowledgment of our study’s contribution to understanding mirror-symmetric tuning in networks and its wider implications in the field.
Reviewer 2 (Public Review):
Strengths
- The statements made in the paper are precise, separating observations from inferences, with claims that are well supported by empirical evidence. Releasing the underlying code repository further bolsters the credibility and reproducibility. I especially appreciate the detailed discussion of limitations and future work.
- The main claims with respect to the two convolutional architectures are well supported by thorough analyses. The analyses are well-chosen and overall include good controls, such as changes in the training diet. Going beyond "passive" empirical tests, the paper makes use of the fully accessible nature of computational models and includes more "causal" insertion and deletion tests that support the necessity and sufficiency of local object features.
- Based on modeling results, the paper makes a testable prediction: that mirror-symmetric viewpoint tuning is not specific to faces and can also be observed in other bilaterally symmetric objects such as cars and chairs. To test this experimentally in primates (and potentially other model architectures), the stimulus set is available online.
We express our gratitude for your constructive feedback. Your acknowledgment of the clarity of our statements and the robustness of our empirical evidence is greatly appreciated. We are also thankful for your recognition of our comprehensive analyses and the testable predictions arising from our work.
Point 2.1) Weaknesses
My main concern with this paper is in its choice of the two model architectures AlexNet and VGG. In an earlier study, Yildirim et al. (2020) found an inverse graphics network "EIG" to better correspond to neural and behavioral data for face processing than VGG. All claims in the paper thus relate to a weaker model of the biological effects since this work does not analyze the EIG model. Since EIG follows an analysis-by-synthesis approach rather than standard classification training, it is unclear whether the claims in this paper generalize to this other model architecture. It is also unclear if the claims will hold for: 1) transformer architectures, 2) the HMAX architecture by Leibo et al. (2017) which has also been proposed as a computational explanation for mirror-symmetric tuning, and, as the authors note in the Discussion, 3) deeper architectures such as ResNet-50 which tend to better align to neural and behavioral data in general. These architectures include different computational motifs such as skip connections and a much smaller proportion of fully-connected layers which are a major focus of this work.
Overall, I thus view the paper’s claims as limited to AlexNet- and VGG-like architectures, both of which fall behind state-of-the-art in their alignment to primates in general and also specifically for mirror-symmetric viewpoint tuning.
We understand your concern regarding the choice of AlexNet and VGG architectures. The decision to focus on these models was driven by the need for a straightforward macroscopic correspondence between the layer structure of the artificial networks and the ventral visual stream. However, acknowledging this potential limitation of generality, we have expanded our analysis to include the EIG model, a transformer architecture, the HMAX model, and deeper convolutional architectures like ResNet-50 and ConvNeXt. Our revised analysis, detailed in Figures S1, S9, and S10-S13, incorporates these additional models and offers a comprehensive evaluation of their brain alignment and mirror-symmetric viewpoint tuning. We found that while the architectures indeed vary in their computational motifs, the emergence of mirror-symmetric viewpoint tuning is not exclusive to AlexNet and VGG. It occurs for every CNN we tested, exactly at the stage where equivariant feature maps are pooled globally. We believe that the new analyses extend the generality of our findings and remove the concern that our claims apply only to older, shallower networks.
For details, please refer to Point 1 in the ’Essential Revisions’ section.
Point 2.2: Minor weaknesses
- Figure 1A: since the relevance to primate brains is a major motivator of this work, the results from actual neural recordings should be shown and not just schematics. For instance, the mirror symmetry in AL is not as clean as the illustration (compare with Fig. 3 in Yildirim et al. 2020), and in the paper’s current form, this is not easily accessible to the reader.
Thank you for your feedback regarding the presentation of neural recordings in Figure 1A. We have updated Figure 1A to include actual neural RDMs instead of the previous schematic representations.
Point 2.3) Figure 4 / L832-845: The claims for the effect of training on mirror-symmetric viewpoint tuning are with respect to the training data only, but there are other differences between the models such as the number of epochs (250 for CIFAR-10 training, 200 for all other datasets), the learning rate (2.5 * 10^-4 for CIFAR-10, 10^-4 for all others), the batch size (128 vs 64), etc. I do not expect these choices to make a major difference for your claims, but it would be much cleaner to keep everything but the training dataset consistent. Especially the different test accuracies worry me a bit (from 81% to 92%, and they appear different from the accuracy numbers in figure S4, e.g., for CIFAR-10 and asymSVHN); at the very least those should be comparable.
We addressed this point by retraining the models while holding most of the hyperparameters constant. Specifically, we standardized the number of epochs, batch size, and weight decay. The remaining differences are necessitated by the characteristics of the specific training image sets used (natural images versus digits). Please note that we do not directly contrast models trained on CIFAR-10 and SVHN; the controlled comparisons are conducted while holding the SVHN training images constant, and are not confounded by hyperparameter choice.
Manuscript changes:
The networks' weights and biases were initialized randomly using the uniform He initialization [70]. We trained the models for 250 epochs with a batch size of 256 images. The CIFAR-10 network was trained using the stochastic gradient descent (SGD) optimizer, starting with a learning rate of 10^-3 and momentum of 0.9. The learning rate was halved every 20 epochs. The SVHN/symSVHN/asymSVHN networks were trained using the Adam optimizer. The initial learning rate was set to 10^-5 and reduced by half every 50 epochs. The hyperparameters were determined using the validation data. The models reached around 83% test accuracy (CIFAR-10: 81%, SVHN: 89%, symSVHN: 83%, asymSVHN: 80%). Fig. S4 shows the models' learning curves.
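For illustration, the stated schedule can be summarized in a short PyTorch sketch; this is our paraphrase, and the released code repository remains the authoritative reference.

```python
import torch.nn as nn
import torch.optim as optim

def make_optimizer(model: nn.Module, dataset: str):
    """Optimizer and learning-rate schedule described above. CIFAR-10:
    SGD (lr 1e-3, momentum 0.9) halved every 20 epochs; SVHN variants:
    Adam (lr 1e-5) halved every 50 epochs. Training used 250 epochs and
    a batch size of 256 with He-uniform initialization."""
    if dataset == "CIFAR-10":
        opt = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
        sched = optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)
    else:  # SVHN / symSVHN / asymSVHN
        opt = optim.Adam(model.parameters(), lr=1e-5)
        sched = optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.5)
    return opt, sched
```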
Point 2.4) L681-685: The general statement made in the paper that "deeper models lose their advantage as models of cortical representations" is not supported by the cited limited comparison on a single dataset. There are many potential confounds here with respect to prior work, e.g. the recording modality (fMRI vs electrodes), the stimulus set (62 images vs thousands), the models that were tested (9 vs hundreds), etc.
We agree that the recording modality and stimulus set may play a critical role in determining model ranking. Since we generalized the analyses to deeper models, we removed this statement from the paper. While we still believe that shallower networks may prove to be better models of the visual cortex, this empirical question is beyond the scope of the current manuscript.
Reviewer 3 (Public Review):
This study aimed to explore the computational mechanisms of view invariance, driven by the observation that in some regions of monkey visual cortex, neurons show comparable responses to (1) a given face and (2) to the same face but horizontally flipped. Here they study this known phenomenon using AlexNet and other shallow neural networks, using an index for mirror symmetric viewpoint tuning based on representational similarity analyses. They find that this tuning is enhanced at fully connected- or global pooling layers (layers which combine spatial information), and that the invariance is prominent for horizontal- but not vertical- or rotational transformations. The study shows that mirror tuning can be learned when a given set of images are flipped horizontally and given the same label, but not if they are flipped and given different labels. They also show that networks learn this tuning by focusing on local features, not global configurations.
We are grateful for your thorough reading, reflected by the comprehensive summary of our study and its main findings.
Point 3.1) I found the study to be a mixed read. Some analyses were fascinating: for example, it was satisfying to see the use of well-controlled datasets to increase or decrease the rate of mirror-symmetry tuning. The insertion and deletion experiments were elegant tests to probe the mechanisms of mirror symmetry, asking if symmetry could arise from (1) global feature configurations (in a holistic sense) vs. (2) local features, with stronger evidence for the latter. These two sets of results were successful and interpretable. They stand in contrast with the first analysis, which relies on observations that do not seem justified. Specifically, Figure 2D shows mirror-symmetry tuning across 11 stages of image processing, from pixel space to fully connected layers. It shows that images from different object categories evoke considerably different tuning index values. The explanation for this result is that some categories, such as "tools," have "bilaterally symmetric structure," but this is not explicitly measured anywhere. "Boats" are described as having "front-back symmetry," more so than flowers. One imagines flowers being extremely symmetric, but perhaps that depends on the metric. What is the metric? At first I thought it was the mirror-symmetric viewpoint tuning index in the image (pixel) space, but this cannot be, as the index for faces and flowers is negative, cars have no symmetry, and boats are positive. To support these descriptions, one must have an independent variable (for object class symmetry) that can be related to the dependent variable (the mirror-symmetric viewpoint tuning index). If it exists, it is not a part of the Results section. This omission undermines other parts of the Results section: "some car models have an approximate front-back symmetry...however, a flower typically does not..." "Some," "typically": how many in the dataset exactly, and how often?
We thank you for your insightful observation. You are correct that we did not refer to pixel-space symmetry; our descriptions relate to the 3D structure of the objects used in the study.
Following this comment, we objectively quantified the symmetry planes of the 3D objects. Unfortunately, we do not have direct access to the proprietary 3D meshes of these objects, only to their renders. Therefore, we devised measures that assess the symmetry of the 3D objects through the symmetry they elicit in the different 2D renders.
This analysis is described in the new supplemental figure S8. We believe that these measurements support the qualitative claims we made in the previous draft.
Point 3.2) The description of CIFAR-10 as having bilaterally symmetric categories - are all these categories equally symmetric? If not, would such variability matter in terms of these results?
When considering their 3D structure, all ten CIFAR-10 categories exhibit pronounced left-right symmetry. These categories encompass vertebrate animals (birds, cats, deer, dogs, frogs, horses); they also include man-made vehicles (airplanes, cars, ships, and trucks), which, at least externally, are nearly perfectly symmetric by design. It is important to note that this symmetry pertains to the photographed 3D objects, rather than the images themselves, which could be highly asymmetric. Other axes of symmetry (e.g., front-back) in CIFAR-10 cannot be measured without 3D representations of the objects.
Point 3.3) These assessments of object category symmetry values are made before experiments are presented, so they are not interpretations of the results, and it would be circular to write it otherwise.
We have changed the order so that the explanations follow the experimental results. This includes the relevant main text paragraph, as well as the relevant figure—both the order of panels and the phrasing of the figure caption.
Point 3.4) Overall, my bigger concern is that the framing is misleading or at best incomplete. The manuscript successfully showed that if one introduces left-right symmetry to a dataset, the network will develop population-level representations that are also bilaterally symmetric. But the study does not explain that the model's architecture and random weight distribution are sufficient for symmetry tuning to emerge, without training, just to a much more limited degree. Baek et al. showed in 2021 that viewpoint-invariant face-selective units and mirror-symmetric units emerge in untrained networks ("Face detection in untrained deep neural networks"; this current manuscript cites this paper but does not mention that mirror symmetry is a feature of the 2021 study). This current study also used untrained networks as controls (Fig. 3), and while they were useful in showing that learning boosts symmetry tuning, the results also clearly show that horizontal-reflection invariance is far from zero. So, the simple learning-driven explanation for the mirror-symmetric viewpoint tuning for faces is wrong: while (1) network training and (2) pooling are mechanisms that charge the development of mirror-symmetric tuning, the lottery ticket hypothesis is enough for its emergence. Faces and numbers are simple patterns, so the overparameterization of networks is enough to randomly create units that are tuned to these shapes and to wire many of them together. How learning shapes this process is an interesting direction, especially now that this current study has outlined its importance.
We agree with the reviewer that random initialization may result in units that show mirror-symmetric viewpoint tuning for faces in the absence of training. In the revised manuscript, we quantify in detail the occurrence of such units, first reported by Baek et al., and discuss the relation between Baek et al., 2021 and our work. In brief, our analysis affirms that units with mirror-symmetric viewpoint tuning for faces appear even in untrained CNNs, although we believe their rate is lower than previously reported. Regardless of the question of the exact proportion of such units, we believe it is unequivocal that at the population level, mirror-symmetric viewpoint tuning to faces (and other objects with a single plane of symmetry) is strongly training-dependent.
First, we refer the reviewer to Figure S2, which directly demonstrates the effect of training on the population-level mirror-symmetric viewpoint tuning:
Note the non-mirror-symmetric, reflection-invariant tuning profile for faces in the untrained network.
Second, the above-zero horizontal reflection invariance referred to by the reviewer (Figure 3) is distinct from mirror-symmetric viewpoint tuning; the latter requires both reflection invariance and viewpoint tuning. More importantly, it was measured with respect to all of the object categories grouped together, including objects with quadrilateral symmetry, which elicit mirror-symmetric viewpoint tuning even in shallow layers and without training. To clarify the confusion that this grouping might have caused, we repeated the measurement of invariance in fc6, separately for each 3D object category:
Disentangling the contributions of different categories to the reflection-invariance measurements, this analysis underscores the necessity of training for the emergence of mirror-symmetric viewpoint tuning.
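To make the distinction concrete, reflection invariance can be scored per image as below; `layer_fn` is a hypothetical helper mapping a batch of images to activation vectors, and this sketch is illustrative rather than our exact analysis pipeline.

```python
import numpy as np

def reflection_invariance(layer_fn, images):
    """Per-image correlation between a layer's response to a grayscale
    image, shape (n, H, W), and to its left-right mirror. A high score
    alone does not imply mirror-symmetric viewpoint tuning, which also
    requires that responses vary across views."""
    mirrored = images[:, :, ::-1]  # flip the width axis
    a, b = layer_fn(images), layer_fn(mirrored)
    return np.array([np.corrcoef(x, y)[0, 1] for x, y in zip(a, b)])
```

Averaging these scores within each 3D object category disentangles the category-specific contributions, as in the analysis above.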
Last, we refer the reviewer to Figure S5, which shows that the symmetry of untrained convolutional filters has a narrow, zero-centered distribution. Indeed, the upper tail of this distribution includes filters with a certain degree of symmetry. This level of symmetry, however, becomes the lower bound of the filters' symmetry distribution after training.
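One plausible formalization of filter symmetry, for illustration (the measure used for Fig. S5 is defined in the manuscript's methods):

```python
import numpy as np

def kernel_lr_symmetry(weights):
    """Left-right symmetry of each 2D convolution kernel: Pearson
    correlation between the kernel and its horizontally flipped copy.
    weights: array of shape (out_ch, in_ch, kH, kW); returns one score
    per 2D kernel."""
    kernels = weights.reshape(-1, weights.shape[-2], weights.shape[-1])
    return np.array([np.corrcoef(k.ravel(), k[:, ::-1].ravel())[0, 1]
                     for k in kernels])
```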
Therefore, we believe that training induces a shift in the tuning of the unit population that is qualitatively distinct from, and not explained by, random, lottery-related mirror-symmetric viewpoint-tuned units. In the revised manuscript, we clarify the distinction between mirror-symmetric viewpoint tuning at the population level and the existence of individual units showing pre-training mirror-symmetric viewpoint tuning, as shown by Baek et al.
Manuscript changes: (Discussion section)
Our claim that mirror-symmetric viewpoint tuning is learning-dependent may seem to be in conflict with findings by Baek and colleagues [17]. Their work demonstrated that units with mirror-symmetric viewpoint tuning profiles can emerge in randomly initialized networks. Reproducing Baek and colleagues' analysis, we confirmed that such units occur in untrained networks (Fig. S15). However, we also identified that the original criterion for mirror-symmetric viewpoint tuning employed in [17] was satisfied by many units with asymmetric tuning profiles (Figs. S14 and S15). Once we applied a stricter criterion, we observed a more than twofold increase in mirror-symmetric units in the first fully connected layer of a trained network compared to untrained networks of the same architecture (Fig. S16). This finding highlights the critical role of training in the emergence of mirror-symmetric viewpoint tuning in neural networks, also at the level of individual units.
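To illustrate the kind of stricter unit-level criterion referred to above, a sketch follows; the thresholds and names are illustrative placeholders, not the values used in Figs. S14-S16.

```python
import numpy as np

def is_mirror_symmetric_unit(tuning, sym_thresh=0.9, mod_thresh=0.1):
    """tuning: a unit's mean response across ordered views (e.g., -90 to
    +90 deg). Requires both (1) agreement with the reversed tuning curve
    and (2) genuine viewpoint selectivity, i.e., a non-flat curve."""
    symmetric = np.corrcoef(tuning, tuning[::-1])[0, 1] > sym_thresh
    tuned = np.std(tuning) > mod_thresh * (np.abs(tuning).mean() + 1e-12)
    return bool(symmetric and tuned)
```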
Point 3.5) Finally, it would help to cite other previous demonstrations of equivariance and mirror symmetry in neural networks. Chris Olah, Nick Cammarata, Chelsea Voss, Ludwig Schubert, and Gabriel Goh of OpenAI wrote of this phenomenon in 2020 (Distill journal).
We added a reference to the study by Olah and colleagues (2020).
Manuscript changes: (Discussion section)
(see Olah and colleagues (2020) [60] for an exploration of emergent equivariance using activation maximization).
Point 3.6) Some other observations that might help:
I am enthusiastic about the experiments using different datasets to increase or decrease the rate of mirror-symmetry tuning (sets including CIFAR-10, SVHN, symSVHN, asymSVHN); it is worth noting, however, that the lack of a ground-truth metric for category symmetry is a problem here too. In the asymSVHN dataset, images are flipped and given different labels. If some categories are naturally symmetric after horizontal flips, such as images containing "0" or "8", then changing the label is likely to disturb training. This would explain why the training loss is larger for this condition (Figure S4D).
We now acknowledge that the inclusion of digits 0 and 8 reduces the accuracy of asymSVHN:
Manuscript changes: (Figure S4 caption)
Note that the accuracy of asymSVHN might be negatively affected by the inclusion of relatively symmetric categories such as 0 and 8.
Our rationale for retaining these digits in the dataset was to manipulate the symmetry of the learned categories (compared to symSVHN) while keeping the images themselves constant.
Regarding the ground-truth symmetry of these datasets: for CIFAR-10, the relevant measure of symmetry pertains to the 3D structure of the photographed objects, which we believe is unequivocally symmetric (see Point 3.2). Note that 2D, pixel-space image symmetry is not directly indicative of symmetry in 3D.
For SVHN, which consists of two-dimensional characters, the pixel-space symmetry of the images indeed reflects the objects' symmetry. However, because some readers might confuse our claims about the symmetry of objects with claims (which we did not make) about the symmetry of 2D images, we prefer to avoid reporting measurements of image-space symmetry. We believe that our interpretation of the experiments with SVHN/symSVHN/asymSVHN holds even in the absence of such measurements.
For your reference, we include here a quantification of image-space horizontal symmetry for each category of CIFAR-10 and SVHN:
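One plausible image-space score of this kind is sketched below; the exact metric behind our quantification may differ, and the function name is ours.

```python
import numpy as np

def image_lr_symmetry(img):
    """Horizontal image-space symmetry: Pearson correlation between a
    grayscale image, shape (H, W), and its left-right mirror. A
    category-level score is the mean over that category's images."""
    return np.corrcoef(img.ravel(), img[:, ::-1].ravel())[0, 1]
```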
Point 3.7) It is puzzling why greyscale 3D-rendered images are used. By using greyscale 3D renders (at least as shown in the figures), the study proceeds as if the units are invariant under color transformations. Unfortunately, this is not true, and using greyscale images impacts the activations of different layers of AlexNet in a way that is not fully defined. Moreover, many units in shallow networks focus on color, and exactly these units could be invariant to other transformations like mirror symmetry, but grey-scaling the images makes them inactive.
We use grayscale 3D rendered images to align with the setting in other studies investigating mirror- symmetric viewpoint tuning, including Freiwald et al. (2010), Leibo et al. (2017), and Yildirim et al. (2020). The choice of using grayscale images in these studies is motivated by the need to dissociate face-processing from lower-level, hue-specific responses.
eLife assessment
This computational study is a valuable empirical investigation into a trait shared by neurons in brains and in artificial neural networks: responding similarly to objects and to their mirror images. It focuses on uncovering the conditions that lead to mirror symmetry in visual networks, and the evidence convincingly demonstrates that learning expands mirror-symmetry tuning when such symmetry is present in the data. The paper also addresses the transformation observed across face patches in the primate visual hierarchy, from view specificity to mirror symmetry to view invariance, and it empirically analyzes the factors behind similar effects in two network architectures. Its key claims are that these invariances emerge in architectures with spatial pooling, driven by learning to discriminate bilaterally symmetric categories, and, importantly, that the effects extend beyond faces, suggesting broader relevance. Despite strong experiments, some interpretations lack explicit support, and the paper overlooks the pre-training emergence of mirror symmetry.
Reviewer #1 (Public Review):
By using deep convolutional neural networks (CNNs) as models of the visual system, this study aims to understand and explain the emergence of mirror-symmetric viewpoint tuning in the brain.
Major strengths of the methods and results:
(1) The paper presents comprehensive, insightful, and detailed analyses investigating how mirror-symmetric viewpoint tuning emerges in artificial neural networks, providing significant and novel insights into this complex process.
(2) The authors analyze reflection equivariance and invariance in both trained and untrained CNNs' convolutional layers. This elucidates how object categorization training gives rise to mirror-symmetric invariance in the fully-connected layers.
(3) By training CNNs on small datasets of numbers and a small object set excluding faces, the authors demonstrate mirror-symmetric tuning's potential to generalize to untrained categories and the necessity of view-invariant category training for its emergence.
(4) A further analysis probes the contribution of local versus global features to mirror-symmetric units in the first fully-connected layer of a network. This innovative analysis convincingly shows that local features alone suffice for the emergence of mirror-symmetric tuning in networks.
(5) The results make a clear prediction that mirror-symmetric tuning should also emerge for other bilaterally symmetric categories, opening avenues for future neural studies.

Major weaknesses of the methods and results:
(1) The authors propose a mirror-symmetric viewpoint tuning index which, although innovative, complicates comparison with previous work; this choice is not well motivated. This index is based on correlating representational dissimilarity matrices (RDMs) with their flipped versions, a method differing from previous approaches.
(2) Faces exhibit unique behavior in terms of the progression of mirror-symmetric viewpoint tuning and their training task and dataset dependency. Given that mirror-symmetric tuning has been identified in the brain for faces, it would be beneficial to discuss this observation and provide potential explanations.
(3) Previous work reported critical differences between CNNs and neural representations in area AL indicating that mirror-symmetric viewpoint tuning is less present than view invariance in CNNs compared to area AL. While such findings could potentially limit the usefulness of CNNs as models for mirror-symmetric viewpoint tuning in the brain, they are not addressed in the study.
(4) The study's results, while informative, are qualitative rather than quantitative, and lack direct comparison with neural data. This obscures the implications for neural mechanisms and their relevance to the broader field.

The study provides compelling evidence that learning to discriminate bilaterally symmetric objects (beyond faces) induces mirror-symmetric viewpoint tuning in the networks, qualitatively similar to the brain. Moreover, the results suggest that this tuning can, in principle, generalize beyond previously trained object categories. Overall, the study provides important conclusions regarding the emergence of mirror-symmetric viewpoint tuning in networks, and potentially the brain. However, the conducted analyses and results do not entirely address the question why mirror-symmetric viewpoint tuning emerges in networks or the brain. Specifically, the results leave open whether mirror-symmetric viewpoint tuning is indeed necessary to achieve view invariance for bilaterally symmetric objects.
Taken together, this study moves us a step closer to uncovering the origins of mirror-symmetric tuning in networks, and has implications for more comprehensive investigations into this neural phenomenon in the brain. The methods of probing CNNs are innovative and could be applied to other questions in the field. This work will be of broad interest to cognitive neuroscientists, psychologists, and computer scientists.
Reviewer #2 (Public Review):
Strengths
(1) The statements made in the paper are precise, separating observations from inferences, with claims that are well supported by empirical evidence. Releasing the underlying code repository further bolsters the credibility and reproducibility. I especially appreciate the detailed discussion of limitations and future work.
(2) The main claims with respect to the two convolutional architectures are well supported by thorough analyses. The analyses are well-chosen and overall include good controls, such as changes in the training diet. Going beyond "passive" empirical tests, the paper makes use of the fully accessible nature of computational models and includes more "causal" insertion and deletion tests that support the necessity and sufficiency of local object features.
(3) Based on modeling results, the paper makes a testable prediction: that mirror-symmetric viewpoint tuning is not specific to faces and can also be observed in other bilaterally symmetric objects such as cars and chairs. To test this experimentally in primates (and potentially other model architectures), the stimulus set is available online.
Weaknesses
My main concern with this paper is in its choice of the two model architectures AlexNet and VGG. In an earlier study, Yildirim et al. (2020) found an inverse graphics network "EIG" to better correspond to neural and behavioral data for face processing than VGG. All claims in the paper thus relate to a weaker model of the biological effects since this work does not analyze the EIG model. Since EIG follows an analysis-by-synthesis approach rather than standard classification training, it is unclear whether the claims in this paper generalize to this other model architecture. It is also unclear if the claims will hold for: 1) transformer architectures, 2) the HMAX architecture by Leibo et al. (2017) which has also been proposed as a computational explanation for mirror-symmetric tuning, and, as the authors note in the Discussion, 3) deeper architectures such as ResNet-50 which tend to better align to neural and behavioral data in general. These architectures include different computational motifs such as skip connections and a much smaller proportion of fully-connected layers which are a major focus of this work.
Overall, I thus view the paper's claims as limited to AlexNet- and VGG-like architectures, both of which fall behind state-of-the-art in their alignment to primates in general and also specifically for mirror-symmetric viewpoint tuning.
Minor weaknesses
(1) Figure 1A: since the relevance to primate brains is a major motivator of this work, the results from actual neural recordings should be shown and not just schematics. For instance, the mirror symmetry in AL is not as clean as the illustration (compare with Fig. 3 in Yildirim et al. 2020), and in the paper's current form, this is not easily accessible to the reader.
(2) Figure 4 / L832-845: The claims for the effect of training on mirror-symmetric viewpoint tuning are with respect to the training data only, but there are other differences between the models such as the number of epochs (250 for CIFAR-10 training, 200 for all other datasets), the learning rate (2.5 * 10^-4 for CIFAR-10, 10^-4 for all others), the batch size (128 vs 64), etc. I do not expect these choices to make a major difference for your claims, but it would be much cleaner to keep everything but the training dataset consistent. Especially the different test accuracies worry me a bit (from 81% to 92%, and they appear different from the accuracy numbers in figure S4 e.g. for CIFAR-10 and asymSVHN), at the very least those should be comparable.
(3) L681-685: The general statement made in the paper that "deeper models lose their advantage as models of cortical representations" is not supported by the cited limited comparison on a single dataset. There are many potential confounds here with respect to prior work, e.g. the recording modality (fMRI vs electrodes), the stimulus set (62 images vs thousands), the models that were tested (9 vs hundreds), etc.
Reviewer #3 (Public Review):
This study aimed to explore the computational mechanisms of view invariance, driven by the observation that in some regions of monkey visual cortex, neurons show comparable responses to (1) a given face and (2) to the same face but horizontally flipped. Here they study this known phenomenon using AlexNet and other shallow neural networks, using an index for mirror symmetric viewpoint tuning based on representational similarity analyses. They find that this tuning is enhanced at fully connected- or global pooling layers (layers which combine spatial information), and that the invariance is prominent for horizontal- but not vertical- or rotational transformations. The study shows that mirror tuning can be learned when a given set of images are flipped horizontally and given the same label, but *not* if they are flipped and given different labels. They also show that networks learn this tuning by focusing on local features, not global configurations.
I found the study to be a mixed read. Some analyses were fascinating: for example, it was satisfying to see the use of well-controlled datasets to increase or decrease the rate of mirror-symmetry tuning. The insertion and deletion experiments were elegant tests to probe the mechanisms of mirror symmetry, asking if symmetry could arise from (1) global feature configurations (in a holistic sense) vs. (2) local features, with stronger evidence for the latter. These two sets of results were successful and interpretable. They stand in contrast with the first analysis, which relies on observations that do not seem justified. Specifically, Figure 2D shows mirror-symmetry tuning across 11 stages of image processing, from pixel space to fully connected layers. It shows that images from different object categories evoke considerably different tuning index values. The explanation for this result is that some categories, such as "tools," have "bilaterally symmetric structure," but this is not explicitly measured anywhere. "Boats" are described as having "front-back symmetry," more so than flowers. One imagines flowers being extremely symmetric, but perhaps that depends on the metric. What is the metric? At first I thought it was the mirror-symmetric viewpoint tuning index in the image (pixel) space, but this cannot be, as the index for faces and flowers is negative, cars have no symmetry, and boats are positive. To support these descriptions, one must have an independent variable (for object class symmetry) that can be related to the dependent variable (the mirror-symmetric viewpoint tuning index). If it exists, it is not a part of the Results section. This omission undermines other parts of the Results section: "some car models have an approximate front-back symmetry...however, a flower typically does not..." "Some," "typically": how many in the dataset exactly, and how often? The description of CIFAR-10 as having bilaterally symmetric categories - are all these categories equally symmetric? If not, would such variability matter in terms of these results? These assessments of object category symmetry values are made before experiments are presented, so they are not interpretations of the results, and it would be circular to write it otherwise.
Overall, my bigger concern is that the framing is misleading or at best incomplete. The manuscript successfully showed that if one introduces left-right symmetry to a dataset, the network will develop population-level representations that are also bilaterally symmetric. But the study does not explain that the model's architecture and random weight distribution are sufficient for symmetry tuning to emerge, without training, just to a much more limited degree. Baek et al. showed in 2021 that viewpoint-invariant face-selective units and mirror-symmetric units emerge in untrained networks ("Face detection in untrained deep neural networks"; this current manuscript cites this paper but does not mention that mirror symmetry is a feature of the 2021 study). This current study also used untrained networks as controls (Fig. 3), and while they were useful in showing that learning boosts symmetry tuning, the results also clearly show that horizontal-reflection invariance is far from zero. So, the simple learning-driven explanation for the mirror-symmetric viewpoint tuning for faces is wrong: while (1) network training and (2) pooling are mechanisms that charge the development of mirror-symmetric tuning, the lottery ticket hypothesis is enough for its emergence. Faces and numbers are simple patterns, so the overparameterization of networks is enough to randomly create units that are tuned to these shapes and to wire many of them together. How learning shapes this process is an interesting direction, especially now that this current study has outlined its importance.
Finally, it would help to cite other previous demonstrations of equivariance and mirror symmetry in neural networks. Chris Olah, Nick Cammarata, Chelsea Voss, Ludwig Schubert, and Gabriel Goh of OpenAI wrote of this phenomenon in 2020 (Distill journal).
Some other observations that might help:
- I am enthusiastic about the experiments using different datasets to increase or decrease the rate of mirror-symmetry tuning (sets including CIFAR10, SVHN, symSVHN, asymSVHN); it is worth noting, however, that the lack of a ground truth metric for category symmetry is a problem here too. In the asymSVHN dataset, images are flipped and given different labels. If some categories are naturally symmetric after horizontal flips, such as images containing "0" or "8", then changing the label is likely to disturb training. This would explain why the training loss is larger for this condition (Figure S4D).
- It is puzzling why greyscale 3D-rendered images are used. By using greyscale 3D renders (at least as shown in the figures), the study proceeds as if the units are invariant under color transformations. Unfortunately, this is not true, and using greyscale images impacts the activations of different layers of AlexNet in a way that is not fully defined. Moreover, many units in shallow networks focus on color, and exactly these units could be invariant to other transformations like mirror symmetry, but grey-scaling the images makes them inactive.