Object representation in a gravitational reference frame

Curation statements for this article:
  • Curated by eLife


    eLife assessment

    In this study the authors show that neural tuning for object orientation in IT is unaffected by whole-body tilt, suggesting that neurons are encoding objects relative to the gravitational vertical. However, these observations could also arise if IT neurons encode object orientation relative to fixed visual contextual cues rather than gravity, or from dynamic, compensatory torsional eye movements made by the animals. With these concerns adequately addressed, this would be an important study showing that IT neurons may play a role not only in object recognition but more broadly in physical scene understanding.

This article has been reviewed by the following groups


Abstract

When your head tilts laterally, as in sports, reaching, and resting, your eyes counterrotate by less than 20% of the tilt, and thus retinal images rotate, over a total range of about 180°. Yet, the world appears stable and vision remains normal. We discovered a neural strategy for rotational stability in anterior inferotemporal cortex (IT), the final stage of object vision in primates. We measured object orientation tuning of IT neurons in macaque monkeys tilted +25° and –25° laterally, producing a ~40° difference in retinal image orientation. Among IT neurons with consistent object orientation tuning, 63% remained stable with respect to gravity across tilts. Gravitational tuning depended on vestibular/somatosensory but also visual cues, consistent with previous evidence that IT processes scene cues for gravity’s orientation. In addition to stability across image rotations, an internal gravitational reference frame is important for physical understanding of a world where object position, posture, structure, shape, movement, and behavior interact critically with gravity.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    When we tilt our heads, we do not perceive objects to be tilted or rotated. In this study, the authors investigate the underlying neural underpinnings by characterizing how neurons in monkey IT respond to objects when the entire body is tilted. They performed two experiments. In the first experiment, the authors record single neuron responses to objects rotating in the image plane, under two conditions - when the animals were tilted +20° or -20° relative to the gravitational vertical. Their main finding is that neural tuning curves for object orientation were highly correlated under these conditions. This high correlation is interpreted by the authors as indicative of encoding of object orientations relative to an absolute gravitational reference frame. To control for the possibility that the whole-body tilt could have induced compensatory torsional rotations of the eyes, the authors estimated the eye torsional rotation between the ±20° whole-body tilt to be only ±6°. In the second experiment, the authors recorded neural responses to objects rotated in the image plane with no whole-body tilt but with a visual horizon that could be tilted by the same ±20° relative to the gravitational vertical. Here too they find many neurons whose tuning curves were correlated between the two horizon tilt conditions. Based on these results, the authors argue that IT neurons represent objects relative to the gravitational or absolute vertical.

    The question of whether the visual system encodes objects relative to the gravitational vertical is an interesting and basic one, and I commend the authors for attempting this question through systematic testing of object selectivity under conditions of whole-body tilt. However, I found this manuscript extremely difficult to read, with important analyses and controls described in a very cursory fashion. I also have several major concerns about these results.

    First, the high tuning correlation in the ±20° whole-body tilt conditions could also occur if IT neurons encoded object orientation relative to other fixed contextual cues in the surrounding, such as the frame of the computer monitor. The authors ideally should have some experiment or analysis to address this potential confound, or else acknowledge that their findings can also be interpreted as the encoding of object orientation relative to contextual cues, which would dilute their overall conclusions.

    We think there are three possible interpretations of this comment. First, that visible edges, including the horizon and ground plane (in the scene stimuli), and the screen edges and other gravitationally aligned edges in the room, could serve as visual cues for the orientation of gravity. We agree with this wholeheartedly, and in fact showed a strong degree of gravitational alignment based purely on visual scene cues in Figures 3 and 4. This is consistent with our previous results suggesting computation of gravity’s direction in the middle channel of IT (Vaziri et al., Neuron 2014; Vaziri and Connor, Current Biology 2016). Our findings would not be diluted by the fact that multiple cues, not just vestibular/somatosensory but also visual, could help in computing the direction of gravity.

    Second, that overlap between objects and horizon could produce a shape-configuration interaction that changes with object orientation and produces a tuning effect that remains consistent across monkey tilts. We agree this was a possibility, and that is why we tested neurons in the isolated object condition. We have added text to better explain this concern and the control importance of the isolated object condition in the discussion of Fig. 1: “The Fig. 1 example neuron was tested with both full scene stimuli (Fig. 1a), which included a textured ground surface and horizon, providing visual cues for the orientation of gravity, and isolated objects (Fig. 1b), presented on a gray background, so that primarily vestibular and somatosensory cues indicated the orientation of gravity. The contrast between the two conditions helps to elucidate the additional effects of visual cues on top of vestibular/somatosensory cues. In addition, the isolated object condition controls for the possibility that tuning is affected by a shape-configuration (i.e. overlapping orientation) interaction between the object and the horizon or by differential occlusion of the object fragment buried in the ground (which was done to make the scene condition physically realistic for the wide variety of object orientations that would otherwise appear improbably balanced on a hard ground surface).”

    The comparable results in the isolated object condition address the reasonable concern about the horizon/object shape configuration interaction: “Similar results were obtained for a partially overlapping sample of 99 IT neurons tested with isolated object stimuli with no background (i.e. no horizon or ground plane) (Fig. 2b). In this case, 60% of neurons (32/53) showed significant correlation in the gravitational reference frame, 26% (14/53) significant correlation in the retinal reference frame, and within these groups 13% (7/53) were significant in both reference frames. The population tendency toward positive correlation was again significant in this experiment along both gravitational (p = 3.63 × 10⁻²²) and retinal axes (p = 1.63 × 10⁻⁷). This suggests that gravitational tuning can depend primarily on vestibular/somatosensory cues for self-orientation.”
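For illustration, the reference-frame comparison underlying these correlations can be sketched in a few lines: a neuron's tuning curves from the two tilt conditions are correlated directly in screen (gravity-aligned) coordinates, and again after shifting one curve by the retinal-orientation difference between tilts. This is a minimal sketch, not the analysis code used in the study; the function name, even bin spacing, and use of a circular shift are illustrative assumptions.

```python
import numpy as np

def frame_correlations(resp_a, resp_b, tilt_diff_bins):
    """Correlate one neuron's orientation tuning across two body tilts.

    resp_a, resp_b : mean responses at evenly spaced image-plane
        orientations, indexed in screen (gravity-aligned) coordinates,
        one array per tilt condition.
    tilt_diff_bins : difference in retinal image orientation between
        the two tilts, expressed in orientation-bin steps.
    """
    # Gravitational frame: screen-coordinate tuning matches directly.
    r_grav = np.corrcoef(resp_a, resp_b)[0, 1]
    # Retinal frame: tuning matches only after shifting one curve by
    # the retinal-orientation difference between the two tilts.
    r_ret = np.corrcoef(resp_a, np.roll(resp_b, tilt_diff_bins))[0, 1]
    return r_grav, r_ret
```

A neuron tuned in gravitational coordinates yields a high direct correlation and a lower shifted one; a retinally tuned neuron shows the reverse pattern.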

    Third, that the object and screen edges in the isolated object condition have an orientation interaction that influences tuning in a way that remains consistent across monkey tilt. If this was intended, we do not think this is a reasonable concern that needs mentioning in the paper itself. The closest screen edges on our large display were 28° in the periphery, and there is no reason to suspect that IT encodes orientation relationships between distant, disconnected visual elements. Screen edges have been present in all or most studies of IT, and no such interactions have been reported. We will discuss this point in online responses.

    Second, I do not fully understand torsional eye movements myself, but it is not clear to me whether this is a fixed or dynamic compensation. For instance, have the authors measured torsional eye rotations on every trial? Is it fixed always at ±6° or does it change from trial to trial? If it changes, then could the high tuning correlation between the whole-body rotations be simply driven by trials in which the eyes compensated more? The authors must provide more data or analyses to address this important control.

    We now clarify that we could only measure ocular rotation outside the experiment, with high-resolution closeup color photography, not possible on individual trials. The extensive literature on ocular counter-rotation has no indication that the degree of rotation is changed by any conditions other than tilt. Our measurements were consistent with previous reports showing that counterroll is limited to 20% of tilt. Moreover, they are consistent with our analyses showing that maximum correlation with retinal coordinates is obtained with a 6° correction for counterroll, indicating equivalent counterroll during experiments. Our analytical compensation for counterroll was based on this value, which optimized results in the retinal reference frame, so our measurements of counterroll are used only to confirm this value. Ocular rotation would need to be five times greater than any previous observations to completely compensate for tilt and mimic the gravitational tuning we observed. For these reasons, counterroll is not a reasonable explanation for our results:

    “Compensatory ocular counter-rolling was measured to be 6°, based on iris landmarks visible in high-resolution photographs, consistent with previous measurements in humans6,7, and larger than previous measurements in monkeys41, making it unlikely that we failed to adequately account for the effects of counterroll. Eye rotation would need to be five times greater than previously observed to mimic gravitational tuning. Our rotation measurements required detailed color photographs that could only be obtained with full lighting and closeup photography. This was not possible within the experiments themselves, where only low-resolution monochromatic infrared images were available. Importantly, our analytical compensation for counter-rotation did not depend on our measurement of ocular rotation. Instead, we tested our data for correlation in retinal coordinates across a wide range of rotational compensation values. The fact that maximum correspondence was observed at a compensation value of 6° (Figure 1–figure supplement 1) indicates that counterrotation during the experiments was consistent with our measurements outside the experiments.”
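The compensation sweep described here (correlating tuning in retinal coordinates across a range of assumed counterroll corrections and locating the peak) can be illustrated in the same spirit. This sketch assumes full-circle orientation sampling at a fixed bin width; the function and parameter names are hypothetical, not the study's code.

```python
import numpy as np

def peak_compensation(resp_a, resp_b, bin_deg=10, max_deg=30):
    """Sweep assumed ocular counterroll compensation (in degrees) and
    return (compensation, correlation) at the correlation peak.
    Assumes orientations sampled every bin_deg degrees, full circle."""
    best_deg, best_r = 0, -np.inf
    for shift in range(-max_deg // bin_deg, max_deg // bin_deg + 1):
        # Correlate tuning after rotating one curve by the candidate
        # compensation, expressed in orientation bins.
        r = np.corrcoef(resp_a, np.roll(resp_b, shift))[0, 1]
        if r > best_r:
            best_deg, best_r = shift * bin_deg, r
    return best_deg, best_r
```

A peak at a small compensation value, rather than at the full tilt difference, would indicate counterroll of the limited magnitude measured photographically.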

    Third, I find that when the objects were presented against a visual horizon, different object features are occluded at each orientation. This could reduce the correlation between the neural response in the retinal reference frame, thereby biasing all results away from purely retinal encoding. The authors should address this either through additional analyses or acknowledge this issue appropriately throughout.

    This idea of a shape interaction between object and horizon/ground is essentially the same concern discussed as the second interpretation of the first point, above. As outlined there, we addressed this concern in the best way possible, by removing the horizon/background (in the isolated object condition) and showing that the same results were obtained. This comment raises the related point (also cured by the isolated object condition) of differential partial occlusion at the bottom of the object, 15% (by virtual mass) of which was buried below ground to provide a realistic physical interpretation for unbalanced orientations.

    We make both concerns explicit in the revised manuscript: “The Fig. 1 example neuron was tested with both full scene stimuli (Fig. 1a), which included a textured ground surface and horizon, providing visual cues for the orientation of gravity, and isolated objects (Fig. 1b), presented on a gray background, so that primarily vestibular and somatosensory cues indicated the orientation of gravity. The contrast between the two conditions helps to elucidate the additional effects of visual cues on top of vestibular/somatosensory cues. In addition, the isolated object condition controls for the possibility that tuning is affected by a shape-configuration (i.e. overlapping orientation) interaction between the object and the horizon or by differential occlusion of the object fragment buried in the ground (which was done to make the scene condition physically realistic for the wide variety of object orientations that would otherwise appear improbably balanced on a hard ground surface).”

    And we report that the control produces similar results in the absence of horizon/background: “Similar results were obtained for a partially overlapping sample of 99 IT neurons tested with isolated object stimuli with no background (i.e. no horizon or ground plane) (Fig. 2b). In this case, 60% of neurons (32/53) showed significant correlation in the gravitational reference frame, 26% (14/53) significant correlation in the retinal reference frame, and within these groups 13% (7/53) were significant in both reference frames. The population tendency toward positive correlation was again significant in this experiment along both gravitational (p = 3.63 × 10⁻²²) and retinal axes (p = 1.63 × 10⁻⁷). This suggests that gravitational tuning can depend primarily on vestibular/somatosensory cues for self-orientation.”

    Reviewer #3 (Public Review):

    This is a very interesting study examining for the first time the influence of lateral tilt of the whole body on orientation tuning in macaque IT. They employed two types of displays: one in which the object was embedded in a scene that had a horizon and textured ground surface, and a second one with only the object. For the first type, they examined the orientation tuning with and without tilting the subject. However, the effect of tilt for the scene stimuli is difficult to interpret in terms of gravitational reference frame since varying the orientation of the object relative to the horizon leads to changes in visual features between the horizon and object. If neurons show tolerance for the global orientation of the scene (within the 50° manipulation range) then the consistent orientation tuning across tilts may just reflect tuning for the object-horizon features (like the angle between the object and the horizon line/surface) that is tolerant for the orientation of the whole scene. Thus, the effects of tilt can be purely visually-driven in this case and may reflect feature selectivity unrelated to gravitation. The difference between retinal and gravitational effects can just reflect neurons that do not care about the scene/horizon background but only about the object and neurons that respond to the features of the object relative to the background. Thus, I feel that the data using scenes cannot be used unambiguously as evidence for a gravitational reference frame. The authors also tested neurons with an object without a scene, and these data provide evidence for a gravitational reference frame. The authors should concentrate on these data and downplay the difficult-to-interpret results using scenes.

    We still believe it is important to present these two experimental conditions in parallel, because we believe that visual driving of gravitational tuning by environmental cues is important in real life, and this is substantiated by the effects of visual cues alone. But, we have tried in this revision, in response to these comments and to comments from other reviewers, to clarify the potential concerns about visual effects in the full scene experiment, the importance and meaning of the isolated object condition as a control for concerns about other kinds of tuning, and the relationships between the two experimental conditions:

    Concerns about full scene experiment and the control importance of the isolated object condition: “The Fig. 1 example neuron was tested with both full scene stimuli (Fig. 1a), which included a textured ground surface and horizon, providing visual cues for the orientation of gravity, and isolated objects (Fig. 1b), presented on a gray background, so that primarily vestibular and somatosensory cues indicated the orientation of gravity. The contrast between the two conditions helps to elucidate the additional effects of visual cues on top of vestibular/somatosensory cues. In addition, the isolated object condition controls for the possibility that tuning is affected by a shape-configuration (i.e. overlapping orientation) interaction between the object and the horizon or by differential occlusion of the object fragment buried in the ground (which was done to make the scene condition physically realistic for the wide variety of object orientations that would otherwise appear improbably balanced on a hard ground surface) …

    Similar results were obtained for a partially overlapping sample of 99 IT neurons tested with isolated object stimuli with no background (i.e. no horizon or ground plane) (Fig. 2b). In this case, 60% of neurons (32/53) showed significant correlation in the gravitational reference frame, 26% (14/53) significant correlation in the retinal reference frame, and within these groups 13% (7/53) were significant in both reference frames. The population tendency toward positive correlation was again significant in this experiment along both gravitational (p = 3.63 × 10⁻²²) and retinal axes (p = 1.63 × 10⁻⁷). This suggests that gravitational tuning can depend primarily on vestibular/somatosensory cues for self-orientation. However, we cannot rule out a contribution of visual cues for gravity in the visual periphery, including screen edges and other horizontal and vertical edges and planes, which in the real world are almost uniformly aligned with gravity and thus strong cues for its orientation (but see Figure 2–figure supplement 1). Nonetheless, the Fig. 2b result confirms that gravitational tuning did not depend on the horizon or ground surface in the background condition.”

    Cell-by-cell comparisons of scene and isolated stimuli, for those cells tested with both, in Figure 2–figure supplement 6. This figure shows 8 neurons with significant gravitational tuning only in the floating object condition, 11 neurons with tuning only in the scene condition, and 23 neurons with significant tuning in both. Thus, a majority of significantly tuned neurons were tuned in both conditions. A two-tailed paired t-test across all 79 neurons tested in this way showed that there was no significant tendency toward stronger tuning in the scene condition. The 11 neurons with tuning only in the scene condition by themselves might suggest a critical role for visual cues in some neurons. However, the converse result for 8 cells, with tuning only in the floating condition, suggests a more complex dependence on cues or a conflicting effect of interaction with the background scene for a minority of cells.
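The paired comparison of tuning strength across the two conditions is a standard two-tailed paired t-test. A minimal sketch, with hypothetical per-neuron tuning-correlation values standing in for the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-neuron gravitational tuning correlations for the
# 79 neurons tested in both conditions (illustrative values only).
scene = rng.normal(0.4, 0.2, 79)
isolated = scene + rng.normal(0.0, 0.1, 79)  # no built-in difference

# Two-tailed paired t-test for a systematic difference in tuning
# strength between the scene and isolated-object conditions.
t, p = stats.ttest_rel(scene, isolated)
```

A large p-value here would correspond to the reported result: no significant tendency toward stronger tuning in the scene condition.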

    Main text: “This is further confirmed through cell-by-cell comparison between scene and isolated conditions for those cells tested with both (Figure 2–figure supplement 6).”

    Furthermore, the analysis of the single object data should be improved and clarified.

    We have added Figure 1–figure supplements 3–10, which expand the analysis of example cells and additional cells to include all stimuli shown and smoothed tuning curves for individual repetitions of the orientation range.

    We also now present results for individual monkeys in Figure 2–figure supplements 2 and 3, and the anatomical locations of individual neurons in Figure 2–figure supplements 4 and 5.

  2. eLife assessment

    In this study the authors show that neural tuning for object orientation in IT is unaffected by whole-body tilt, suggesting that neurons are encoding objects relative to the gravitational vertical. However, these observations could also arise if IT neurons encode object orientation relative to fixed visual contextual cues rather than gravity, or from dynamic, compensatory torsional eye movements made by the animals. With these concerns adequately addressed, this would be an important study showing that IT neurons may play a role not only in object recognition but more broadly in physical scene understanding.

  3. Reviewer #1 (Public Review):

    When we tilt our heads, we do not perceive objects to be tilted or rotated. In this study, the authors investigate the underlying neural underpinnings by characterizing how neurons in monkey IT respond to objects when the entire body is tilted. They performed two experiments. In the first experiment, the authors record single neuron responses to objects rotating in the image plane, under two conditions - when the animals were tilted +20° or -20° relative to the gravitational vertical. Their main finding is that neural tuning curves for object orientation were highly correlated under these conditions. This high correlation is interpreted by the authors as indicative of encoding of object orientations relative to an absolute gravitational reference frame. To control for the possibility that the whole-body tilt could have induced compensatory torsional rotations of the eyes, the authors estimated the eye torsional rotation between the ±20° whole-body tilt to be only ±6°. In the second experiment, the authors recorded neural responses to objects rotated in the image plane with no whole-body tilt but with a visual horizon that could be tilted by the same ±20° relative to the gravitational vertical. Here too they find many neurons whose tuning curves were correlated between the two horizon tilt conditions. Based on these results, the authors argue that IT neurons represent objects relative to the gravitational or absolute vertical.

    The question of whether the visual system encodes objects relative to the gravitational vertical is an interesting and basic one, and I commend the authors for attempting this question through systematic testing of object selectivity under conditions of whole-body tilt. However, I found this manuscript extremely difficult to read, with important analyses and controls described in a very cursory fashion. I also have several major concerns about these results.

    First, the high tuning correlation in the ±20° whole-body tilt conditions could also occur if IT neurons encoded object orientation relative to other fixed contextual cues in the surrounding, such as the frame of the computer monitor. The authors ideally should have some experiment or analysis to address this potential confound, or else acknowledge that their findings can also be interpreted as the encoding of object orientation relative to contextual cues, which would dilute their overall conclusions.

    Second, I do not fully understand torsional eye movements myself, but it is not clear to me whether this is a fixed or dynamic compensation. For instance, have the authors measured torsional eye rotations on every trial? Is it fixed always at ±6° or does it change from trial to trial? If it changes, then could the high tuning correlation between the whole-body rotations be simply driven by trials in which the eyes compensated more? The authors must provide more data or analyses to address this important control.

    Third, I find that when the objects were presented against a visual horizon, different object features are occluded at each orientation. This could reduce the correlation between the neural response in the retinal reference frame, thereby biasing all results away from purely retinal encoding. The authors should address this either through additional analyses or acknowledge this issue appropriately throughout.

  4. Reviewer #2 (Public Review):

    In this paper, the authors investigate the intriguing question of what orientation reference frame the visual selectivity of neurons in the IT cortex is expressed in - a world-centered gravitational one, or a retinal one? To address this, the authors physically rotate a monkey to dissociate a gravitational from a retinal reference frame. They find surprising and compelling evidence that many cells encode selectivity in a gravitational frame. The finding raises questions about whether the function of the IT cortex is solely object recognition, or whether it might play an important role in physical scene understanding.

    In general, I found the paper clearly written, the analyses appropriate, and the results supportive of the conclusions. I think the work should spur new thinking about what the IT cortex is accomplishing. The notion that IT cells are receiving vestibular signals is likely to be unsettling for many who think of it as simply the endstage of a convolutional neural network.

  5. Reviewer #3 (Public Review):

    This is a very interesting study examining for the first time the influence of lateral tilt of the whole body on orientation tuning in macaque IT. They employed two types of displays: one in which the object was embedded in a scene that had a horizon and textured ground surface, and a second one with only the object. For the first type, they examined the orientation tuning with and without tilting the subject. However, the effect of tilt for the scene stimuli is difficult to interpret in terms of gravitational reference frame since varying the orientation of the object relative to the horizon leads to changes in visual features between the horizon and object. If neurons show tolerance for the global orientation of the scene (within the 50° manipulation range) then the consistent orientation tuning across tilts may just reflect tuning for the object-horizon features (like the angle between the object and the horizon line/surface) that is tolerant for the orientation of the whole scene. Thus, the effects of tilt can be purely visually-driven in this case and may reflect feature selectivity unrelated to gravitation. The difference between retinal and gravitational effects can just reflect neurons that do not care about the scene/horizon background but only about the object and neurons that respond to the features of the object relative to the background. Thus, I feel that the data using scenes cannot be used unambiguously as evidence for a gravitational reference frame. The authors also tested neurons with an object without a scene, and these data provide evidence for a gravitational reference frame. The authors should concentrate on these data and downplay the difficult-to-interpret results using scenes. Furthermore, the analysis of the single object data should be improved and clarified.