Invariant representation of physical stability in the human brain

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This is an intriguing study using cleverly designed stimuli to investigate the representation of physical stability in the human brain. This paper will be of interest to readers wondering when human cognition uses generalizable pattern matching similar to that used by machine learning algorithms, and when it relies on more specialized processes evolved for specific tasks. The well-crafted experiments generally support the authors' major claim.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their name with the authors.)

Abstract

Successful engagement with the world requires the ability to predict what will happen next. Here, we investigate how the brain makes a fundamental prediction about the physical world: whether the situation in front of us is stable, and hence likely to stay the same, or unstable, and hence likely to change in the immediate future. Specifically, we ask if judgments of stability can be supported by the kinds of representations that have proven to be highly effective at visual object recognition in both machines and brains, or instead if the ability to determine the physical stability of natural scenes may require generative algorithms that simulate the physics of the world. To find out, we measured responses in both convolutional neural networks (CNNs) and the brain (using fMRI) to natural images of physically stable versus unstable scenarios. We find no evidence for generalizable representations of physical stability in either standard CNNs trained on visual object and scene classification (ImageNet), or in the human ventral visual pathway, which has long been implicated in the same process. However, in frontoparietal regions previously implicated in intuitive physical reasoning we find both scenario-invariant representations of physical stability, and higher univariate responses to unstable than stable scenes. These results demonstrate abstract representations of physical stability in the dorsal but not ventral pathway, consistent with the hypothesis that the computations underlying stability entail not just pattern classification but forward physical simulation.
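The central analysis can be summarized as a cross-scenario generalization test. Below is a minimal sketch, assuming placeholder data (the array names, sizes, and scenario labels are illustrative and not the authors' pipeline): a linear classifier is trained to separate stable from unstable response patterns (CNN unit activations or fMRI voxel patterns) in one scenario type, then tested on a different scenario type.

```python
# Minimal sketch of the cross-scenario generalization test (illustrative only,
# not the authors' analysis code). Random arrays stand in for CNN activations
# or fMRI voxel patterns; labels: 1 = unstable, 0 = stable.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_features = 60, 500

X_towers = rng.normal(size=(n_images, n_features))   # e.g., block-tower scenes
y_towers = rng.integers(0, 2, size=n_images)
X_people = rng.normal(size=(n_images, n_features))   # e.g., people-on-ladders scenes
y_people = rng.integers(0, 2, size=n_images)

clf = LinearSVC(C=1.0, max_iter=10_000)
clf.fit(X_towers, y_towers)                           # train within one scenario type

within_acc = clf.score(X_towers, y_towers)            # (optimistic) within-scenario fit
cross_acc = clf.score(X_people, y_people)             # key test: transfer to a new scenario

print(f"within-scenario accuracy: {within_acc:.2f}")
print(f"cross-scenario accuracy:  {cross_acc:.2f}")
# A scenario-invariant representation of stability predicts above-chance
# cross_acc; with random placeholder data it will hover around 0.5.
```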

Article activity feed

  1. Author Response

    Reviewer #2 (Public Review):

    This paper combines neuroimaging, behavioral experiments, and computational modeling to argue that (a) there is a network of brain areas that represent physical stability, (b) these areas do so in a way that generalizes across many kinds of instability (e.g., not only a tower of blocks about to fall over, but also a person about to fall off a ladder), and (c) that this supports a simulation account of physical reasoning, rather than one based on feedforward processing; this last claim arises through a comparison of humans to CNNs, which do an OK job classifying physical instability but not in a way that transfers across these different stability classes. In my opinion, this is a lovely contribution to the literatures on both intuitive physical reasoning and (un)humanlike machine vision. At the same time, I wasn't sure that the broader conclusions followed from the data in the way the authors preferred, and I also had some concerns about some of the methodological choices made here.

    1. The following framing puzzled me a bit, and even seemed to raise an unaddressed confound in the paper: "Here we investigate how the brain makes the most basic prediction about the physical world: whether the situation in front of us is stable, and hence likely to stay the same, or unstable, and hence likely to change in the immediate future".

    Consider the following minor worry, which sets up a more major one: This framing, which connects 'stability' to 'change' and which continues throughout the paper, seems to equivocate on the notion of 'stability'. One meaning of 'stable' is, roughly, 'unchanging'. Another meaning is 'unlikely to fall over'. The above quotation, along with others like it, makes it seem like the authors are investigating the former, since that's the only meaning that makes this quotation make sense. But in fact the experiments are about the latter -- towers falling down, people falling off ladders, etc. But these aren't the same thing! So there's a bit of wordplay happening here, it seemed to me.

    This sets up the more serious worry. As this framing reveals, unstable scenes (in the likely-to-fall-over sense) are, by their nature, scenes where something is likely to change. In that case, how do we know that the brain areas this project has identified aren't representing 'likeliness to change', rather than physical stability? There are, of course, many objects and scenes that might be highly likely to change without being at all physically unstable. Even the first example in the paper ("a dog about to give chase") is about likely changes without any physical instability. But isn't this a confound? All of the examples of physical instability explored here also involve likeliness to change! So these could be 'likely to change' brain areas, not 'physically unstable' brain areas. Right? Or if not, what am I missing?

    The caption of Figure 1 seems to get at this a bit, but in a way I admit I just found a bit confusing. If authors do after all intend "physically unstable" to mean "likely to change", then many classes of scenarios that are unexplored here seem like they would be relevant: a line of sprinters about to dash off in a race, someone about to turn off all the lights in a home, a spectacular chemical reaction about to start, etc. But the authors don't intend those scenarios to fall under the current project, right?

    The reviewer is correct that "stability" has (at least) these two different meanings, and also correct that we are investigating here the situation in which a configuration is not changing now but would be likely to change with just the slightest perturbation. Our hypothesis is that the “Physics Network” will be sensitive to the likelihood that a physical configuration will change for physical (not social) reasons. That is what our data show: we do not find the same univariate and multivariate effects for situations that are likely to change because of the behavior of an animal. This indicates that what we are decoding is not general ‘likeliness to change’ but rather physical instability in particular.

    (Also: Is stability really 'the most basic prediction' we make about the world? Who is to say that stable vs. unstable is a more basic judgment than, say, present vs. absent, or expected vs. unexpected, or safe vs. unsafe, etc? I know this is mostly just trying to get the reader excited about the results, but I stumbled there.)

    We have now modified the sentence to say: “…how the brain makes a fundamental prediction about the physical world: whether the situation in front of us is stable, and hence likely to stay the same, or unstable, and hence likely to change in the immediate future.”

    2. Laying out these issues in terms of feedforward processing vs. simulation felt a bit misleading and/or unfair to those views, given the substance of what this paper is actually doing. In particular, the feedforward view ends up getting assimilated to "what CNNs do"; but these are completely different hypotheses (or at least can be). Note, for example, that many vision researchers who don't think CNNs are good models of human vision nevertheless do think that lots of what human vision does is feedforward; that view could only be coherent if there are kinds of feedforward processing that are un-CNN-like. It would be better not to conflate these two and just say that the pattern of results rules out CNN-like feedforward processing without ruling out feedforward processing in general.

    This is a fair point, and we certainly agree that we cannot rule out all feedforward models. We have tried to be clear about this claim, e.g., in the Discussion: “Three lines of evidence from the present study indicate that pattern recognition alone – as instantiated in feedforward CNNs and the ventral visual pathway – is unlikely to explain physical inference in humans, at least for the case of physical stability.”

    3a. I wasn't sure how impressed to be by the fact that, say, 60% classification accuracy on one class of stable/unstable scenes doesn't lead to above-chance performance on another class of stable/unstable scenes. Put differently, it seems that the CNNs simply didn't do a great job classifying physical stability in the first place; in that case, how general should we expect their representations to be anyway? Now, on one hand, I could see this worry only further supporting the authors' case, since you could think of this as all the more evidence that CNNs won't have representations of stability in them. But since (a) the claims the authors are making are about feedforward processing in principle, not just in one or two CNNs, and (b) the purpose of this paper is to explore the issue of generality per se, rather than just stability, this seems inadequate. It could be that a CNN that does achieve high accuracy on physical stability judgments (90%?) would actually show this kind of general transfer; but we don't know that from the data presented here, because it's possible that the lack of generality arises from poor performance to begin with.

    You are correct in noting that CNNs don’t do a great job of classifying physical stability, which reinforces our point that pattern recognition systems are not very good at discerning physical stability. In fact, the classification accuracy we report is close to the baseline performance in the literature (Lerer et al., 2016). Interestingly, even training on the block tower dataset itself brought stability classification accuracy up to only 68.8% on real-world block tower images. While this is true of the current best model of stability detection, we think that CNNs trained on large-scale datasets of stability under varying scenarios may in the future generalize to other natural scenarios. However, to our knowledge no such datasets exist.
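    For context on how such stability classifiers are typically built, the following is a hedged sketch of one common recipe: extract features from an ImageNet-trained CNN and fit a linear probe for stable versus unstable. The choice of VGG-16, the layer read out, and the image/label lists are assumptions made for illustration, not the specific setup of this paper or of Lerer et al. (2016).

    ```python
    # Hedged sketch of a standard "linear probe" stability classifier
    # (illustrative; not the pipeline used in the paper).
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    from sklearn.linear_model import LogisticRegression

    # Standard ImageNet preprocessing.
    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

    def penultimate_features(image_path: str) -> torch.Tensor:
        """Return fc7-style features (the penultimate fully connected layer)."""
        x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            f = backbone.features(x)
            f = backbone.avgpool(f).flatten(1)
            f = backbone.classifier[:-1](f)   # drop the final 1000-way readout
        return f.squeeze(0)

    # Hypothetical file/label lists (1 = unstable, 0 = stable):
    # X = torch.stack([penultimate_features(p) for p in train_paths]).numpy()
    # probe = LogisticRegression(max_iter=1000).fit(X, train_labels)
    # Cross-scenario transfer is then probe.score(...) on features extracted
    # from images of a different scenario type.
    ```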

    3b. I wasn't sure how to think about whether showing CNNs stable and unstable scenes is a fair test of their ability to represent physical stability. Do we know that stability is all that these images have in common? Maybe the CNN is doing a great job learning some other representation. This sort of thing comes up in some recent discussions of 'shortcuts' and/or the 'fairness' of comparisons between human and machine vision, including some recent theoretical papers (see author recommendations for specific suggestions here).

    If our point were that CNNs do a great job at representing physical stability, we would indeed have to worry about low-level image confounds or “shortcuts” enabling this performance. But our point is that they do badly. If some of their already bad performance is due to image confounds/shortcuts, then they are in fact doing even worse, and that only makes our point stronger.
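    One way such a shortcut concern is often probed is sketched below; this is purely a hypothetical illustration (not an analysis reported in the paper), asking whether trivial low-level image statistics alone can separate stable from unstable images. The feature choices and the image/label lists are assumptions.

    ```python
    # Hypothetical "shortcut" check (not from the paper): can trivial low-level
    # image statistics alone separate stable from unstable images?
    import numpy as np
    from PIL import Image
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def lowlevel_features(path: str) -> np.ndarray:
        img = np.asarray(Image.open(path).convert("L"), dtype=float) / 255.0
        gy, gx = np.gradient(img)                   # simple intensity gradients
        return np.array([img.mean(),                # mean luminance
                         img.std(),                 # RMS contrast
                         np.hypot(gx, gy).mean()])  # edge energy

    # image_paths and labels (1 = unstable, 0 = stable) are hypothetical lists:
    # X = np.stack([lowlevel_features(p) for p in image_paths])
    # scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    # Accuracy well above chance here would flag a low-level confound rather
    # than a genuine representation of physical stability.
    ```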

    4a. I didn't really follow this passage, which is relied on to interpret greater activity for unstable vs stable scenes: "we reasoned that if the candidate physics regions are engaged automatically in simulating what will happen next, they should show a higher mean response when viewing physically unstable scenes (because there is more to simulate) than stable scenes (where nothing is predicted to happen)." It seems true enough that, once one knows that a scene is stable, one doesn't then need a dynamically updated representation of its unfolding. But the question that this paper is about is how we determine, in the first place, that a scene is stable or not. The simulations at issue are simulations one runs before one knows their outcome, and so it wasn't clear at all to me that there is always more to simulate in an unstable scene. Stable scenes may well have a lot to simulate, even if we determine after those hefty simulations that the scene is stable after all. And of course unstable scenes might well have very little to simulate, if the scene is simple and the instability is straightforwardly evident. Can the authors say more about why it's easier to determine that a stable scene is stable than that an unstable scene is unstable? They may have a good answer! It would just be better to see it in the paper.

    The idea here is that forward simulation happens in all cases but stops if no change has occurred since the last frame. That stopping both represents the stability of the configuration and produces less activity. This idea is akin to the “sleep state” used for nonmoving objects in a physics engine: they do not need to be re-simulated or re-rendered if they have not moved since the last frame (Ullman et al., 2017, TICS).
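    To make the analogy concrete, here is a toy sketch of the sleep-state mechanism (our own illustrative code, not taken from the paper or from any particular physics engine): bodies that have stopped moving are put to sleep and skipped on subsequent frames, so a stable configuration quickly stops generating simulation work while an unstable one keeps being simulated.

    ```python
    # Toy sketch of a physics-engine "sleep state" (illustrative assumption,
    # not code from the paper): bodies whose motion stays below a threshold for
    # several consecutive frames are put to sleep and skipped by the stepper.
    from dataclasses import dataclass

    SLEEP_SPEED = 1e-3    # speed below which a body counts as "not moving"
    SLEEP_FRAMES = 10     # consecutive quiet frames before going to sleep

    @dataclass
    class Body:
        height: float
        velocity: float = 0.0
        quiet_frames: int = 0
        asleep: bool = False

    def step(bodies, dt=1.0 / 60, gravity=-9.8):
        """Advance one frame; return how many bodies still needed simulating."""
        work = 0
        for b in bodies:
            if b.asleep:
                continue                      # stable bodies cost nothing
            work += 1
            b.velocity += gravity * dt        # toy dynamics: free fall only
            b.height += b.velocity * dt
            if b.height <= 0.0:               # crude ground contact
                b.height, b.velocity = 0.0, 0.0
            if abs(b.velocity) < SLEEP_SPEED:
                b.quiet_frames += 1
                if b.quiet_frames >= SLEEP_FRAMES:
                    b.asleep = True           # nothing changed: stop simulating
            else:
                b.quiet_frames = 0
        return work

    # A body resting on the ground (stable) goes to sleep after a few frames;
    # a body dropped from 5 m (unstable) keeps being simulated until it settles.
    bodies = [Body(height=0.0), Body(height=5.0)]
    per_frame_work = [step(bodies) for _ in range(120)]
    print(per_frame_work[:15], "... total work:", sum(per_frame_work))
    ```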

    4b. I was confused a bit by the Animals-People condition, and whether to think of it as a control condition or not. The image of it in Figure 1a makes it seem like it is meant to be interpreted along the usual "physical stability" lines, just like falling towers and people on ladders, and the caption seems to say this too; it also makes intuitive sense since the man in the boat looks like he'll fall if and when the alligator attacks. But then in the main text the authors predict that the representations of stability would not extend to the Animals-People condition, because they are just supposed to be about peril but not stability. Why not? And then the results themselves are equivocal, with some findings generalizing to Animals-People and some not. I don't have much more to say here other than that I found this hard to follow.

    We used the Animals-People condition as a control for peril/instability that is not caused by the physical situation (but rather by another agent). Our hypothesis was that the “Physics Network” would hold information about physical stability, not just any kind of propensity for change for any reason. Hence, we predicted that any brain region responding (only) to physical stability should not respond in a similar way to the peril/non-peril conditions in the Animals-People scenario, since those involve an interaction driven by a biological agent. That is what we found.

  2. Evaluation Summary:

    This is an intriguing study using cleverly designed stimuli to investigate the representation of physical stability in the human brain. This paper will be of interest to readers wondering when human cognition uses generalizable pattern matching similar to that used by machine learning algorithms, and when it relies on more specialized processes evolved for specific tasks. The well-crafted experiments generally support the authors' major claim.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #3 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    Pramod and colleagues find that inferred physical stability of visual scenes is represented in parietal-frontal areas designated as the "Physics Network", but not in the ventral visual pathway. Furthermore, unlike in previous studies, they report that physical stability cannot be determined from representations in standard CNNs trained on object classification. These novel findings result from studying inferred stability not in one type of image only, but by testing generalization between different types of scenes (object- and people-dominated scenes, respectively). The authors combine a number of sophisticated measurement and analysis techniques to substantiate their claims: CNNs, fMRI in combination with MVPA, eye-tracking, and clever control conditions.

    The authors speculate that this "Physics Network" contains a generative model, running forward simulations of dynamic physical scenes. This is an intriguing hypothesis that, of course, requires much further testing. For instance, how does this Physics Network relate to the dorsal attention network? This, together with the fact that unstable scenes evoked more activity than stable ones, could make one wonder whether we're simply looking at stronger vs. weaker attention (or engagement) rather than something specific to the physics of the scene. However, eye-movement analyses, as well as the perilous-vs-non-perilous condition as a control for arousal and attention, help take care of these concerns.

    It will also be important in future to dissociate effects of physical stability from effects of implied motion. In the current study, a higher response to unstable vs stable scenes is reported in motion area MT, which the authors describe in terms of implied motion. It therefore seems possible to describe the results in the "Physics Network" in terms of implied motion, rather than physical stability, as well.

    In short, much future work remains to be done to ascertain what computations exactly take place in this Physics Network during inference on physical stability, but the current study represents an intriguing step forward in this domain.

  4. Reviewer #2 (Public Review):

    This paper combines neuroimaging, behavioral experiments, and computational modeling to argue that (a) there is a network of brain areas that represent physical stability, (b) these areas do so in a way that generalizes across many kinds of instability (e.g., not only a tower of blocks about to fall over, but also a person about to fall off a ladder), and (c) that this supports a simulation account of physical reasoning, rather than one based on feedforward processing; this last claim arises through a comparison of humans to CNNs, which do an OK job classifying physical instability but not in a way that transfers across these different stability classes. In my opinion, this is a lovely contribution to the literatures on both intuitive physical reasoning and (un)humanlike machine vision. At the same time, I wasn't sure that the broader conclusions followed from the data in the way the authors preferred, and I also had some concerns about some of the methodological choices made here.

    1. The following framing puzzled me a bit, and even seemed to raise an unaddressed confound in the paper: "Here we investigate how the brain makes the most basic prediction about the physical world: whether the situation in front of us is stable, and hence likely to stay the same, or unstable, and hence likely to change in the immediate future".

    Consider the following minor worry, which sets up a more major one: This framing, which connects 'stability' to 'change' and which continues throughout the paper, seems to equivocate on the notion of 'stability'. One meaning of 'stable' is, roughly, 'unchanging'. Another meaning is 'unlikely to fall over'. The above quotation, along with others like it, makes it seem like the authors are investigating the former, since that's the only meaning that makes this quotation make sense. But in fact the experiments are about the latter -- towers falling down, people falling off ladders, etc. But these aren't the same thing! So there's a bit of wordplay happening here, it seemed to me.

    This sets up the more serious worry. As this framing reveals, unstable scenes (in the likely-to-fall-over sense) are, by their nature, scenes where something is likely to change. In that case, how do we know that the brain areas this project has identified aren't representing 'likeliness to change', rather than physical stability? There are, of course, many objects and scenes that might be highly likely to change without being at all physically unstable. Even the first example in the paper ("a dog about to give chase") is about likely changes without any physical instability. But isn't this a confound? All of the examples of physical instability explored here also involve likeliness to change! So these could be 'likely to change' brain areas, not 'physically unstable' brain areas. Right? Or if not, what am I missing?

    The caption of Figure 1 seems to get at this a bit, but in a way I admit I just found a bit confusing. If authors do after all intend "physically unstable" to mean "likely to change", then many classes of scenarios that are unexplored here seem like they would be relevant: a line of sprinters about to dash off in a race, someone about to turn off all the lights in a home, a spectacular chemical reaction about to start, etc. But the authors don't intend those scenarios to fall under the current project, right?

    (Also: Is stability really 'the most basic prediction' we make about the world? Who is to say that stable vs. unstable is a more basic judgment than, say, present vs. absent, or expected vs. unexpected, or safe vs. unsafe, etc? I know this is mostly just trying to get the reader excited about the results, but I stumbled there.)

    2. Laying out these issues in terms of feedforward processing vs. simulation felt a bit misleading and/or unfair to those views, given the substance of what this paper is actually doing. In particular, the feedforward view ends up getting assimilated to "what CNNs do"; but these are completely different hypotheses (or at least can be). Note, for example, that many vision researchers who don't think CNNs are good models of human vision nevertheless do think that lots of what human vision does is feedforward; that view could only be coherent if there are kinds of feedforward processing that are un-CNN-like. It would be better not to conflate these two and just say that the pattern of results rules out CNN-like feedforward processing without ruling out feedforward processing in general.

    3a. I wasn't sure how impressed to be by the fact that, say, 60% classification accuracy on one class of stable/unstable scenes doesn't lead to above-chance performance on another class of stable/unstable scenes. Put differently, it seems that the CNNs simply didn't do a great job classifying physical stability in the first place; in that case, how general should we expect their representations to be anyway? Now, on one hand, I could see this worry only further supporting the authors' case, since you could think of this as all the more evidence that CNNs won't have representations of stability in them. But since (a) the claims the authors are making are about feedforward processing in principle, not just in one or two CNNs, and (b) the purpose of this paper is to explore the issue of generality per se, rather than just stability, this seems inadequate. It could be that a CNN that does achieve high accuracy on physical stability judgments (90%?) would actually show this kind of general transfer; but we don't know that from the data presented here, because it's possible that the lack of generality arises from poor performance to begin with.

    3b. I wasn't sure how to think about whether showing CNNs stable and unstable scenes is a fair test of their ability to represent physical stability. Do we know that stability is all that these images have in common? Maybe the CNN is doing a great job learning some other representation. This sort of thing comes up in some recent discussions of 'shortcuts' and/or the 'fairness' of comparisons between human and machine vision, including some recent theoretical papers (see author recommendations for specific suggestions here).

    4a. I didn't really follow this passage, which is relied on to interpret greater activity for unstable vs stable scenes: "we reasoned that if the candidate physics regions are engaged automatically in simulating what will happen next, they should show a higher mean response when viewing physically unstable scenes (because there is more to simulate) than stable scenes (where nothing is predicted to happen)." It seems true enough that, once one knows that a scene is stable, one doesn't then need a dynamically updated representation of its unfolding. But the question that this paper is about is how we determine, in the first place, that a scene is stable or not. The simulations at issue are simulations one runs before one knows their outcome, and so it wasn't clear at all to me that there is always more to simulate in an unstable scene. Stable scenes may well have a lot to simulate, even if we determine after those hefty simulations that the scene is stable after all. And of course unstable scenes might well have very little to simulate, if the scene is simple and the instability is straightforwardly evident. Can the authors say more about why it's easier to determine that a stable scene is stable than that an unstable scene is unstable? They may have a good answer! It would just be better to see it in the paper.

    4b. I was confused a bit by the Animals-People condition, and whether to think of it as a control condition or not. The image of it in Figure 1a makes it seem like it is meant to be interpreted along the usual "physical stability" lines, just like falling towers and people on ladders, and the caption seems to say this too; it also makes intuitive sense since the man in the boat looks like he'll fall if and when the alligator attacks. But then in the main text the authors predict that the representations of stability would not extend to the Animals-People condition, because they are just supposed to be about peril but not stability. Why not? And then the results themselves are equivocal, with some findings generalizing to Animals-People and some not. I don't have much more to say here other than that I found this hard to follow.

    5. "Interestingness" ratings felt like a not-quite-adequate approach for evaluating how attention-grabbing the towers were. A Bach concerto is more interesting than a gunshot (and would be rated that way, I imagine), but the gunshot is surely more attention-grabbing. Why not use a measure like how much they distract from another task? That's the sort of thing I'd have expected, in any case.

  5. Reviewer #3 (Public Review):

    The present study by Pramod et al. argues for the hypothesis that certain fronto-parietal regions assess physical stability via forward simulation rather than via pattern matching. This work follows recent studies (e.g., Fischer et al., 2016, and Schwettmann et al., 2019) that examined the human ability to reason intuitively about the physics of objects in the environment. Using fMRI pattern analyses, the authors of the present work find evidence of generalizable representation of scene stability in the fronto-parietal areas identified by Fischer et al. (2016), but not in the ventral visual pathway, known for pattern matching and object recognition. This fronto-parietal representation does not extend to "unstable" scenarios that depicted animals, ruling out the possibility that these regions are simply responding to more dangerous or more interesting scenes. Importantly, the authors find that a convolutional neural network (CNN) trained to assess physical stability in one of three scene types is not able to accurately assess stability in the other two scene types, lending further support to the argument that assessment of physical stability cannot be achieved merely via pattern matching, at which CNNs excel. The authors suggest that the brain, in its processing of physical variables, is more like a physics engine than like a CNN.

    This manuscript is well-written, the research has sound methodology and experimental design, and the figures convey the authors' argument effectively. The results add to recent investigations of physical reasoning and its representation in the brain. More broadly, this work informs recent speculation about how much (and under what circumstances) the brain relies on generalizable pattern matching processes similar to those utilized by statistically driven machine learning models.

    Put in the context of previous work by members of the same research group, the key addition of the present work is that predictions of stability are tested using both fMRI pattern analyses in dorsal and ventral cortex (as in prior work), and using pattern-recognition machine learning with a CNN (a new addition). The authors reference studies indicating that CNNs trained on assessing stability of block towers may be able to generalize to slightly different block towers. However, since previous research had only shown CNNs capable of assessing physical stability in scenarios relatively similar to those which they had been trained on, the failure of the CNN in the present study to generalize across very different physical scenarios is not necessarily surprising. Nonetheless, comparing CNN performance to human brain responses towards the same stimuli is a worthwhile paradigm for investigating the brain's usage of pattern matching versus more specialized processing.

    One major concern with the interpretation of the findings, however, is that while it is relatively straightforward to tell whether the stimuli in the Animals-People condition are "unstable" or "stable" scenes, the "unstable" stimuli in the Physical-People condition seem more difficult to comprehend than the "stable" ones, perhaps due to their unusual scene composition. For instance, it can take a moment to understand what is going on in the scene depicting a man precariously perched atop a ladder, with the ladder balanced on one leg over a stairwell (Figure 2A). This leaves open the possibility that the elevated activity observed in the fronto-parietal areas may not reflect forward simulation of unstable (over stable) scenes involving physical objects and people, but instead the demands of parsing difficult-to-comprehend scenes (which happen to be unstable and physical) over straightforward scenes (which are either stable or involve animals). Put simply, the fMRI responses recorded in fronto-parietal cortex may have more to do with scene comprehension than with physical stability.