Trial-by-trial predictions of subjective time from human brain activity

Abstract

Human experience of time exhibits systematic, context-dependent deviations from veridical clock time; for example, time is experienced differently at work than on holiday. Here we test the proposal that differences from clock time in subjective experience of time arise because time estimates are constructed by accumulating the same quantity that guides perception: salient events. Healthy human participants watched naturalistic, silent videos of up to ∼1 minute in duration and estimated their duration while fMRI was acquired. We were able to reconstruct trial-by-trial biases in participants’ duration reports, which reflect subjective experience of time (rather than veridical clock time), purely from salient events in their visual cortex BOLD activity. This was not the case for control regions in auditory and somatosensory cortex, even though clock time could be predicted from activity in all three brain areas. Our results reveal that the information arising during sensory processing of our dynamic environment provides a sufficient basis for reconstructing human subjective time estimates.

Article activity feed

  1. ### Reviewer #3:

    This is a potentially interesting work that addresses a key question in the temporal cognition field: how perceived duration is represented in the human brain. I found the manuscript well written and the methodology sound. Analysis-wise, the authors make a considerable effort to model the fMRI data in several ways. They even use an artificial network model to show that accumulating salient events can mimic human duration perception.

    Despite this considerable effort, however, I found the results and a few aspects of the analysis not entirely convincing.

    Below I list my comments:

    1. The authors talk about salient events and the accumulation of them. But what are these events? Are they moving objects, changes of edges, or changes in luminance? I feel that a better characterization of the visual properties of the stimuli is missing here. This information is also important for understanding the events underlying the BOLD changes. According to the authors, perceived time is a function of the BOLD changes associated with these events. It is therefore crucial to specify what these events actually are. Can we consider eye movements salient events?

    2. The authors record eye movements, but as far as I can tell from the manuscript, they do not incorporate this information in any of the analyses. Do eye movements correlate with the predicted bias and/or with the human bias?

    I think the results would greatly benefit from a better specification of the type of events leading to brain changes and consequently to duration perception.

    3. I found it puzzling that BOLD changes in auditory and somatosensory cortices predict physical time. How is this possible? Is there a brain area where physical duration cannot be predicted?

    4. The lack of differences in predicting perceived time across the different levels of the visual hierarchy is a bit disappointing. The result suggests that any accumulated change in visual cortex activity leads to perceptual bias. I think it is very unlikely that different parts of the visual stream contribute in the same way to duration perception.

    5. The model prediction works for both algorithms used to quantify BOLD changes. If I understand correctly, we cannot tell whether it is a difference in change or the change itself that leads to the duration bias. I found this aspect of the results also not very informative.

    6. In how many subjects was it actually possible to predict perceived duration from BOLD activity? A clearer picture of how the model works in individual subjects would be more convincing.

  2. ### Reviewer #2:

    Sherman et al seek to understand the basis of human time perception using a combination of psychophysics, computational modeling, and fMRI. This work builds on previously published work by the same group (Roseboom, Nature Communications 2019) showing that integrated changes in the state of (a) deep image classification network(s) during the presentation of movies predicted aspects of human timing reports. In that study, similar to what is shown in the current manuscript, timing biases were found in human behavior for different movie scene types, for example, city, natural scenes, or offices. Interestingly, similar biases were found in the timing estimates produced by their integrated deep network state change procedure. They interpret these findings as evidence that estimates of duration are derived from changes in the state of perceptual networks, in this case presumably those involved in visual perception. I find this previous work to be an important contribution toward understanding how the brain constructs information about a fundamental dimension of the environment for which there are no obvious sensors.
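
    For readers less familiar with that work, my understanding is that the procedure takes roughly the following form: frame-to-frame changes in a layer's activations are accumulated as "salient events" whenever they exceed a threshold that otherwise decays, and the accumulated count is then mapped onto reported duration. The sketch below is schematic Python with invented parameter names (`start_thresh`, `min_thresh`, `decay`), not the authors' implementation.

    ```python
    import numpy as np

    def accumulate_salient_events(activations, start_thresh=10.0,
                                  min_thresh=1.0, decay=0.99):
        """Schematic change-accumulation over a (frames x units) array of
        layer activations: count a 'salient event' whenever the change from
        the previous frame exceeds a threshold that otherwise decays."""
        thresh = start_thresh
        events = 0
        for prev, curr in zip(activations[:-1], activations[1:]):
            change = np.linalg.norm(curr - prev)  # distance between successive states
            if change > thresh:
                events += 1             # register a salient event...
                thresh = start_thresh   # ...and reset the threshold
            else:
                thresh = max(min_thresh, thresh * decay)  # threshold relaxes
        return events  # accumulated events are then mapped onto reported seconds
    ```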

    In the current study, the authors repeat many of the steps of the previous publication, but in the context of humans estimating the duration of silent movies while positioned in an MRI scanner. They compute BOLD signals during movie viewing using a set of techniques I am not intimately familiar with, because I do not use MR to assess brain activity in my own research, but which seem standard from what I can tell. They then treat the voxel-by-voxel BOLD measures much as they did the nodes of the deep network, and show that estimates derived from visual cortices may correlate with human biases and effects of scene type, but not estimates derived from voxels in auditory or somatosensory cortices. While I have some technical questions, I find the work to be overall well reasoned and clearly presented.

    My major issue with the paper is the following: given that their previous publication already showed that human behavior exhibits timing biases that correlate with the rate of change in visual scenes, and given what we know about the localization of modality-specific sensory function in cortex, it would be worrying if the authors could not derive time estimates from a measure of neural activity in visual cortex. The core hypothesis they are testing seems to be whether one can extract a measure of change in visual scenes from BOLD signals recorded in visual cortex. Finding that one can indeed do so is not particularly surprising, and thus represents a relatively incremental advance relative to what was known before. In terms of novelty, what we are left with is the observation that different metrics computed on per-voxel BOLD changes differ in their ability to reproduce timing biases by scene type. However, clarification is needed regarding how these metrics are computed to fully assess the importance of these differences.

    The authors state that they compute the Euclidean distance between voxel activations from TR to TR. However, it looks like they are computing the L1 norm of the differences, i.e., the Manhattan/city-block distance. Which is it?

    Why should the sum of signed differences provide a different result? Is it that in the distance measurement noise accumulates in the measure over voxels, whereas in the signed difference this noise is canceled out by averaging? Some intuition would be helpful.
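
    To make the question concrete, here is a minimal numeric sketch (synthetic data and invented variable names, not the authors' code) contrasting the three candidate measures on two TR patterns that differ only by zero-mean noise; it illustrates why the strictly positive distance measures acquire a noise floor that the signed sum does not:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_voxels = 5000

    # Two successive TR patterns: identical underlying signal,
    # differing only by zero-mean measurement noise.
    signal = rng.normal(size=n_voxels)
    tr1 = signal + rng.normal(scale=0.1, size=n_voxels)
    tr2 = signal + rng.normal(scale=0.1, size=n_voxels)
    diff = tr2 - tr1

    l2 = np.linalg.norm(diff)    # Euclidean (L2) distance
    l1 = np.abs(diff).sum()      # Manhattan / city-block (L1) distance
    signed = diff.sum()          # sum of signed differences

    print(f"L2: {l2:.1f}, L1: {l1:.1f}, signed: {signed:.1f}")
    # Although nothing 'happened' between the two TRs, L1 and L2 are strictly
    # positive, so noise contributes a systematic floor to every TR-to-TR step
    # (L1 especially, since |noise| is summed over all voxels). The signed sum
    # is zero-mean: positive and negative noise deviations largely cancel.
    ```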

    Writing-level comments:

    1. Regarding the framing and discussion of the experiments, I am not sure why the authors see their results as incompatible with, rather than complementary to, some of the existing proposals for time encoding in the brain. For example, the impact of sensory change on responses in perceptual networks might very well have an influence on the dynamics of downstream neural populations, potentially through neuromodulators, so I don't see the obvious incompatibility. This is not to say that the authors are not addressing an important problem, namely why sensory change biases timing reports.

    For example, I think this statement is a bit inaccurate and unnecessary:

    "...This end-to-end account of time perception represents a significant advance over homuncular accounts that depend on "clocks" in the brain. "

    2. I wouldn't say their work represents an "end-to-end" account of time perception, and certainly not an end-to-end account of the behavior they are studying. What happens in more naturalistic situations where people are moving and taking in other sensory modalities? How does this time perception information get transformed into the behavioral report of individuals, for example? The authors don't need to overreach for the work to be interesting. The authors also seem to be implying that the previously cited studies assume a specialized clock somewhere, whereas in fact Tsao et al. and Soares et al., at least, are explicitly saying the opposite. From my perspective, the field views the idea of explicit "clocks" as a bit antiquated, and rather sees timing as an emergent property of the functions that neural circuits are optimized to perform... an idea that seems compatible with the authors' work.

  3. ### Reviewer #1:

    In this manuscript, Sherman and colleagues present videos of natural scenes and measure the fMRI responses of visual cortex. The addition of fMRI data aims to link both perceived duration and neural network activity differences to a common neural substrate, the sensory cortex. The authors propose that this therefore shows "the processes underlying subjective time have their neural substrates in perceptual and memory systems, not systems specialized for time itself". I generally appreciate the aim of providing an integrated account linking duration perception to specific neural substrates, and moving away from non-specific clock models. I also appreciate the pre-registration and open science principles throughout the manuscript. However, the fMRI results described here are unsurprising and can be seen as replicating other recent findings (outside the field of timing).

    Furthermore, the links between (previously described) deep network results and the fMRI results are unconvincing. Finally, a lot is made of the role of predictive coding, but no role is convincingly demonstrated as there is no attempt to distinguish this from differences in low-level features between stimuli.

    1. The hypothesis that office and city videos produce different response amplitudes in early visual cortex is consistent with the difference in their perceived duration, but these videos seem likely to differ in many low-level properties. Most obviously, they are likely to differ in temporal frequency and in the duration of the events they contain. The manuscript proposes that the difference in their responses reflects surprise or prediction error, but this proposal is not tested. Recent studies using entirely predictable stimuli that differ in event frequency and duration (Stigliani, Jeska, & Grill-Spector, 2017, PNAS) show that these low-level features strongly affect the responses of early visual areas.

    2. Similarly, a difference between network states on consecutive frames also seems likely to reflect the frequency of changes, regardless of whether these are regular and predictable or irregular and unpredictable. Again, no effort is made to distinguish between event frequency and predictability.

    3. In the conclusion, the main conceptual contribution of the manuscript is described as follows: "we have taken a model-based approach to describe how sensory information arriving in primary sensory areas is transformed into subjective time." The abstract contains a similar statement: "providing a computational basis for an end-to-end account of time perception". I appreciate the attempt to introduce a quantitative model-based approach, but the network model proposed doesn't even attempt to be biologically plausible. As such, it cannot "describe how sensory information arriving in primary sensory areas is transformed into subjective time". Specifically, the measure of Euclidean distance between network states in a feedforward network that analyses each frame independently is clearly not biologically plausible. Neural systems don't make such calculations. Instead, this represents a mathematical abstraction of more complex recurrent processes that are not included in the model. As a result, this conclusion (and similar statements elsewhere) seems to overstate the conceptual advance. To me, the results instead confirm that subjective time, sensory cortex activity, and deep network activity are all affected by sensory stimulus content.

    4. The framework linking the fMRI response of early visual cortex to the neural network simulations rests primarily on a larger response of both to busy city scenes than to office scenes. In both data sets this difference is unsurprising, having been shown in previous comparisons of various quickly and slowly changing stimuli (for fMRI) and of these exact scene types (for neural networks). And because the fMRI response amplitude difference is based on a binary comparison, any number of explanations could be given for why the two responses change in the same direction. An unexpected and quantitative shared effect would convincingly link the two effects seen, but an expected and qualitative change in the same direction does not.

    5. The analysis that looks for correlated differences in fMRI responses and subjective duration perception within a scene type (from line 300) provides more convincing evidence that sensory cortex responses are linked to subjective duration. However, this analysis does not link fMRI responses to deep network responses, and again, changes in both fMRI responses and subjective duration are already known to reflect low-level features like visual motion and event frequency. So it is unclear whether differences in video properties (within the same class) underlie the correlated differences between fMRI responses and subjective duration, and whether the deep network models predict such effects.

    6. The word 'time' is used throughout the manuscript in a very general way. Time is a broad concept, with many different aspects and scales, from sub-second to circadian to seasonal. This study's scope does not include most of these aspects and scales, so the use of the general term 'time' overstates the breadth of the findings. Here it is used to mean 'duration in the tens of seconds'. Please specify more precisely what you mean.

  4. ## Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 4 of the manuscript.

    ### Summary:

    The reviewers appreciated the approach of your study, both in terms of the theoretical framework and in terms of the methodology. However, they did not find that the presented results provide convincing evidence for neural substrates of perceived event duration. They noted that there are several alternative explanations for the effects observed, reflecting uncontrolled differences between events that are known to drive visual cortex activity (e.g., in low-level features, rate of change, or eye movements).