Neuroscout, a unified platform for generalizable and reproducible fMRI research

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This paper introduces Neuroscout, a new web-based platform for the analysis of fMRI data with a particular focus on naturalistic stimuli. It describes a new tool that will potentially be of great use to the neuroimaging community; the tool's development is already quite mature, and a number of datasets are ready to use online. Neuroscout as a tool will be of particular interest to neuroimagers and cognitive neuroscientists, but the conclusions drawn using the tool should be of interest to neuroscientists more broadly.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1, Reviewer #2 and Reviewer #3 agreed to share their name with the authors.)

Abstract

Functional magnetic resonance imaging (fMRI) has revolutionized cognitive neuroscience, but methodological barriers limit the generalizability of findings from the lab to the real world. Here, we present Neuroscout, an end-to-end platform for analysis of naturalistic fMRI data designed to facilitate the adoption of robust and generalizable research practices. Neuroscout leverages state-of-the-art machine learning models to automatically annotate stimuli from dozens of fMRI studies using naturalistic stimuli—such as movies and narratives—allowing researchers to easily test neuroscientific hypotheses across multiple ecologically-valid datasets. In addition, Neuroscout builds on a robust ecosystem of open tools and standards to provide an easy-to-use analysis builder and a fully automated execution engine that reduce the burden of reproducible research. Through a series of meta-analytic case studies, we validate the automatic feature extraction approach and demonstrate its potential to support more robust fMRI research. Owing to its ease of use and a high degree of automation, Neuroscout makes it possible to overcome modeling challenges commonly arising in naturalistic analysis and to easily scale analyses within and across datasets, democratizing generalizable fMRI research.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    This manuscript by de la Vega and colleagues describes Neuroscout, a powerful and easy-to-use online software platform for analyzing data from naturalistic fMRI studies using forward models of stimulus features. Overall, the paper is interesting, clearly written, and describes a tool that will no doubt be of great use to the neuroimaging community. I have just a few suggestions that, if addressed, I believe would strengthen the paper.

    Major comments

    1. How does Neuroscout handle collinearity among predictors for a given stimulus? Does it check for this and/or throw any warnings? In media stimuli that have been adopted for neuroimaging experiments, low-level audiovisual features are not infrequently correlated with mid-level features such as the presence of faces on screen (see Grall & Finn, 2022 for an example involving the Human Connectome Project video clips). How to disentangle correlated features is a frequent concern among researchers working with naturalistic data.

    We agree with the reviewer that collinearity between predictors is one of the biggest challenges for naturalistic data analysis. However, absent consensus on how best to model these data, we consider making strong recommendations to be beyond the scope of the present report. Instead, our goal was to design an agnostic platform that enables users to thoughtfully design statistical models for their particular goals. Papers such as Grall & Finn (2022) will be critical in advancing the debate on how best to analyze and interpret such data.

    We explicitly address this challenge in a new paragraph in the Discussion under “Challenges and future directions”:

    “A major challenge in the analysis of naturalistic stimuli is the high degree of collinearity between features, as the interpretation of individual features is dependent on co-occurring features. In many cases, controlling for confounding variables is critical for the interpretation of the primary feature— as is evident in our investigation of the relationship between FFA and face perception. However, it can also be argued that in dynamic narrative driven media (i.e. films and movies), the so-called confounds themselves encode information of interest that cannot or should not be cleanly regressed out (Grall & Finn, 2022).[…] Absent a consensus on how to model naturalistic data, we designed Neuroscout to be agnostic to the goals of the user and empower them to construct sensibly designed models through comprehensive model reports. An ongoing goal of the platform—especially as the number of features continues to increase—will be to expand the visualizations and quality control reports to enable users to better understand the predictors and their relationship. For instance, we are developing an interactive visualization of the covariance between all features in Neuroscout that may help users discover relationships between a predictor of interest and potential confounds.” (pg. 11)

    Note that we shortened the second paragraph of the Discussion by two sentences, as it had touched on this subject, which is now better addressed separately.

    In addition, we now highlight the covariance structure visualization in the Results section:

    “At this point, users can inspect the model through quality-control reports and interactive visualizations of the design matrix and predictor covariance matrix, iteratively refining models if necessary.” (pg. 3)
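
    To make the collinearity check concrete, the short sketch below shows how a user might screen extracted predictors for problematic correlations offline before building a model. It is an illustrative example only: the file name and column layout are hypothetical, not part of Neuroscout itself.

    ```python
    import pandas as pd

    # Hypothetical table of predictor time courses, one column per predictor,
    # sampled on a common time grid (e.g., exported from Neuroscout's reports).
    design = pd.read_csv("predictors.csv")

    # Pairwise Pearson correlations; large off-diagonal values flag collinearity
    # that will inflate the variance of GLM parameter estimates.
    corr = design.corr()

    # Print predictor pairs above a screening threshold (upper triangle only).
    threshold = 0.7
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                print(f"{cols[i]} ~ {cols[j]}: r = {r:.2f}")
    ```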

    2. On a related note, do the authors and/or software have opinions about whether it is more appropriate to run several regressions each with a single predictor of interest or to combine all predictors of interest into a single regression? (Or potentially a third, more sophisticated solution involving variance partitioning or another technique to [attempt to] isolate variance attributable to each unique predictor?) Does the answer to this depend on the degree of collinearity among the predictors? Some discussion of this would be helpful, as it is a frequent issue encountered when analyzing naturalistic data.

    This is a very sensitive methodological point, but one for which it is hard to find an unequivocal answer in the literature. While on the one hand it can be deceptive to model a single feature in isolation (as illustrated by our face perception analyses), more complex models pose different challenges in terms of robust parameter estimation and variance attribution. Resolving these challenges goes beyond the scope of our work; our goal is ultimately to provide a flexible tool that enables these types of investigations, while asking users to take responsibility for, and motivate, the methodological choices they make using the platform. We touch on Neuroscout’s agnostic philosophy on this issue under “Challenges and future directions” (pg. 11; quoted above).

    However, we also agree that part of the solution to this problem will be methodological. This is particularly true for modeling deep-learning-based embeddings, which can have hundreds of features in a single model. We are currently working on expanding beyond traditional GLMs in Neuroscout, opening the door to more sophisticated variance partitioning techniques and more robust parameter estimation in complex models. We highlight current and future efforts to expand Neuroscout’s statistical models in the following paragraph:

    “However, as the number of features continues to grow, a critical future direction for Neuroscout will be to implement statistical models which are optimized to estimate a large number of covarying targets. Of note are regularized encoding models, such as the banded-ridge regression as implemented by the Himalaya package. These models have the additional advantage of implementing feature-space selection and variance partitioning methods, which can deal with the difficult problem of model selection in highly complex feature spaces such as naturalistic stimuli. Such models are particularly useful for modeling high-dimensional embeddings, such as those produced by deep learning models. Many such extractors are already implemented in pliers and we have begun to extract and analyze these data in a prototype workflow that will soon be made widely available.” (pg. 11)
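
    To illustrate the kind of model referenced above, here is a minimal NumPy sketch of banded ridge regression, i.e., ridge with a separate penalty per feature space. It is a conceptual toy, not the Himalaya implementation (which additionally cross-validates the per-band penalties and scales to many voxels); all names and dimensions are illustrative.

    ```python
    import numpy as np

    def banded_ridge(X, y, bands, lambdas):
        """Closed-form ridge with a separate penalty per feature band.

        X       : (n_samples, n_features) design matrix
        y       : (n_samples,) response, e.g., one voxel's BOLD time course
        bands   : list of index arrays, one per feature space
        lambdas : one regularization strength per band
        """
        penalty = np.zeros(X.shape[1])
        for idx, lam in zip(bands, lambdas):
            penalty[idx] = lam
        # w = (X'X + D)^(-1) X'y, with D a diagonal matrix of band penalties
        return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

    # Two feature spaces: a 5-dim low-level band and a 20-dim embedding band,
    # with the high-dimensional embedding shrunk more aggressively.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 25))
    y = rng.standard_normal(200)
    w = banded_ridge(X, y,
                     bands=[np.arange(5), np.arange(5, 25)],
                     lambdas=[1.0, 50.0])
    ```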

    3. What the authors refer to as "high-level features" - i.e., visual categories such as buildings, faces, and tools - I would argue are better described as "mid-level features", reserving the term "high-level" for features that are present only in continuous, engaging, narrative or narrative-like stimuli. Examples: emotional tone or valence, suspense, schema for real-world situations, other operationalizations of a narrative arc, etc. After all, as the authors point out, one doesn't need naturalistic paradigms to study brain responses to visual categories or single-word properties. Much of the work that has been done so far with forward models of naturalistic stimuli has been largely confirmatory (e.g., places/scenes still activate PPA even during a rich film as opposed to a serial visual presentation paradigm). This is a good first step, but the promise of naturalistic paradigms is ultimately to go beyond these isolated features toward more holistic models of cognitive and affective processes in context. One challenge is that extracting true high-level features is not easily automated, although the ability to crowdsource human ratings using online data collection has made it feasible to create manual annotations. However, there are still technical challenges associated with collecting continuous-response measurement (CRM) data during a relatively long stimulus from a large number of individuals online. Does Neuroscout have any plans to develop support for collecting CRM data, perhaps through integration with Amazon MTurk and/or Prolific? Just a thought and I am sure there are a number of features under consideration for future development, but it would be fabulous if users could quickly and easily collect CRM data for high-level features on a stimulus that has been uploaded to Neuroscout (and share these data with other end users).

    The reviewer makes a very good point that many so-called “high-level” features are better described as “mid-level”. We have therefore changed our use of “high-level” to “mid-level perceptual features” throughout the manuscript.

    “Currently available features include hundreds of predictors coding for both low-level (e.g., brightness, loudness) and mid-level (e.g., object recognition indicators) properties of audiovisual stimuli…” (pg. 3)

    That said, we do believe that as machine learning (and in particular deep learning) models evolve, it will become more feasible to extract higher-level features automatically. This has already been shown with transformer language models, which are able to extract higher-level semantic information from natural text. To this end, we designed our underlying feature extraction platform, pliers, to be easily extensible, ensuring the continued growth of the platform as algorithms evolve. We highlight this in the Results section ‘Automated annotation of stimuli’:

    “The set of available predictors can be easily expanded through community-driven implementation of new pliers extractors, as well as public repositories of deep learning models, such as HuggingFace and TensorFlowHub. We expect that as machine learning models continue to evolve, it will be possible to automatically extract higher-level features from naturalistic stimuli.” (pg. 3)
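
    As a rough illustration of what community-driven extension might look like, the sketch below defines a trivial custom extractor in the style pliers uses (subclass Extractor, declare an input type, implement _extract). The exact base-class and ExtractorResult signatures are assumptions to check against the pliers developer documentation.

    ```python
    from pliers.extractors.base import Extractor, ExtractorResult
    from pliers.stimuli import TextStim

    class TextLengthExtractor(Extractor):
        """Toy extractor returning the character length of a text stimulus."""

        _input_type = TextStim  # restricts this extractor to text stimuli

        def _extract(self, stim):
            # One row of data, one named feature; signature per pliers
            # convention (an assumption -- verify against the pliers docs).
            return ExtractorResult([[len(stim.text)]], stim, self,
                                   features=['text_length'])

    # Hypothetical usage:
    result = TextLengthExtractor().transform(TextStim(text='naturalistic'))
    print(result.to_df())
    ```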

    We also highlight the extensibility of pliers to increasingly powerful deep learning models in the Discussion by revising this sentence:

    “As a result, we have designed Neuroscout and its underlying feature extraction framework pliers to facilitate community-led expansion to novel extractors— made possible by the rapid increase in public repositories of pre-trained deep learning models such as HuggingFace and TensorFlow Hub.” (pg. 10)

    As to the potential extension of Neuroscout for easily collecting crowdsourced stimulus annotations, we fully agree that this would be very useful. In fact, this feature was part of the original plan for Neuroscout, but fell out of scope as other features took priority. Although we are unsure whether this extension is a short-term priority for the Neuroscout team (as it would likely take substantial effort to develop a general-purpose extension), the ability to submit user-generated features to the Neuroscout API should make it possible to design a modular extension to Neuroscout to collect such features.

    We mention this possibility briefly in the future directions section:

    “Other important expansions include facilitating analysis execution by directly integrating with cloud-based neuroscience analysis platforms (such as Brainlife.io) and facilitating the collection of higher-level stimulus features by integrating with crowdsourcing platforms such as MechanicalTurk or Prolific.” (pg. 11)

    4. Can the authors talk a bit more about the choice to demean and rescale certain predictors, namely the word-level features for speech analysis? This makes sense as a default step, but I wonder if there are situations in which the authors would not recommend normalizing features prior to computing the GLM (e.g., if sign is meaningful, if the distribution of values is highly skewed, or if the units reflect absolute real-world measurements, etc.). Does Neuroscout do any normalization automatically under the hood for features computed using the software itself and/or features that have been calculated offline and uploaded by the user?

    In keeping with Neuroscout’s philosophy of being a general-purpose platform, we have not performed any standardization of features. Instead, users can choose to modify raw predictor values by applying transformations on a model-by-model basis. Transformations currently available through the web interface include: scale, orthogonalize, and threshold. Note that a wider range of transformations is available in the BIDS Stats Models specification, but we are hesitant to advertise these yet, as they are more difficult to use.

    We revised our description of transformations in the Results section to clarify that these transformations are model-specific:

    “Raw predictor values can be modified by applying model-specific transformations such as scaling, thresholding, orthogonalization, and hemodynamic convolution.” (pg. 3)
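
    For readers who want to see what these operations amount to numerically, below is a plain-NumPy sketch of the three web-interface transformations. These are independent re-implementations for illustration, not Neuroscout's internal code (which follows the BIDS Stats Models specification).

    ```python
    import numpy as np

    def scale(x):
        """Demean and rescale a predictor to unit variance (z-score)."""
        return (x - x.mean()) / x.std()

    def threshold(x, cutoff=0.0):
        """Binarize a predictor: 1 where the value exceeds the cutoff, else 0."""
        return (x > cutoff).astype(float)

    def orthogonalize(x, confound):
        """Residualize x with respect to a confound via least squares."""
        Z = np.column_stack([np.ones_like(confound), confound])
        beta, *_ = np.linalg.lstsq(Z, x, rcond=None)
        return x - Z @ beta
    ```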

    We also clarify in the Methods section that variables are ingested without any in-place modifications. The only exception is that we down-sample highly dense variables (such as those from auditory files, which can produce thousands of values per second) to save disk space:

    “Feature values are ingested directly with no in-place modifications, with the exception of down-sampling of temporally dense variables to 3 Hz to reduce storage on the server.” (pg.

    With respect to the word frequency analysis, the primary reason we scaled variables was to facilitate imputing missing values for words not found in the look-up dictionary. By scaling the variable, we were able to replace missing values with zero, effectively assigning them the average word frequency value. We clarified this strategy in the Methods section:

    “In all analyses, this variable was demeaned and rescaled prior to HRF convolution. For a small percentage of words not found in the dictionary, a value of zero was applied after rescaling, effectively imputing the value as the mean word frequency.” (pg. 17)
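
    In code, the strategy described in that quote reduces to a few lines; the values below are made up for illustration.

    ```python
    import numpy as np

    # Illustrative log word-frequency values; NaN marks words missing from
    # the look-up dictionary.
    freq = np.array([3.2, 5.1, np.nan, 4.4, 2.9, np.nan])

    # Standardize using only the observed values...
    observed = ~np.isnan(freq)
    z = np.empty_like(freq)
    z[observed] = (freq[observed] - freq[observed].mean()) / freq[observed].std()

    # ...then set missing entries to 0 -- the post-scaling mean -- so that
    # out-of-dictionary words are effectively imputed at the average frequency.
    z[~observed] = 0.0
    ```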

    On a more general note, when interpreting a single variable with a dummy-coded contrast (i.e., 1 for the predictor of interest and 0 for all other variables), it is not necessary to normalize features prior to modeling, as fMRI t-stat maps are scale-invariant (although the parameter estimates will be affected).
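
    The scale-invariance claim is easy to verify numerically: rescaling a predictor rescales both its parameter estimate and the estimate's standard error by the same factor, leaving the t-statistic unchanged. A minimal simulated check:

    ```python
    import numpy as np

    def t_stat(x, y):
        """t-statistic for the slope in a simple OLS regression of y on x."""
        X = np.column_stack([np.ones_like(x), x])
        beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        dof = len(y) - X.shape[1]
        se = np.sqrt((rss[0] / dof) * np.linalg.inv(X.T @ X)[1, 1])
        return beta[1] / se

    rng = np.random.default_rng(1)
    x = rng.standard_normal(100)
    y = 0.5 * x + rng.standard_normal(100)

    # Multiplying x by 10 shrinks beta and its SE tenfold; t is identical.
    print(t_stat(x, y), t_stat(10 * x, y))
    ```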

    We added a note with our recommendations in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs//web/builder/transformations.html#scale

    Reviewer #2 (Public Review):

    The authors present a new platform for constructing and sharing fMRI analyses, specifically geared toward analyzing publicly-available naturalistic datasets using automatically-extracted features. Using a web interface, users can design their analysis and produce an executable package, which they can then execute on their local hardware. After execution, the results are automatically uploaded to NeuroVault. The paper also describes several examples of analyses that can be run using this system, showing how some classical feature-sensitive ROIs can be derived from a meta-analysis of naturalistic datasets.

    The Neuroscout system is impressive in a number of ways. It provides easy access to a number of publicly-available datasets (though I would like to see the current set of 13 datasets increase in the future), has a wide variety of machine-learning features precomputed on the video and audio features of these stimuli, and builds on top of established software for creating and sandboxing analysis workflows. Performing meta-analyses across multiple datasets is challenging both practically and statistically, but this kind of multi-dataset analysis is easy to specify using Neuroscout. It also allows researchers to easily share a reproducible version of their pipeline simply by pointing to the publicly-available analysis package hosted on Neuroscout. The platform also provides a way for researchers to upload their own custom models/predictors to extend those available by default.

    The case studies described in the paper are also quite interesting, showing that traditional functional ROIs such as PPA and VWFA can be defined without using controlled stimuli. They also show that running a contrast for faces does not produce FFA until speech (and optionally adaptation) is properly controlled for, and that VWFA shows relationships to lexical processing even for speech stimuli.

    I have some questions about the intended workflow for this tool: is Neuroscout meant to be used for analysis development in addition to sharing a final pipeline? The fact that the whole analysis is packaged into a single command is excellent for reproducibility but seems challenging to use when iterating on a project. For example, if we wanted to add another contrast to a model, it appears that this would require cloning the analysis and re-starting the process from scratch.

    An important principle of Neuroscout from the onset of the project has been to minimize undocumented researcher degrees of freedom and to maximize transparency, in order to reduce the file drawer effect, which can contribute to biased results in the published literature. As such, we require analyses to be registered and locked as the modal usage of our application. In the case of adding a contrast, it is true that this would require a user to clone the analysis. Although all of the information from the previous model would be encoded in the new model, this would require re-estimating the design matrix, which could be time-consuming. However, in our experience, users almost always add new variables to the design matrix when a study is cloned, which would in any case require re-estimating the design matrix for all runs and subjects. We believe this trade-off is worthwhile to ensure maximal reproducibility, but we also point out that since Neuroscout’s data are freely available via our API, power users can access the data directly if they need to use it in a less constrained manner.

    We believe that these important distinctions are best addressed in the newly developed Neuroscout documentation, which we now reference throughout the text (https://neuroscout.org/docs/web/browse/clone.html).

    I'm also unsure about how versioning of the input datasets and the predictors is planned to be handled by the platform; if datasets have been processed with multiple versions of fmriprep, will all of those options be available to choose from? If the software used to compute features is updated, will there be multiple versions of the features to choose from?

    The reviewer makes an astute observation regarding the versioning of input data (predictors and datasets). Currently, we have only pre-processed the imaging data once per dataset, so this has not yet been an issue. In the long run, however, we agree it will be important to give users the ability to choose which pre-processed version of the raw dataset they want to use, as there could be differing but equally valid versions. We have opened an issue in Neuroscout’s repository to track this, and plan to incorporate this ability in a future version (https://github.com/neuroscout/neuroscout/issues/1076).

    With respect to feature versions, every time a feature is re-extracted, a new predictor_id is generated, and the accompanying metadata, such as time of extraction, are tracked for that specific version. As such, if a feature is updated and re-extracted, this will not change existing analyses. By default, we have chosen to hide this from the user to keep the user experience simple. However, there is an open issue to expand the frontend’s ability to explicitly display different versions and to allow users to update older analyses with newer versions of features. Advanced users already have access to this functionality by using the Python API (PyNS) to directly access all features and create analyses with more precision.
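
    For reference, accessing features programmatically might look roughly like the sketch below. The client class name is taken from the PyNS README; the specific endpoint attributes and query parameters are assumptions that should be verified against the PyNS documentation.

    ```python
    from pyns import Neuroscout  # pip install pyns

    api = Neuroscout()  # anonymous, read-only access to the public API

    # Endpoint and parameter names below follow PyNS's REST-style pattern
    # and are assumptions -- check the PyNS docs before relying on them.
    datasets = api.datasets.get()
    predictors = api.predictors.get(dataset_id=datasets[0]['id'])
    for p in predictors[:5]:
        print(p['name'])  # each predictor version carries its own id/metadata
    ```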

    We have made a note regarding this behavior in the Neuroscout Documentation: https://neuroscout.github.io/neuroscout-docs/web/builder/predictors.html

    I also had some difficulty attempting to test out the platform, so additional user testing may be necessary to ensure that novice users are able to successfully run analyses.

    We thank the reviewer for this bug report, which allowed us to fix a previously unnoticed issue with a subset of Neuroscout datasets. We have been in contact with the reviewer to ensure that this issue was successfully addressed.

  2. Reviewer #3 (Public Review):

    Considerable progress has been made in moving to more open and reproducible fMRI research. However, an accessible end-to-end solution that meets these standards has remained elusive, in part because it requires the combination of many tools. Neuroscout aims to provide this platform. Key elements of Neuroscout include:

    - An easy-to-use web application for designing the GLM analysis of naturalistic experiments;
    - Data ingestion server with a growing repository of naturalistic fMRI studies curated and preprocessed for these analyses;
    - Feature extraction server for the generation of different regressors for analyses;
    - Tooling for implementing these analyses;
    - Automated generation of citations for these analyses.

    This platform has no clear precedents, is reasonably mature, is easy to use, and has an impressive number of curated datasets. With a focus on large naturalistic datasets, there should be a wide range of legitimately novel analyses made easily accessible with this tool, and this range will grow as Neuroscout evolves to offer more datasets and functionality. A key benefit of easy-to-use platforms of this nature is that researchers gain the ability to quickly implement analyses of phenomena and hypotheses generated from their own work, accelerating science. Documentation, data, and code accessibility are excellent. The existing analysis examples are interesting, accessible to users, and generally provide good insight into the use and value of the platform for general users.

    A weakness of many automated systems of this nature is that users rapidly find limitations in the types of analyses that can be set up. In the worst cases, this leaves the platform serving largely as a demonstration. However, here, the well-developed open-science components make this unlikely. The authors have strong records in developing widely used open software for fMRI, and the considerable number of datasets and feature-generation algorithms already integrated into the platform bodes well for uptake. Nevertheless, while described as end-to-end, the current scope for analysis design is somewhat limited, restricted largely to the specification of the GLM design. Furthermore, it is not clear if or how the platform might scale and develop an active community of data, algorithm, and code contributors. Similarly, choices of preprocessing algorithms are not extensively motivated, and how these might evolve with input from a wider community is unclear.

    Overall, this is a promising tool that builds upon a burgeoning set of open-science tools for functional neuroimaging and presents new strategies for making fMRI analysis more accessible and reproducible. While a software tool's success is ultimately measured by its uptake, Neuroscout presents a successful implementation of a concept that may give researchers, with or without extensive fMRI experience, the ability to efficiently implement novel analyses to a high standard. If Neuroscout is to be a success, it should be expected to evolve considerably from its current state. Determining how to balance the flexibility of the tool with ease of use will be an ongoing challenge.