Gated recurrence enables simple and accurate sequence prediction in stochastic, changing, and structured environments

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    There has been a longstanding interest in developing normative models of how humans handle latent information in stochastic and volatile environments. This study examines recurrent neural network models trained on sequence-prediction tasks analogous to those used in human cognitive studies. The results demonstrate that such models lead to highly accurate predictions for challenging sequences in which the statistics are non-stationary and change at random times. This is a novel and remarkable result that opens up new avenues for cognitive modelling.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)

Abstract

From decision making to perception to language, predicting what is coming next is crucial. It is also challenging in stochastic, changing, and structured environments; yet the brain makes accurate predictions in many situations. What computational architecture could enable this feat? Bayesian inference makes optimal predictions but is prohibitively difficult to compute. Here, we show that a specific recurrent neural network architecture enables simple and accurate solutions in several environments. This architecture relies on three mechanisms: gating, lateral connections, and recurrent weight training. Like the optimal solution and the human brain, such networks develop internal representations of their changing environment (including estimates of the environment’s latent variables and the precision of these estimates), leverage multiple levels of latent structure, and adapt their effective learning rate to changes without changing their connection weights. Being ubiquitous in the brain, gated recurrence could therefore serve as a generic building block to predict in real-life environments.
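
The reviews below identify the architecture as a network of gated recurrent units (GRUs). As a rough illustration of what "gating" and "lateral connections" mean in such a network, here is a minimal sketch of one standard GRU update step in NumPy. It is not the authors' implementation: the parameter names, the random initialization, and the single binary input are assumptions made for illustration, and the weights are untrained. The choice of 11 hidden units follows the number quoted in the reviews.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(n_in, n_hidden, rng):
    """Random, untrained parameters (hypothetical names, illustration only)."""
    def weights(rows, cols):
        return rng.normal(0.0, 1.0 / np.sqrt(cols), size=(rows, cols))

    params = {}
    for name in ("z", "r", "h"):
        params["W_" + name] = weights(n_hidden, n_in)      # input weights
        # Recurrent weights: off-diagonal entries connect different units
        # (one reading of "lateral connections").
        params["U_" + name] = weights(n_hidden, n_hidden)
        params["b_" + name] = np.zeros(n_hidden)
    return params


def gru_step(params, h_prev, x):
    """One step of a standard GRU: two gates modulate the state update."""
    # Update gate: how much of each unit's state is overwritten.
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the previous state feeds the candidate state.
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])
    h_cand = np.tanh(
        params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"]
    )
    # Convex combination of the previous state and the candidate state.
    return (1.0 - z) * h_prev + z * h_cand


rng = np.random.default_rng(0)
params = init_gru(n_in=1, n_hidden=11, rng=rng)
h = np.zeros(11)
for x_t in [0.0, 1.0, 1.0, 0.0]:  # a short binary input sequence
    h = gru_step(params, h, np.array([x_t]))
```

In this sketch, removing gating would correspond to fixing `z` and `r` to constants, and removing lateral connections to making each `U_*` matrix diagonal. Whether this matches the reduced networks studied in the article is an assumption.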

Article activity feed

  2. Reviewer #1 (Public Review):

    This study examines recurrent neural network models trained on sequence-prediction tasks analogous to those used in human cognitive studies. The results demonstrate that such models lead to highly accurate predictions for challenging sequences in which the statistics are non-stationary and change at random times. This is a novel and remarkable result that opens up new avenues for cognitive modelling.

    Strengths:

    - Trained artificial networks are probed using tasks and analysis tools identical to those used to study human cognition.

    - The results show that trained recurrent networks exhibit effective learning rates that adapt to sudden shifts in statistics, in a manner similar to optimal Bayesian agents.

    - Thorough analyses demonstrate that the hidden states of the networks encode emergent latent variables such as the precision of the prediction, or current "context".

    - Very clear writing style.

    Weaknesses:

    - The manuscript insists that gating between neural units is a necessary component. A more conservative conclusion is that gating facilitates training. The analysis in the manuscript cannot exclude the possibility that networks without explicit gating could reach optimal performance with a different training protocol.

    - The text emphasizes that a very small number of neural units is sufficient (11 in most figures). It is not clear why this is a relevant limit when comparing with biological networks.

    Despite these weaknesses, the results strongly support the conclusions. Most importantly, this study opens up the possibility of using a new class of models for cognitive modelling.
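
Reviewer #1 refers to "effective learning rates" that adapt after changes in the sequence statistics. The review does not give a formula, so the sketch below uses one common operationalization as an assumption: the ratio between the change in the prediction and the prediction error that caused it. The function name and the sanity check with a fixed-rate delta rule are illustrative and not taken from the article.

```python
import numpy as np


def effective_learning_rate(predictions, observations):
    """Ratio of prediction update to prediction error at each time step.

    predictions[t] is the prediction for observations[t], made before
    observing it, so predictions has one more element than observations.
    """
    p = np.asarray(predictions, dtype=float)
    x = np.asarray(observations, dtype=float)
    error = x - p[:-1]          # prediction error at each step
    update = p[1:] - p[:-1]     # how much the prediction moved afterwards
    rate = np.full(error.shape, np.nan)  # undefined when the error is zero
    nonzero = error != 0
    rate[nonzero] = update[nonzero] / error[nonzero]
    return rate


# Sanity check: a delta rule with a fixed learning rate should be recovered.
alpha = 0.2
obs = np.array([1, 1, 0, 1, 0, 0, 1])
pred = [0.5]
for x_t in obs:
    pred.append(pred[-1] + alpha * (x_t - pred[-1]))
rates = effective_learning_rate(pred, obs)
```

For a fixed-rate delta rule this measure is constant by construction. An agent with an adaptive learning rate would instead show a transient increase after a change in the underlying statistics.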

  3. Reviewer #2 (Public Review):

    This manuscript examines the suitability of a specific class of artificial neural network models with 'gated recurrent units' (GRUs) for solving the general problem of prediction in a stochastic, volatile environment. The authors frame the problem in the context of predicting the next sample in a sequence of 0's and 1's generated from increasingly complex generative models. They compare the performance of the GRU network to that of an optimal observer as well as various heuristic solutions and reduced variants of the GRU. Results indicate that full GRU networks are closer to optimal than heuristic models and that all their key computational building blocks (gating, lateral connections, and recurrent weight tuning) contribute to their near-optimal performance.

    There has been a longstanding interest in developing normative models of how humans handle latent information in stochastic and volatile environments. The paper is important and timely as it leverages recent advances in machine learning to tackle this question using a different approach that involves task-optimized neural network models. This approach has proven quite fruitful in several behavioral domains including perception, motor behavior, timing, and simple decision-making. The paper builds on this growing body of work to generate hypotheses about the computational building blocks that may underlie quasi-optimal behavior in the presence of stochasticity and volatility.

    Strengths:

    The paper is very well written and accessible despite being computationally quite sophisticated.

    The approach is systematic. It uses a well-defined task (binary sequence prediction) that can be adapted to increasingly complex latent structures without changing the observables.

    The analyses are comprehensive. The behavior of the network models is compared to various optimal and suboptimal observer models as well as reduced versions of the network models. This overall approach is maintained throughout the paper so that one can appreciate the key takeaway points.

    The paper aims to offer a deeper understanding and not just an engineering solution. Many papers of this kind do not take the effort to 'open' the trained networks and provide an algorithmic understanding. This paper does. More impressively, the paper goes beyond simple correlative measures of the network states; it uses an innovative perturbation technique to verify that the inferred latent representations in the network are indeed functional. I really liked this perturbation analysis. It is highly valuable and broadly applicable.

    By comparing the full GRU with its reduced versions, the paper makes a serious attempt to guard against criticisms that the success of the full GRU is due to the degrees of freedom it offers.

    Weaknesses:

    The most notable weakness of the paper is that it is not clear whether its aim is to develop a neural model that is close to optimal or a neural model that explains how biological brains handle stochasticity and volatility. There is no serious and quantitative comparison to behavior or neural data recorded in humans or animal models. All the comparisons are with other algorithms and reduced GRU networks. One can appreciate these comparisons if the goal is to show that a full GRU network is close to optimal (which, as they show, in many cases it is). But do humans exhibit a similar level of optimality? What I was hoping to see was some sort of analysis that would show that the types of errors the model makes are in some counterintuitive (or even intuitive) way like the types of errors humans make. In some of the papers where certain heuristics were proposed, the entire goal was to explain characteristic sub-optimalities in human behavior. Without such comparisons, I think the results might most effectively drive progress in purely computational circles.
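
To make the task described by Reviewer #2 concrete, the sketch below generates a binary sequence whose underlying probability changes at random times, then compares two predictors: a grid-based Bayesian observer and a fixed-rate delta rule. Everything specific here is an assumption made for illustration and is not taken from the article: the generative model (the probability is resampled uniformly at each change point), the change probability, the grid resolution, the learning rate, and all function names.

```python
import numpy as np


def generate_sequence(n_steps, p_change, rng):
    """Bernoulli sequence whose hidden probability is resampled at random times."""
    p = rng.uniform()
    probs = np.empty(n_steps)
    obs = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        if rng.uniform() < p_change:   # change point: draw a new probability
            p = rng.uniform()
        probs[t] = p
        obs[t] = int(rng.uniform() < p)
    return probs, obs


def grid_bayes_predictions(obs, p_change, n_grid=101):
    """Predict each observation by Bayesian filtering on a grid of probabilities."""
    grid = (np.arange(n_grid) + 0.5) / n_grid   # midpoints, never exactly 0 or 1
    prior = np.full(n_grid, 1.0 / n_grid)
    belief = prior.copy()
    preds = np.empty(len(obs))
    for t, x in enumerate(obs):
        # Account for a possible change point before the next observation.
        belief = (1.0 - p_change) * belief + p_change * prior
        preds[t] = belief @ grid                 # predicted P(x_t = 1)
        likelihood = grid if x == 1 else 1.0 - grid
        belief = belief * likelihood
        belief /= belief.sum()
    return preds


def delta_rule_predictions(obs, learning_rate):
    """Heuristic predictor with a fixed learning rate."""
    preds = np.empty(len(obs))
    p = 0.5
    for t, x in enumerate(obs):
        preds[t] = p
        p += learning_rate * (x - p)
    return preds


def log_loss(preds, obs):
    """Mean negative log-likelihood of the observations under the predictions."""
    return -np.mean(obs * np.log(preds) + (1 - obs) * np.log(1 - preds))


rng = np.random.default_rng(0)
probs, obs = generate_sequence(n_steps=1000, p_change=0.05, rng=rng)
bayes = grid_bayes_predictions(obs, p_change=0.05)
delta = delta_rule_predictions(obs, learning_rate=0.1)
print("log loss, grid Bayes:", log_loss(bayes, obs))
print("log loss, delta rule:", log_loss(delta, obs))
```

The grid observer is only approximately optimal, and only for this assumed generative model. It is meant to show the kind of benchmark against which a trained network or a heuristic can be scored, not to reproduce the article's optimal solution.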

  4. Reviewer #3 (Public Review):

    In this paper, the authors aim to understand the general computational principles that the brain uses to predict stochastic environments governed by underlying latent variables. To this end, they analyze a particular class of artificial neural networks (ANNs) trained to predict stochastic environments. The authors compare the network performance with the optimal solution derived from Bayesian analysis as well as with several heuristic algorithms. Importantly, the authors also perform several 'perturbation experiments' in which they remove specific elements of the network and study its performance. In particular, they study the role of the gating variables, the recurrent connections, and the trained weights. By doing so, the authors can causally assess the role of these three mechanisms in the network's computations, and they establish causal relationships between these three network elements and several important aspects of the network's flexibility.

    This paper has several strengths. First, the authors systematically define stochastic environments based on graphical models. Second, the authors compare the network performance with several alternatives. Third, the authors perform thorough perturbation and decoding analyses that connect particular elements of the network to important characteristics of its performance, such as changes in learning rate or adjustment to changes in baseline probability.

    However, even though the setup of the problem is interesting and the analysis of the ANN seems correct, I see three major weaknesses in this work that undermine the support for the importance of gating and the relationship with neuroscience:

    1. The authors study a particular ANN, the GRU network, which includes both the dynamics of the unit activities and the dynamics of two 'gate variables'. It is unclear to me how much their conclusions, constrained by the choice of this particular network, teach us about the brain. In particular, the relationship of these gate variables to actual synapses, neurons, or populations of neurons is at best speculative at this point. Additionally, the claim that 'gating is necessary' might be a consequence of using a GRU network in this particular task. Training RNNs has been a productive avenue for understanding neural computations in recent years, and in many studies of this kind the networks are constrained by or contrasted with experimental data (Mante and Sussillo et al, 2013, Rajan et al, 2016 or Finkelstein and Fontolan et al, 2021 as some examples); the present study is neither constrained by nor contrasted with data. In most studies, comparison with population activity is relatively straightforward because the trained network consists of rate units, which correspond to the 'mean field' description of an ensemble of neurons. In contrast, the present study uses a network whose relationship with the biophysical elements of the brain is unknown, hindering the interpretation of the results. Additionally, other classes of machine learning networks, such as LSTMs, might also be able to perform the tasks studied. Indeed, LSTMs can perform computations similar to the ones in this study, as shown in Wang and Kurth-Nelson et al, 2019. Different machine learning networks might use different computational strategies, which also undermines the generality claimed in the present paper for understanding computations in the brain.

    2. Although the authors analyze the network by performing several statistical analyses and numerical experiments, they did not try to understand the dynamics using standard mathematical tools from dynamical systems and statistical physics that have been used to study trained neural networks and understand their computational mechanisms (see Sussillo and Barak, 2013 or Dubreuil and Valente et al, bioRxiv as examples). Since their network is 11-dimensional, it might be possible to 'open the box' and understand quantitatively the network dynamics and computations after training. In particular, the authors did not try to understand, at least numerically, the geometry of the neural representations of latent variables in the network dynamics, how it is learned, and how it depends on the environment. Additionally, standard dynamical systems analysis might make it possible to understand the role of gating in the network computations.

    3. Several aspects of the writing of their analysis should be clarified. In particular, key points such as the optimal solution (which is used as a benchmark for all the other algorithms/networks) or the definition of precision are not fully clear.