Policy-Gradient Reinforcement Learning as a General Theory of Practice-Based Motor Skill Learning

Curation statements for this article:
  • Curated by eLife


This article has been Reviewed by the following groups


Abstract

Mastering any new skill requires extensive practice, but the computational principles underlying this learning are not well understood. Existing theories of motor learning can explain short-term adaptation to perturbations, but offer little insight into the processes that drive gradual skill improvement through practice. Here, we propose that practice-based motor skill learning can be understood as a form of reinforcement learning (RL), specifically policy-gradient RL, a simple, model-free method that is widely used in robotics and other continuous control settings. We show that models based on policy-gradient learning rules capture key properties of human skill learning across a diverse range of learning tasks that have previously lacked any computational theory. We suggest that policy-gradient RL can provide a general theoretical framework for understanding how humans hone skills through practice.

Article activity feed

  1. eLife Assessment

    This valuable computational study presents a conceptually simple and biologically plausible reinforcement-learning framework for motor learning based on policy-gradient methods. The evidence supporting the conclusions is convincing, including rigorous mathematical derivations of learning rules for the mean and variance of motor commands and simulation results for three sets of experimental data, based on three different motor learning tasks from the literature. However, there is a lack of a clear description of the specific conditions under which this framework yields unique mechanistic insights or predictive values, hence falling short of qualifying as a "general theory of motor learning". The work will be of interest to researchers in computational motor learning and motor neuroscience.

  2. Reviewer #1 (Public review):

    Summary:

    This study proposes a simple and universal reinforcement-learning framework for understanding learning in complex motor tasks. Central to the framework is a policy-gradient algorithm, in which motor commands are updated not via the gradient of the reward with respect to policy parameters, but via the gradient of the policy itself, scaled by reward information. The authors demonstrate that this scheme can reproduce learning dynamics that have been reported in previous empirical studies.

    Strengths:

    The key contribution of this study lies in its application of a policy-gradient algorithm to describe motor learning. This idea is biologically plausible, as computing the gradient of the policy with respect to its parameters is likely to be substantially easier for the nervous system than computing the gradient of the reward with respect to those parameters. The authors present three representative examples showing that this scheme can capture several aspects of motor learning dynamics. Notably, providing such a unified description across different tasks has been difficult for conventional learning frameworks, such as supervised learning.

    Weaknesses:

    While this scheme is valuable in that it captures certain aspects of learning dynamics, I find that its overall significance is limited for the following reasons.

    (1) The empirical results examined in this study primarily demonstrate that motor learning drives performance toward the spatial task goal while reducing variability. Given that the policies are expressed using Gaussian distributions and that their parameters (i.e., the mean and covariance matrix) are updated during learning, it is not surprising that the proposed scheme can reproduce these results by fitting the parameters to the data.

    (2) The proposed framework assumes that the motor learning system relies on the gradient of the policy with respect to its parameters. However, I am not convinced that this assumption is always appropriate, because in all three empirical studies examined here, explicit spatial error information is available. In such cases, the motor learning system could, in principle, compute the gradient of the error with respect to the policy parameters directly, without relying on a policy-gradient mechanism.

    (3) Most importantly, it remains unclear how the proposed scheme advances our understanding of the underlying learning mechanisms beyond providing a descriptive account of the learning process. While the framework offers a compact mathematical description of learning dynamics, it is uncertain how it can yield novel mechanistic insights or testable predictions that distinguish it from existing learning models.
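For readers weighing these critiques, the policy-gradient (REINFORCE) scheme under discussion can be sketched in a few lines. The sketch below assumes a scalar Gaussian policy, a quadratic reward, a running-average reward baseline, and illustrative learning rates; all of these specifics are stand-ins, not the paper's actual tasks or parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian policy over a scalar motor command: a ~ N(mu, sigma^2).
mu, sigma = 0.0, 1.0
target = 2.0                    # task goal (illustrative)
alpha_mu, alpha_s = 0.05, 0.01  # learning rates (illustrative)
baseline = 0.0                  # running-average reward baseline

for trial in range(2000):
    a = rng.normal(mu, sigma)   # noisy motor command
    r = -(a - target) ** 2      # reward: negative squared error
    delta = r - baseline        # reward prediction error
    baseline += 0.1 * delta     # update the running average
    # REINFORCE: step along the gradient of log pi(a), scaled by delta.
    grad_mu = (a - mu) / sigma**2
    grad_log_sigma = (a - mu) ** 2 / sigma**2 - 1.0
    mu += alpha_mu * delta * grad_mu
    # Updating log(sigma) keeps the variance positive.
    sigma = max(np.exp(np.log(sigma) + alpha_s * delta * grad_log_sigma), 0.05)

# The mean drifts toward the target while the exploratory variance shrinks.
```

Note that the update needs only the sampled action, the reward, and the gradient of the log-policy, not the gradient of the reward with respect to the policy parameters, which is the distinction the reviewer highlights.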

  3. Reviewer #2 (Public review):

    Summary:

    In this study, Haith applies, and to some extent extends, the theoretical framework of policy gradient (PG) and the derived REINFORCE learning rules to human motor learning. This approach is coherent because human motor skill learning is characterized by improvements in both accuracy and precision (the inverse of variance), and REINFORCE provides update rules for both the mean and the variance of the motor commands.

    Weaknesses:

    The mean update (equation 4) is given in task space (i.e., angle and velocity for the skittle task), but the covariance update (equation 5) is given in eigenvector space. This formulation appears to have been chosen for computational convenience, since exponentiating the eigenvalues guarantees that the variances remain positive. However, the eigenspace formulation is somewhat artificial and complex (notably the update rule for the orientation of the covariance matrix) and seems far from biological reality. A simpler alternative, suggested by the author, is to parameterize the full covariance matrix, including the cross terms, and derive equations that update the diagonal variance terms and the cross terms directly (perhaps after a transformation to keep the variances positive, if needed). This would provide a simpler and more biologically plausible update to the covariance matrix, in the spirit of the original REINFORCE algorithm. The author indicates that he has already derived the update rule for the cross terms, so this should be relatively straightforward to write up and incorporate, especially for the skittle learning rules. If the author wishes to keep the eigenspace rules in the simulations, the two sets of rules could instead be presented in the Methods or in supplementary material.

    The discussion about binary rewards and the increase in variance in previous experiments is potentially interesting. However, I do not see why variance could not increase under the policy-gradient RL update. Surely equation 5 can lead to either an increase or a decrease in variance, depending on the reward prediction error and the noise (for example, if the noise on trial i is small and the trial yields a smaller reward than the baseline, variance would increase). It would be interesting to see detailed simulation results for the skittle task showing changes in both the mean and the variance across a few consecutive trials, with both positive and negative reward prediction errors. These results could then be compared, in simulation, with those from a task with discrete binary rewards.

    Generalization is a major feature of human learning, but it is not discussed or studied here. In fact, in the de novo task simulations, there can be no generalization because the values are modeled as running averages for each target rather than derived from a critic network. Can the author discuss this point and, ideally, show generalization results in simulations, say in the skittle task?

    The application of the model to the Shmuelof et al. data is at once well justified (one of that study's main results is an improvement in precision, which policy gradient directly addresses) and somewhat "forced," since the author approximates curved movements with a series of straight-line segments. The author therefore needs to specify multiple via points, with PG updating, and a reward function that also enforces smoothness. The appeal to the Guigon 2023 model seems somewhat artificial because that model mainly applies to slow movements. Can the author comment on this and discuss alternatives that do not require via points, drawing from the robotics literature if needed (Schaal's Dynamic Movement Primitives come to mind, for example)?

    Policy gradient requires both a "noisy" pass and a "clean" pass, making it non-biological in its simplest form. Legenstein et al. (2010) and Miconi (2017) provided biologically plausible forms of the mean update. Since policy gradient is proposed as a model of human motor learning, can the author discuss the biological plausibility of the proposed learning rules and possible biologically plausible extensions?
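The reviewer's question about the sign of the variance update can be checked directly. Writing the update on log sigma for a scalar Gaussian policy (an illustrative simplification, not the paper's exact equation 5), the step is positive exactly when the reward prediction error and ((a - mu)^2 - sigma^2) share the same sign:

```python
def log_sigma_step(a, mu, sigma, delta, alpha=0.1):
    """REINFORCE-style update to log(sigma) for a Gaussian policy.

    The gradient of log N(a; mu, sigma^2) with respect to log(sigma)
    is (a - mu)^2 / sigma^2 - 1.
    """
    return alpha * delta * ((a - mu) ** 2 / sigma ** 2 - 1.0)

mu, sigma = 0.0, 1.0

# Case 1: small noise (|a - mu| < sigma) but reward BELOW baseline
# (negative delta): the step is positive, so variance increases.
step_up = log_sigma_step(a=0.2, mu=mu, sigma=sigma, delta=-1.0)
assert step_up > 0

# Case 2: small noise with reward ABOVE baseline (positive delta):
# the step is negative, so variance decreases.
step_down = log_sigma_step(a=0.2, mu=mu, sigma=sigma, delta=+1.0)
assert step_down < 0

# Case 3: large noise (|a - mu| > sigma) with below-baseline reward:
# punished exploration shrinks variance.
step_large = log_sigma_step(a=2.0, mu=mu, sigma=sigma, delta=-1.0)
assert step_large < 0
```

This bears out the reviewer's reading: a small-noise trial that is punished drives variance up, so the rule permits both increases and decreases in variance trial by trial.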