Human Strategy Adaptation in Reinforcement Learning Resembles Policy Gradient Ascent

Abstract

A hallmark of intelligence is the ability to adapt behavior to changing environments, which requires adapting one’s own learning strategies. This phenomenon is known as learning to learn or meta-learning. Although well established in humans and animals, a computational framework that characterizes how biological agents adapt their learning strategies through experience remains elusive. Here we posit that humans update their learning strategies online through a gradient-based meta-learning process, effectively optimizing how they learn. However, estimating how these strategies evolve over time remains a significant challenge since traditional cognitive models, such as reinforcement learning (RL), typically assume that agents use static strategies. To address this, we introduce DynamicRL, a method that leverages neural networks to estimate the evolution of an individual’s RL strategy by tracking cognitive parameters such as learning rates over time. Across four human bandit tasks, DynamicRL consistently outperforms traditional RL models with fixed parameters in fitting behavior, confirming that humans adapt their RL strategies over time. RL parameters estimated by DynamicRL reveal trajectories that systematically increase the expected reward of the RL strategy. The parameter updates at each step resemble policy gradient ascent, and their optimality correlates with the strength of the gradient signal. Moreover, these RL parameters evolve more slowly than decision variables, supporting the hierarchical relationship between strategy learning and value learning. Our work provides a computational framework that expands the hypothesis space from understanding strategies to understanding strategy adaptation, bridging adaptive behavior in biological and artificial intelligence through meta-learning.
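To make the core idea concrete, the sketch below illustrates gradient ascent on RL strategy parameters in a two-armed bandit: a softmax Q-learner's learning rate (alpha) and inverse temperature (beta) are themselves updated to climb the expected reward. This is a minimal illustration, not the paper's DynamicRL; in particular, it assumes a finite-difference gradient estimate in place of the neural-network estimator described in the abstract, and all task settings, parameter values, and function names are illustrative assumptions.

```python
# Illustrative sketch (not the paper's DynamicRL): meta-level gradient ascent
# on RL strategy parameters (learning rate alpha, inverse temperature beta)
# in a two-armed Bernoulli bandit. The gradient of expected reward is
# estimated by finite differences; all hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
P_REWARD = np.array([0.3, 0.7])  # assumed reward probabilities per arm


def run_episode(alpha, beta, n_trials=200):
    """Average reward of a softmax Q-learner with fixed (alpha, beta)."""
    q = np.zeros(2)
    total = 0.0
    for _ in range(n_trials):
        logits = beta * q
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = float(rng.random() < P_REWARD[a])
        q[a] += alpha * (r - q[a])  # value learning: the fast timescale
        total += r
    return total / n_trials


def meta_gradient(theta, eps=0.05, n_reps=20):
    """Finite-difference estimate of d(expected reward)/d(theta)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        hi, lo = theta.copy(), theta.copy()
        hi[i] += eps
        lo[i] -= eps
        j_hi = np.mean([run_episode(*hi) for _ in range(n_reps)])
        j_lo = np.mean([run_episode(*lo) for _ in range(n_reps)])
        grad[i] = (j_hi - j_lo) / (2 * eps)
    return grad


theta = np.array([0.05, 1.0])  # initial (alpha, beta)
meta_lr = 0.5                  # strategy learning: the slow timescale
for step in range(30):
    theta += meta_lr * meta_gradient(theta)
    theta[0] = np.clip(theta[0], 1e-3, 1.0)  # keep alpha a valid learning rate
    theta[1] = max(theta[1], 1e-3)           # keep beta positive
    print(f"step {step:2d}  alpha={theta[0]:.3f}  beta={theta[1]:.3f}")
```

Note the two timescales in the sketch: Q-values update every trial inside an episode, while (alpha, beta) update only once per meta-step, mirroring the abstract's claim that strategy parameters evolve more slowly than decision variables.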
