Human Strategy Adaptation in Reinforcement Learning Resembles Policy Gradient Ascent
Abstract
A hallmark of intelligence is the ability to adapt behavior to changing environments, which requires adapting one’s own learning strategies. This phenomenon is known as learning to learn in cognitive science and meta-learning in artificial intelligence. Although such adaptation is well established in humans and animals, no quantitative framework exists for characterizing the trajectories through which biological agents adapt their learning strategies. Previous computational studies either assume fixed strategies or use task-optimized neural networks, and therefore do not explain how humans refine their strategies through experience. Here we show that humans adjust their reinforcement learning strategies in a manner resembling gradient-based online optimization. We introduce DynamicRL, a framework that uses neural networks to track how participants’ learning parameters (e.g., learning rates and decision temperatures) evolve over the course of an experiment. Across four diverse bandit tasks, DynamicRL consistently outperforms traditional reinforcement learning models with fixed parameters, demonstrating that humans continuously adapt their strategies over time. The dynamically estimated parameters trace trajectories that systematically increase expected reward, with updates significantly aligned with policy gradient ascent directions. Furthermore, this learning process operates across multiple timescales: strategy parameters update more slowly than behavioral choices, and update effectiveness correlates with local gradient strength in the reward landscape. Our work offers a generalizable approach for characterizing meta-learning trajectories, bridging theories of biological and artificial intelligence by providing a quantitative method for studying how adaptive behavior is optimized through experience.
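To make the alignment claim concrete, the sketch below is a minimal illustration, not the authors’ DynamicRL implementation (which estimates per-trial parameters with neural networks). It assumes a softmax Q-learner with learning rate alpha and inverse temperature beta on a two-armed bandit, estimates the expected-reward landscape by simulation, and measures the cosine alignment between a hypothetical trajectory of strategy-parameter updates and the local reward gradient, the test described in the abstract. The function names (`expected_reward`, `reward_gradient`) and the illustrative parameter trajectory are assumptions for this example.

```python
# Hypothetical sketch: does a sequence of strategy-parameter updates
# (alpha = learning rate, beta = inverse temperature) point along the
# gradient of expected reward, as in policy gradient ascent?
import numpy as np

rng = np.random.default_rng(0)

def expected_reward(alpha, beta, p_reward=(0.3, 0.7), n_trials=200, n_sims=200):
    """Monte Carlo estimate of mean reward for a softmax Q-learner with (alpha, beta)."""
    total = 0.0
    for _ in range(n_sims):
        q = np.zeros(2)
        for _ in range(n_trials):
            logits = beta * q
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            a = rng.choice(2, p=probs)
            r = float(rng.random() < p_reward[a])
            q[a] += alpha * (r - q[a])  # delta-rule value update
            total += r
    return total / (n_sims * n_trials)

def reward_gradient(theta, eps=(0.05, 0.25)):
    """Finite-difference gradient of expected reward w.r.t. (alpha, beta)."""
    grad = np.zeros(2)
    for i in range(2):
        hi, lo = theta.copy(), theta.copy()
        hi[i] += eps[i]
        lo[i] -= eps[i]
        grad[i] = (expected_reward(*hi) - expected_reward(*lo)) / (2 * eps[i])
    return grad

# Illustrative trajectory of (alpha, beta) estimates across experiment blocks,
# standing in for dynamically estimated participant parameters.
trajectory = np.array([[0.10, 1.0], [0.15, 1.8], [0.22, 2.5], [0.30, 3.1]])

for t in range(len(trajectory) - 1):
    update = trajectory[t + 1] - trajectory[t]
    grad = reward_gradient(trajectory[t].copy())
    cos = update @ grad / (np.linalg.norm(update) * np.linalg.norm(grad) + 1e-12)
    print(f"block {t}: cosine(update, reward gradient) = {cos:.2f}")
```

A positive cosine across blocks would indicate that the parameter updates climb the local reward landscape; the abstract’s finding that update effectiveness tracks local gradient strength corresponds to larger gains where the gradient norm is larger.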