Successively Pruned Q-Learning: Using Self Q-function to Reduce the Overestimation
Abstract
It is well known that the Q-learning algorithm suffers from overestimation, owing to its use of the maximum state-action value as an approximation of the maximum expected state-action value. Double Q-learning and other algorithms have been proposed as efficient ways to alleviate this overestimation. However, these methods rely on multiple Q-functions to reduce the overestimation and ignore the information contained in a single Q-function. In this paper, 1) we reinterpret the update process of Q-learning and build a more precise model compatible with the previous one; 2) we propose a novel and simple method that controls the maximization bias by exploiting the information of a single Q-function; 3) our method not only balances overestimation against underestimation, but also attains the minimum bias under proper hyper-parameters; 4) moreover, it generalizes naturally to both discrete and continuous control tasks. We show that our algorithms outperform Double DQN and other algorithms on several representative games. Additionally, classical off-policy actor-critic algorithms also benefit from our method. Finally, we extend our algorithm to multi-agent reinforcement learning.
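The overestimation referred to above is the standard maximization bias: the expected maximum of noisy value estimates exceeds the maximum of their expected values. The sketch below is only a minimal illustration of that effect, contrasted with the double-estimator remedy used by Double Q-learning; it is not the paper's proposed method, and the action count, noise model, and variable names are illustrative assumptions.

```python
import numpy as np

# Illustration (not the paper's algorithm): maximization bias in Q-learning.
# Suppose every true action value Q(s, a) equals 0, but the agent only sees
# noisy estimates Qhat(s, a) = Q(s, a) + noise. Taking the maximum of the
# noisy estimates is biased upward even though max_a Q(s, a) = 0.

rng = np.random.default_rng(0)
num_actions = 10          # assumed number of actions at the state
noise_std = 1.0           # assumed std. dev. of the estimation noise
num_trials = 100_000      # Monte Carlo repetitions

# Noisy Q estimates for each trial: shape (num_trials, num_actions).
q_estimates = rng.normal(0.0, noise_std, size=(num_trials, num_actions))

# Single estimator: E[max_a Qhat(s, a)], the quantity used in the standard
# Q-learning target. It overestimates the true maximum (which is 0 here).
single_estimator_value = q_estimates.max(axis=1).mean()

# Double estimator: choose the greedy action with one independent estimate
# and evaluate it with another, as in Double Q-learning. This removes the
# upward bias.
q_estimates_b = rng.normal(0.0, noise_std, size=(num_trials, num_actions))
greedy_actions = q_estimates.argmax(axis=1)
double_estimator_value = q_estimates_b[np.arange(num_trials), greedy_actions].mean()

print("true max_a Q(s, a):           0.0")
print(f"single estimator E[max Qhat]: {single_estimator_value:.3f}")  # clearly > 0
print(f"double estimator value:       {double_estimator_value:.3f}")  # close to 0
```

With these settings the single-estimator value comes out near 1.5 while the true maximum is 0; this gap is the overestimation that Double Q-learning, and the single-Q-function approach described in the abstract, aim to reduce.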