Monkey plays Pac-Man with compositional strategies and hierarchical decision-making

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This report presents findings of broad interest to behavioral, systems, and cognitive neuroscientists. The combination of a complex behavioral paradigm and sophisticated modeling provides significant insight and a novel approach to studying higher cognition in primates. Key clarifications are needed that have to do with better justification for the modeling strategy, selective comparisons within the data, and a more thorough consideration that subjects may employ a more passive strategy.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

Abstract

Humans can often handle daunting tasks with ease by developing a set of strategies to reduce decision-making into simpler problems. The ability to use heuristic strategies demands an advanced level of intelligence and has not been demonstrated in animals. Here, we trained macaque monkeys to play the classic video game Pac-Man. The monkeys’ decision-making may be described with a strategy-based hierarchical decision-making model with over 90% accuracy. The model reveals that the monkeys adopted the take-the-best heuristic by using one dominating strategy for their decision-making at a time and formed compound strategies by assembling the basis strategies to handle particular game situations. With the model, the computationally complex but fully quantifiable Pac-Man behavior paradigm provides a new approach to understanding animals’ advanced cognition.
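A minimal sketch may help make the abstract's central idea concrete: each basis strategy proposes an action, and a take-the-best agent follows whichever strategy currently carries the largest weight. This is our own illustration, not the authors' code; the strategy names and the toy game state are assumptions.

```python
from typing import Callable, Dict

Action = str   # "up" | "down" | "left" | "right"
State = dict   # toy game state

def local_strategy(s: State) -> Action:
    return s["nearest_pellet_dir"]    # head toward the closest pellet

def evade_strategy(s: State) -> Action:
    return s["away_from_ghost_dir"]   # move away from the nearest ghost

BASIS: Dict[str, Callable[[State], Action]] = {
    "local": local_strategy,
    "evade": evade_strategy,
}

def take_the_best(state: State, weights: Dict[str, float]) -> Action:
    """Follow the single highest-weighted strategy: one dominating
    strategy at a time, as reported for the monkeys."""
    best = max(weights, key=weights.get)
    return BASIS[best](state)

# With a ghost close by, the 'evade' weight dominates the decision.
state = {"nearest_pellet_dir": "left", "away_from_ghost_dir": "up"}
print(take_the_best(state, {"local": 0.2, "evade": 0.8}))  # -> "up"
```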

Article activity feed

  1. Reviewer #1 (Public Review):

    In this manuscript, Yang et al. trained monkeys to play the classic video game Pac-Man and fit their behavior with a hierarchical decision-making model. Adapting a complex behavioral paradigm like Pac-Man to the testing of non-human primates (NHPs) is novel. The task was well designed to help the monkeys understand the task elements step by step, which was confirmed by the monkeys' behavior. The authors reported that the monkeys adopted different strategies in different situations and that their decisions could be described by the model. The model predicted the behavior of both monkeys with over 90% accuracy. Hence, the conclusions are mostly supported by the data. As the authors claimed, the model can help quantify this complex behavioral paradigm, providing a new approach to understanding advanced cognition in non-human primates. However, several aspects deserve clarification or modification.

    1. The results showed that the monkeys adopted different strategies in different situations, which is also well described by the model. However, the authors haven't tested whether the strategy was optimal in a given situation.

    Our approach to analyzing the monkeys' behavior is not based on optimality. Instead, we centered our analyses on the strategies and showed that they described the monkeys' behavior well. The model and its fitting procedure do not assume that the monkeys were optimizing for anything. Nevertheless, the fitting results suggested that the strategies the monkeys chose were rational, which supports the validity of our model. As we have pointed out above, optimality is hard to define in such a complex game. In particular, most of the game is about collecting pellets; strategies used in only a small portion of the game could be ignored entirely when searching for optimal solutions. We feel that further analyses of the optimality issue would dilute the central message of the paper, and we have chosen not to include them here.

    According to the results, the monkeys didn't always perform the task in an optimal way either. Most of the time, the monkeys didn't actively adopt strategies with a long-term view. They were "passively" foraging in the task: chasing benefits and avoiding harm as these approached. This "benefit-tending, harm-avoiding" instinct is common to most creatures in the world, even single-cell organisms. When a Paramecium is placed in a complex environment with multiple attractants and repellents, it may also behave dynamically by adopting a linear combination of basic tending/avoiding strategies, although in a simpler way. In other words, the monkeys were responding to changes in the environment rather than actively optimizing their strategy to achieve larger benefits with less effort. The only exception is the suicides, in which the monkeys proactively accepted short-term harm to achieve larger benefits in the future.

    One possible reason is that the monkeys didn't have enough pressure to optimize their choices, since they would eventually get all the rewards no matter how many attempts they made. The only variable is the ghosts. Most of the time, the monkeys weren't really choosing between different targets/strategies. They were choosing the order in which to chase the options, not among the options themselves. It is similar to asking a monkey to choose whether to eat a piece of grape or a piece of cucumber first, rather than to choose one and give up the other. A possible way to avoid this is to stop the game once a ghost catches Pac-Man, or to limit each game's duration.

    The game is designed to force players to make decisions quickly to clear the pellets; otherwise, the ghosts would catch Pac-Man. Even in the monkey version of the game, where the monkeys always get another chance, Pac-Man's deaths lead to long delays with no rewards. The monkeys would not be able to complete a game if they did not actively plan their routes, especially in the late stage, when they must reach the sparsely placed pellets while escaping from the ghosts. In addition, we provided extra rewards when a maze was cleared in fewer rounds (20 drops for 1 to 3 rounds, 10 drops for 4 to 5 rounds, and 5 drops for more than 5 rounds), which added motivation for the monkeys to complete a game quickly.
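    The completion bonus is simple enough to state as code. Below is a direct transcription of the schedule described above; the function name and the unit labels are our own.

    ```python
    def completion_bonus(rounds_to_clear: int) -> int:
        """Juice-drop bonus for clearing a maze, per the schedule above."""
        if rounds_to_clear <= 3:
            return 20   # cleared in 1 to 3 rounds
        elif rounds_to_clear <= 5:
            return 10   # cleared in 4 to 5 rounds
        return 5        # cleared in more than 5 rounds

    assert completion_bonus(2) == 20 and completion_bonus(6) == 5
    ```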

    The monkeys’ behavior also suggested that they did not just adopt a passive strategy. Our analyses of the planned attack and suicide behavior clearly demonstrated that the monkeys actively made plans to change the game into more desirable states. Such behavior cannot be explained with a passive foraging strategy.

    2. It is well known that the value of an element is discounted by time and distance. However, the authors didn't take this into account in the model. A related problem is the utility of the bonus elements, including the fruits and the scared ghosts. Their utilities were affected not only by the values the authors assigned but also by other factors, including their novelty and the sense of achievement when they were captured; the ghosts attracted relatively much more attention than the other elements (considering that there are only two of them; see Figure 3E).

    These are good points, and our strategies could be built with more complexity to account for such factors. However, we focused our investigation on how to account for the monkeys' behavior with a set of strategies, and a set of simple strategies with a small number of parameters makes for a stronger argument.

    Using a complex game such as Pac-Man allows us to investigate all of these interesting cognitive processes, and we can certainly look into them in the future.

    3. The strategies are not independent; they are correlated with each other to some extent. In some conditions, this may result in falsely detecting more strategies than were actually used, as shown in Figure 2A.

    We have computed the Pearson correlations between the action sequences chosen under each basis strategy within each coarse-grained segment determined by the two-pass fitting procedure. As a control, we computed the correlation between each basis strategy and a random strategy, which generates actions randomly, as a baseline. Most strategy pairs' correlations were lower than the random baseline. These results are now included in the supplementary material (Appendix Figure 3).
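    To make the control concrete, here is a minimal sketch of such an analysis under our own simplifying assumptions: actions within a segment are coded as integers, and all sequences below are simulated placeholders rather than the actual fitted data.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N_ACTIONS, SEG_LEN = 4, 50   # up/down/left/right coded 0..3

    def pearson(a: np.ndarray, b: np.ndarray) -> float:
        """Pearson correlation between two coded action sequences."""
        return float(np.corrcoef(a, b)[0, 1])

    # Action sequences each basis strategy would choose in one segment (simulated).
    local_seq = rng.integers(0, N_ACTIONS, SEG_LEN)
    evade_seq = rng.integers(0, N_ACTIONS, SEG_LEN)

    # Baseline: correlation of a strategy with a purely random strategy.
    random_seq = rng.integers(0, N_ACTIONS, SEG_LEN)
    print(f"strategy pair r = {pearson(local_seq, evade_seq):.3f}")
    print(f"random baseline r = {pearson(local_seq, random_seq):.3f}")
    ```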

    Sometimes two strategies may give exactly the same action sequence in a game segment. To deal with this problem, we now include an extra step when fitting the model to the behavior, as described in Methods:

    “To ensure that the fitted weights are unique (Buja et al., 1989) in each time window, we combine utilities of any strategies that give exactly the same action sequence and reduce multiple strategy terms (e.g., local and energizer) to one hybrid strategy (e.g., local+energizer). After MLE fitting, we divide the fitted weight for this hybrid strategy equally among the strategies that give the same actions in the time segments.”
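    The merge-and-split step can be illustrated with a short sketch. This is our own toy illustration of the quoted procedure, not the authors' code; the strategy names and sequences are made up.

    ```python
    from collections import defaultdict

    def merge_identical(action_seqs: dict) -> dict:
        """Group strategies whose action sequences are identical into one
        hybrid term (e.g., 'local' and 'energizer' -> 'local+energizer')."""
        groups = defaultdict(list)
        for name, seq in action_seqs.items():
            groups[tuple(seq)].append(name)
        return {"+".join(names): list(seq) for seq, names in groups.items()}

    def split_hybrid_weights(fitted: dict) -> dict:
        """After fitting, divide each hybrid weight equally among its members."""
        out = {}
        for hybrid, w in fitted.items():
            members = hybrid.split("+")
            for m in members:
                out[m] = w / len(members)
        return out

    seqs = {"local": [0, 1, 2], "energizer": [0, 1, 2], "evade": [3, 1, 0]}
    print(merge_identical(seqs))   # {'local+energizer': [0, 1, 2], 'evade': [3, 1, 0]}
    print(split_hybrid_weights({"local+energizer": 0.8, "evade": 0.2}))
    ```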

    Moreover, as the reviewer correctly reasoned, correlations between the strategies could indeed produce spurious additional strategies. However, our finding is that the monkeys were using a single strategy most of the time, and such false alarms would work against this claim. Our conclusions therefore stand despite the strategy correlations.

    It is hard to believe that a monkey can maintain several strategies simultaneously, since doing so would exceed working memory/attention capacity.

    Exactly, and we are among the first to quantitatively demonstrate that the monkeys mostly relied on single strategies to play the game.

    Reviewer #2 (Public Review):

    In this intriguing paper, Yang et al. examine the behaviors of two rhesus monkeys playing a modified version of the well-known Pac-Man video game. The game poses an interesting challenge, since it requires flexible, context-dependent decisions in an environment with adversaries that change in real time. Using a modeling framework in which simple "basic" strategies are ensembled in a time-dependent fashion, the authors show that the animals' choices follow some sensible rules, including some counterintuitive strategies (running into ghosts for a teleport when most remaining pellets are far away).

    I like the motivation and findings of this study, which are likely to be interesting to many researchers in decision neuroscience and animal behavior. Many of the conclusions seem reasonable, and the results are detailed clearly. The key weakness of the paper is that it is primarily descriptive: it's hard to tell what new generalizable knowledge we take away from this model or these particular findings. In some ways, the paper reads as a promissory note for future studies (neural or behavioral or both) that might make use of this paradigm.

    I have two broad concerns, one mostly technical, one conceptual:

    First, the modeling framework, while adequate, is a bit ad hoc and seems to rely on many decisions that are specific to exactly this task. While I like the idea of modeling monkeys' choices using ensembling, the particular approach taken to segment time and the two-pass strategy for smoothing ensemble weights is only one of many possible approaches, and these decisions aren't particularly well-motivated. They appear to be reasonable and successful, but there is not much in the paper to connect them with better-known approaches in reinforcement learning (or, perhaps surprisingly, hierarchical reinforcement learning) that could link this work to other modeling approaches. In some ways, however, this is a question of taste, and nothing here is unreasonable.

    Thanks for the suggestion. In the new revision, we include a linear approximate reinforcement learning (LARL) model (Sutton, 1988; Tsitsiklis & Van Roy, 1997). The LARL model shares the same structure as a standard Q-learning algorithm but uses the monkeys' actual joystick movements as the fitting target. Although computationally more complex than the hierarchical model, it achieves worse fitting performance.
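    As a hedged sketch of what such a baseline could look like (not the authors' implementation): Q-values are linear in state features, updated by standard Q-learning on transitions taken from the monkey's own play, and the model is scored by how often its greedy action matches the monkey's joystick movement. The feature vectors, rewards, and actions below are simulated placeholders.

    ```python
    import numpy as np

    N_FEATURES, N_ACTIONS = 8, 4
    rng = np.random.default_rng(1)
    W = np.zeros((N_ACTIONS, N_FEATURES))   # one linear Q-function per action

    def q_values(feats: np.ndarray) -> np.ndarray:
        return W @ feats

    def td_update(feats, action, reward, next_feats, alpha=0.1, gamma=0.99):
        """One linear Q-learning step on an observed transition."""
        target = reward + gamma * np.max(q_values(next_feats))
        W[action] += alpha * (target - q_values(feats)[action]) * feats

    correct = total = 0
    for _ in range(1000):
        feats = rng.normal(size=N_FEATURES)           # placeholder game features
        next_feats = rng.normal(size=N_FEATURES)
        monkey_action = int(rng.integers(N_ACTIONS))  # the fitting target
        reward = float(rng.normal())
        correct += int(np.argmax(q_values(feats)) == monkey_action)
        total += 1
        td_update(feats, monkey_action, reward, next_feats)
    print(f"joystick-prediction accuracy (simulated): {correct / total:.2f}")
    ```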

    Second, there is an elision here of the distinction between how one models monkeys' behavior and what monkeys can be said to be "doing." That is, a model may be successful at making predictions while not in any way being a good description of the underlying cognitive or neuroscientific operations. More concretely: when we claim that a particular model of behavior is what agents "actually do," what we are usually saying is that (a) novel predictions from this model are borne out by the data in ways that predictions from competing models are not, and (b) this model gives a better quantitative account of existing data than its competitors. Since the present study is not designed as a test of the ensembling model (a), it needs to demonstrate better quantitative predictions (b).

    We concede the point that our model, while fitting the behavior well, does not directly prove that the monkeys actually solved the task in this way. The eye movement and pupil dilation analyses partly address this issue, as their results were consistent with what one would expect from the model. We also hope future recording experiments will provide neural evidence to support the model.

    But the baselines used in this study are both limited and weak. A model crafted by the authors to use only a single, fixed ensemble strategy correctly predicts 80% of choices, while the model with time-varying ensembling predicts roughly 90%. This is a clear improvement and some evidence that *if* the animals are ensembling strategies, they are changing the ensemble weights in time. But there is little here in the way of non-ensemble competitors. What about a standard Q-learning model with an inferred reward function (that is, trained to replicate the monkeys' data, not optimal performance)? The perceptron baseline as detailed seems a very poor control, given how shallow it is. That is, I'm not convinced that the authors have successfully ruled out "flat" models as explanations of this behavior, only that they have found an ensembled model to offer a reasonable explanation.

    We hope the new LARL model provides a better baseline control as a flat model. It performs better than the perceptron, yet much worse than our hierarchical model. Still, we must point out that, in theory, any hierarchical model can be matched in performance by a flat model (Ribas-Fernandes et al., 2011). The advantage of hierarchical models lies mainly in their smaller computational cost, which enables efficient planning. Even in a much simpler task, such as four-room navigation, a hierarchical model can plan much faster than a flat model, especially under conditions of limited working memory (Botvinick & Weinstein, 2014). Our Pac-Man task has an extensive feature space and requires real-time decision-making; as a result, a reasonably performing flat model would exceed the cognitive resources available in the brain. Even for a complex flat model such as the Deep Q-Network (which can be considered similar to a flat model, since it does not explicitly plan with temporally extended strategies; Mnih et al., 2015), game performance is much worse than that of a hierarchical model (Van Seijen et al., 2017). The monkeys' performance was therefore unlikely to have been achieved with a flat model. In addition, we trained the monkeys by introducing the game concepts gradually, with each training stage focusing on particular game aspects. This training procedure may have encouraged the monkeys to generalize the skills acquired in the early stages and use them as basis strategies in the later stages, when they faced the complete version of the Pac-Man task.

    Reviewer #3 (Public Review):

    Yang and colleagues present a tour de force paper demonstrating non-human primates playing a full-on Pac-Man video game. The authors reason that using a highly complex yet semi-controlled video game allows for the analysis of heuristic strategies in an animal model. They perform a set of well-motivated computational modeling analyses to demonstrate the utility of the experimental model.

    First, I would like to congratulate the authors on training non-human primates to perform such a complex and demanding task and on demonstrating that NHPs perform it well. From previous papers, we know that even complex AI systems have difficulty with this task, and, extrapolating from my own failings at playing Pac-Man, it is a difficult game to play.

    Overall, the analysis approach used in the paper is extremely well reasoned and executed, but what I am missing (and, I must add, what is not needed for the paper to be impactful on its own) is a more exhaustive model search. The deduction the authors follow is logically sound but builds very much on the assumptions of the basic strategy stratification performed first. This means that part of the hierarchical aspect of the behavioral strategies can be attributed to the heuristic stratification nature of the approach. I am not trying to imply that I think the behavior is not hierarchically organized, but there is a missed opportunity to characterize that hierarchical structure further (maybe in a graph-theoretical way; think Dasgupta scores).

    All in all this paper is wonderful. Congratulations to the authors.

    We thank the reviewer for the encouraging comments. We have included a new flat model in this revision for comparison against our hierarchical model and have discussed other experimental evidence that supports our claim.
