A neuromorphic model of active vision shows how spatiotemporal encoding in lobula neurons can aid pattern recognition in bees

Curation statements for this article:
  • Curated by eLife


Abstract

Bees’ remarkable visual learning abilities make them ideal for studying active information acquisition and representation. Here, we develop a biologically inspired model to examine how flight behaviours during visual scanning shape neural representation in the insect brain, exploring the interplay between scanning behaviour, neural connectivity, and visual encoding efficiency. Incorporating non-associative learning—adaptive changes without reinforcement—and exposing the model to sequential natural images during scanning, we obtain results that closely match neurobiological observations. Active scanning and non-associative learning dynamically shape neural activity, optimising information flow and representation. Lobula neurons, crucial for visual integration, self-organise into orientation-selective cells with sparse, decorrelated responses to orthogonal bar movements. They encode a range of orientations, biased by input speed and contrast, suggesting co-evolution with scanning behaviour to enhance visual representation and support efficient coding. To assess the significance of this spatiotemporal coding, we extend the model with circuitry analogous to the mushroom body, a region linked to associative learning. The model demonstrates robust performance in pattern recognition, implying a similar encoding mechanism in insects. Integrating behavioural, neurobiological, and computational insights, this study highlights how spatiotemporal coding in the lobula efficiently compresses visual features, offering broader insights into active vision strategies and bio-inspired automation.

Impact statements

Active vision dynamically refines spatiotemporal neural representations, optimising visual processing through scanning behaviour and non-associative learning, providing insights into efficient sensory encoding in dynamic environments.

Article activity feed

  1. eLife Assessment

    Inspired by bees' visual behavior, the goal of the manuscript is to develop a model of visual scanning, visual processing and learning to recognize visual patterns. In this model, pre-training with natural images leads to the formation of spatiotemporal receptive fields that can support associative learning. Due to an incomplete test of the necessity and sufficiency of the features included in the model, it cannot be concluded that the model is either the "minimal circuit" or the most biologically plausible circuit of this system. With a more in-depth analysis, the work has the potential to be important and very valuable to both experimental and computational neurobiologists.

  2. Reviewer #1 (Public Review):

    Insects, such as bees, are surprisingly good at recognizing visual patterns. How they achieve this challenging task with limited computational resources is not fully understood. Based on actual bee behaviour and visual circuit structure, MaBouDi et al. constructed a biologically plausible model in which the circuit extracts essential visual features from scanned natural scenes. The model successfully discriminated a varied set of visual patterns, much as real bees do. By implementing a type of Hebb's rule for non-associative learning, an early layer of the model extracted orientation information from natural scenes essential to pattern recognition. Throughout the paper, the authors provided intuitive logic for how the relatively simple circuit could achieve pattern recognition. This work could draw broad attention not only in visual neuroscience but also in computer vision.
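
    To make the non-associative learning step concrete: the kind of Hebbian rule referred to here can be illustrated with Oja's rule, a Hebbian update with implicit weight normalisation that is known to yield oriented, edge-like receptive fields when trained on (whitened) natural image patches. The sketch below is illustrative only — the unit count, learning rate, and toy edge stimuli are assumptions, not the paper's parameters:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def oja_learn(patches, n_units=16, lr=1e-3, epochs=20):
        """Oja's subspace rule: dW = lr * (y x^T - y y^T W), with y = W x.
        A Hebbian update whose decay term keeps the weights bounded; the
        rows of W converge towards the principal subspace of the inputs."""
        n_dim = patches.shape[1]
        W = rng.normal(scale=0.1, size=(n_units, n_dim))
        for _ in range(epochs):
            for x in patches:
                y = W @ x                            # unit responses
                W += lr * np.outer(y, x - W.T @ y)   # Hebbian + normalising decay
        return W

    def random_edge(size=9):
        """Toy stand-in for a whitened natural-image patch: an oriented edge."""
        theta = rng.uniform(0, np.pi)
        yy, xx = np.mgrid[0:size, 0:size] - size // 2
        edge = np.tanh(np.cos(theta) * xx + np.sin(theta) * yy)
        return (edge - edge.mean()).ravel()

    patches = np.array([random_edge() for _ in range(2000)])
    W = oja_learn(patches)
    print(W.shape)  # (16, 81): each row is a learned, typically oriented, filter
    ```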

    However, there are a number of weaknesses in the manuscript. 1) The authors claim that the model is inspired by micromorphology, yet it does not rigorously follow the detailed anatomy of the insect brain as currently known. 2) Some claims sound a bit too strong compared to what the authors demonstrated with the model. For example, when the authors say the model is minimal, they simply investigated how many lobula neurons are required for pattern discrimination in the model, yet the manuscript appears to use this to claim that the presented model is the minimal one required for visual tasks. 3) It lacks explanations of what mechanisms in the model can discriminate some patterns but not others, making the descriptions very qualitative. 4) The authors did not provide compelling evidence that the algorithm is particularly tuned to natural scenes.

  3. Reviewer #2 (Public Review):

    This study is inspired by the scanning movements observed in bees when performing visual recognition tasks. It uses a multilayered network, representing stages of processing in the visual lobes (lamina, medulla, lobula), and uses the lobula output as input to a model of associative learning in the mushroom body (MB). The network is first trained with short "scanning" sequences of natural images, in a non-associative adaptation process, and then several experimental paradigms where images are rewarded or punished are simulated, with the output of the MB able to provide the appropriate discriminative decisions (in some but not all cases). The lobula receptive fields formed by the initial adaptation process show spatiotemporal tuning to edges moving at particular orientations and speeds that are comparable to recorded responses of such neurons in the insect brain.

    There are two main limitations to the study in my view. First, although described (caption fig 1) as a model "inspired by the micromorphology" of the insect brain, implying a significant degree of accuracy and detail, there are many arbitrary features (unsupported by current connectomics). For example, the strongly constrained delay line structure from medulla to lobula neurons, and the use of a single MBON that has input synapses that undergo facilitation and decay according to different neuromodulators. Second, while it is reasonable to explore some arbitrary architectural features, given that not everything is yet known about these pathways, the presented work does not sufficiently assess the necessity and sufficiency of the different components, given the repeated claims that this is the "minimal circuit" required for the visual tasks explored.

    Regarding the mushroom body (MB) learning model, it is strange that no reference is made to recent models closely tied to connectomic and other data in fruit flies, which suggest that separate MBONs encode positive vs. negative value; that learning is not dependent on MBON activity (so is not STDP); that feedback from MBONs to dopaminergic signalling plays an important role, etc. Possibly the MB of the bee operates in a completely different way to that of the fly, but the presented model relies on relatively old data about MB function, mostly from insects other than bees (e.g. locust), so its relationship to the increasingly comprehensive understanding emerging for the fly MB needs to be clarified. It is implied that the complex interaction of the differential effects of dopamine and octopamine, as modelled here, is required to learn the more complex visual paradigms, but it is not actually tested whether simpler rules might suffice. Also, given previous work on models of view recognition in the MB, inspired by bees and ants, it seems plausible that simply using static 25×25 medulla activity as input to produce sparse activity in the KCs would be sufficient for MBON output to discriminate the patterns used in training, including the face stimulus. Thus it is not clear whether the spatiotemporal input and the lobula encoding are necessary to solve these tasks.
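
    The static baseline the reviewer proposes — raw 25×25 input randomly expanded onto many KCs, sparsified, and read out by a single MBON — is simple to state in code. A minimal sketch of that alternative (the fan-in probability, the ~5% sparsity level, and the perceptron-style valence readout are illustrative assumptions, not drawn from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    N_IN, N_KC, K_ACTIVE = 25 * 25, 2000, 100   # 25x25 input; ~5% KC sparsity (assumed)
    W_pn_kc = (rng.random((N_KC, N_IN)) < 0.05).astype(float)  # sparse random fan-in

    def kc_response(x):
        """Expand the input onto KCs; keep only the top-k most driven (sparse code)."""
        drive = W_pn_kc @ x
        kc = np.zeros(N_KC)
        kc[np.argsort(drive)[-K_ACTIVE:]] = 1.0
        return kc

    def train_mbon(stimuli, valences, lr=0.1, epochs=10):
        """Perceptron-style MBON readout: whenever the predicted valence is wrong,
        nudge the synapses of the currently active KCs towards the true valence."""
        w = np.zeros(N_KC)
        for _ in range(epochs):
            for x, v in zip(stimuli, valences):
                kc = kc_response(x)
                pred = 1.0 if w @ kc >= 0 else -1.0
                if pred != v:
                    w += lr * v * kc
        return w

    # Toy 25x25 "plus" vs "multiplication" patterns.
    plus = np.zeros((25, 25)); plus[12, :] = 1.0; plus[:, 12] = 1.0
    mult = np.clip(np.eye(25) + np.fliplr(np.eye(25)), 0.0, 1.0)
    w = train_mbon([plus.ravel(), mult.ravel()], [+1.0, -1.0])
    for name, img in (("plus", plus), ("mult", mult)):
        score = w @ kc_response(img.ravel())
        print(name, "->", "reward" if score >= 0 else "punishment")
    ```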

    It is also difficult to interpret the range of results in fig 3. The network sometimes learns well, sometimes just adequately (perhaps comparable to bees), and sometimes fails. The presentation of these results does not seem to identify any coherent pattern underlying success or failure, other than that the ability to generalise seems limited. That is, recognition (in most cases) requires the presentation of exactly the same stimulus in exactly the same way (same scanning pattern, distance and speed). In particular, it is hard to know what to conclude when the network appears able to learn some "complex patterns" (spirals, faces) but fails to learn the apparently simple plus vs. multiplication symbol discrimination if it is trained and tested with a scan passing across the whole pattern instead of just the lower half.

    In summary, although it is certainly interesting to explore how active vision (scanning a visual pattern) might affect the encoding of stimuli and the ability to learn to discriminate rewarding stimuli, some claims in the paper need to be tempered or better supported by the demonstration that alternative, equally plausible, models of the visual and mushroom body circuits are not sufficient to solve the given tasks.

  4. Reviewer #3 (Public Review):

    In this manuscript, the authors use the data collected and observations made on bees' scanning behaviour during visual learning to design a bio-inspired artificial neural network. The network follows the architecture of the bee's visual system, in which photoreceptors project to the lamina and then the medulla; medulla neurons connect to a set of spiking neurons in the lobula. Lobula neurons project to Kenyon cells and then to an MBON, which controls reward and punishment. The authors then test the performance of the network in comparison with real bee data, finding it to perform well in all tasks. The paper attempts to reproduce a living organism's network with a practical application in mind, and it is quite impressive! I appreciate both the potential implications for the understanding of biological systems and the applications in the development of autonomous agents, making the paper absolutely worth reading.
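
    Reading this chain as a single forward pass may help; the skeleton below is only a schematic of the described layering (layer sizes, delay choices, and the operation at each stage are placeholders, not the authors' implementation):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def lamina(photoreceptors):
        """Contrast/high-pass step (placeholder for lamina processing)."""
        return photoreceptors - photoreceptors.mean()

    def medulla(lam, t, delays=(0, 1, 2)):
        """Return the current and delayed frames (placeholder for medulla delay lines)."""
        return [lam[max(t - d, 0)] for d in delays]

    W_lob = rng.normal(size=(16, 3 * 75))            # 16 lobula units (assumed)
    W_kc = (rng.random((1000, 16)) < 0.2).astype(float)  # random lobula-to-KC fan-in

    def forward(frames, t):
        lam = np.array([lamina(f) for f in frames])
        med = np.concatenate(medulla(lam, t))        # spatiotemporal input vector
        lob = np.maximum(W_lob @ med, 0.0)           # rectified lobula responses
        drive = W_kc @ lob
        kc = (drive > np.percentile(drive, 95)).astype(float)  # ~5% active KCs
        return kc                                    # an MBON would sum these for valence

    frames = [rng.random(75) for _ in range(5)]      # toy 1-D "scan" of 5 frames
    print(forward(frames, t=4).sum())                # number of active KCs
    ```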

    However, I believe that the current version is somewhat lacking in clarity regarding the methodology and some of the key terms used to describe the model.

    Definitions:

    Throughout the manuscript, the authors use some key terminology that I believe would benefit from some clarification.

    The generated model is described in the title and once in the introduction as "neuromorphic". The model is definitely bio-inspired, but at least in some layers of the neural network it is built very differently from actual brain connectivity. Generally, when we use the term neuromorphic we imply many advantages of neural tissue, such as energy efficiency, that I am not sure the current model achieves. I absolutely see how this work is going in that direction, and I also fundamentally agree with the choice of terminology, but this should be clearly explained so as not to risk over-interpretation.

    The authors describe this as a model of "active vision". This is done in the title of the article and in many paragraph headings (Methods, Results). In the introduction, however, the term active vision is reserved for the description of bees' behavior. Indeed, the developed model is not a model of active vision, as this would require the model to control the movement of the "camera". Here, instead, the stimulus sequence is presented to the model in a fixed progression. What I suspect is that the authors' aim is to describe a model that supports the bees' active vision, not a model of active vision. I believe this should be very clear from the paper, and it may be appropriate to remove the term from the title.

    The short title states that this network is minimal. This is then characterized in the introduction as the minimal network capable of enabling active vision in bees. The authors, however, in their experiment only vary the number of lobula neurons, without changing other parts of the architecture. Given this, we can only say that 16 lobula neurons is the minimal number required to solve the experimental task with the given model. I don't believe that this is generalizable to bees, nor that this network is minimal, as there may be different architectures (especially for the other layers) that require fewer neurons overall. Moreover, the tasks attempted in the minimal-network experiment did not include any of the complex stimuli presented in figure 3, like faces. It may be that 16 lobula neurons are sufficient for the X vs + and clockwise vs counter-clockwise spirals, but we do not know whether increasing stimulus complexity would result in a failure of the model with 16 neurons.

    Methodology:

    The explanation of the model is currently somewhat lacking in clarity and detail. This risks negatively impacting the relevance of the whole work, which is interesting and worth reading! This issue also affects the interpretation of the results, as it is not clear to what extent each part of the network could affect the results shown. This is especially the case when the network under-performs with respect to the best-performing scenario (e.g., when varying the speed and the part of the pattern that is observed, as in Fig 2C). Adding a detailed technical scheme/drawing specific to the network architecture would significantly increase the clarity of the Methods section and the interpretation of the results.

    On a similar note, the authors make some comparisons between the model and real bees. However, it remains unclear whether these similarities are actually indicative of an optimality in the bees' visual scanning strategy, or just derive from the authors' design. This is for me particularly important in the experiments aimed at finding the best scanning procedure. If the initial model training on natural images is performed by presenting frames moving left to right, the higher efficiency of lower-half scanning may be due to how the weights in the initial layers are structured and to low generalizability of the model, rather than to the optimality of the strategy.

  5. Author response:

    Reviewer #1 (Public Review):

    Insects, such as bees, are surprisingly good at recognizing visual patterns. How they achieve this challenging task with limited computational resources is not fully understood. Based on actual bee behaviour and visual circuit structure, MaBouDi et al. constructed a biologically plausible model in which the circuit extracts essential visual features from scanned natural scenes. The model successfully discriminated a varied set of visual patterns, much as real bees do. By implementing a type of Hebb's rule for non-associative learning, an early layer of the model extracted orientation information from natural scenes essential to pattern recognition. Throughout the paper, the authors provided intuitive logic for how the relatively simple circuit could achieve pattern recognition. This work could draw broad attention not only in visual neuroscience but also in computer vision.

    We appreciate your positive feedback.

    However, there are a number of weaknesses in the manuscript. 1) The authors claim that the model is inspired by micromorphology, yet it does not rigorously follow the detailed anatomy of the insect brain as currently known. 2) Some claims sound a bit too strong compared to what the authors demonstrated with the model. For example, when the authors say the model is minimal, they simply investigated how many lobula neurons are required for pattern discrimination in the model, yet the manuscript appears to use this to claim that the presented model is the minimal one required for visual tasks. 3) It lacks explanations of what mechanisms in the model can discriminate some patterns but not others, making the descriptions very qualitative. 4) The authors did not provide compelling evidence that the algorithm is particularly tuned to natural scenes.

    We appreciate the reviewer's constructive feedback and have revised the manuscript to clarify and strengthen our claims. Below, we address each of the concerns raised:

    (1) The model does not rigorously follow the detailed anatomy of the insect brain

    We acknowledge that our model is an abstraction rather than a direct reproduction of the full micromorphology of the insect brain. The goal of our study was not to replicate every anatomical feature but rather to extract the core computational principles underlying active vision, based on the functional activity of the insect brain. Although recent connectome studies provide detailed structural maps, they do not fully capture the functional dynamics of sensory processing and behavioural outcomes. Our model integrates key neurobiological insights, including the hierarchical structure of the optic lobes, lateral inhibition in the lobula, and non-associative learning mechanisms shaping spatiotemporal receptive fields.
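
    As one concrete reading of the lateral inhibition mentioned here, a generic subtractive scheme can be sketched (the all-to-all connectivity and inhibition strength are illustrative assumptions, not the model's exact circuit):

    ```python
    import numpy as np

    def lateral_inhibition(responses, strength=0.15):
        """Each unit is suppressed by the summed activity of all others:
        r_i' = max(r_i - strength * sum_{j != i} r_j, 0).
        This sharpens the activity profile and reduces redundancy."""
        r = np.asarray(responses, dtype=float)
        total = r.sum()
        return np.maximum(r - strength * (total - r), 0.0)

    r = np.array([0.9, 0.7, 0.3, 0.2])
    print(lateral_inhibition(r))
    # Strongly driven units survive; weakly driven units are pushed towards zero.
    ```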

    However, to address this concern, we have revised the introduction and discussion to explicitly acknowledge the model’s level of abstraction and its relationship to the known anatomy of the insect visual system. Furthermore, we highlight future directions in which connectomic data could refine our model.

    (2) Strength of claims regarding minimality of the model

    We appreciate the reviewer’s concern regarding the definition of a "minimal" model. Our intention was not to claim that this model represents the absolute minimal neural architecture for visual learning tasks but rather that it identifies a minimal set of necessary computational elements that enable pattern discrimination in insects. To clarify this, we have refined the text to ensure that our conclusions about minimality are explicitly tied to the specific constraints and assumptions of our model. For instance, in the revised manuscript, we emphasise that our findings demonstrate how the number of lobula neurons, the inhibitory lateral connections, and the non-associative learning model affect neural representation and discrimination performance, rather than establishing an absolute lower bound on the complexity required for visual processing in insects.

    (3) Mechanistic explanations for pattern discrimination

    Thank you for highlighting this point. We have conducted a more detailed analysis of the model’s response to different patterns and expanded our discussion of the underlying mechanisms. To address this, we have refined our explanation of how different scanning strategies and temporal integration mechanisms contribute to neural selectivity in the lobula and overall discrimination performance. Specifically:

    - Figure 3 illustrates how the model benefits from sparse coding in the visual network, leading to improved performance in pattern recognition tasks.

    - Figure 5 now includes a more detailed explanation of how different scanning strategies influence the selectivity and separability of lobula neuron responses. Additionally, we provide further analysis of why the model successfully discriminates certain patterns (e.g., simple oriented bars) but struggles with more complex spatially structured quadrant-based patterns.

    - We elaborate on how sequential sampling, temporal coding, and lateral inhibition collectively shape neural representations, enabling the model to distinguish between visual stimuli effectively.

    These refinements provide a clearer mechanistic explanation of the model’s strengths and limitations, ensuring a more comprehensive understanding of its function.

    (4) Evidence that the model is tuned to natural scenes

    We have revised the manuscript to provide stronger support for the claim that the model is particularly adapted to natural scenes. Specifically:

    - Figure 3 demonstrates that training on natural images leads to sparse, decorrelated responses in the lobula, a hallmark of efficient coding observed in biological systems.

    - Supplementary Figure 2-1B shows that training with shuffled images fails to produce structured receptive fields, reinforcing that the statistical structure of natural images is crucial for efficient learning.

    - We now explicitly discuss how the receptive fields emerging from non-associative learning align with known orientation-selective responses in insect visual neurons, supporting the idea that the model is optimised for processing natural visual inputs (Figures 3 and 6, and the Discussion section).

    Taken together, these revisions clarify how the model captures fundamental principles of insect vision without making overly strong claims about biological fidelity. We thank the reviewer for these insightful comments; addressing them has significantly strengthened the clarity and depth of our manuscript.

    Reviewer #2 (Public Review):

    This study is inspired by the scanning movements observed in bees when performing visual recognition tasks. It uses a multilayered network, representing stages of processing in the visual lobes (lamina, medulla, lobula), and uses the lobula output as input to a model of associative learning in the mushroom body (MB). The network is first trained with short "scanning" sequences of natural images, in a non-associative adaptation process, and then several experimental paradigms where images are rewarded or punished are simulated, with the output of the MB able to provide the appropriate discriminative decisions (in some but not all cases). The lobula receptive fields formed by the initial adaptation process show spatiotemporal tuning to edges moving at particular orientations and speeds that are comparable to recorded responses of such neurons in the insect brain.

    There are two main limitations to the study in my view. First, although described (caption fig 1) as a model "inspired by the micromorphology" of the insect brain, implying a significant degree of accuracy and detail, there are many arbitrary features (unsupported by current connectomics). For example, the strongly constrained delay line structure from medulla to lobula neurons, and the use of a single MBON that has input synapses that undergo facilitation and decay according to different neuromodulators. Second, while it is reasonable to explore some arbitrary architectural features, given that not everything is yet known about these pathways, the presented work does not sufficiently assess the necessity and sufficiency of the different components, given the repeated claims that this is the "minimal circuit" required for the visual tasks explored.

    We appreciate your feedback and have refined the manuscript to clarify model design choices and address concerns regarding minimality.

    (1) Model Architecture and Functional Simplifications
    While our model is inspired by the insect visual system, it is not intended as an exact anatomical reconstruction but rather as a functional abstraction to uncover key computational principles of active vision and visual learning. The delay-line structure and simplified MBON implementation were deliberate choices to enable spatiotemporal encoding and associative learning without overcomplicating the model. As connectome data alone do not fully reveal functional relationships, our approach serves as a hypothesis-generating tool for future neurobiological studies.
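
    Why a delay-line arrangement yields spatiotemporal (direction- and speed-) tuning can be seen in a toy example: if inputs from successive spatial positions are delayed so that an edge sweeping at the preferred speed arrives at the summing unit simultaneously from all positions, that direction of motion produces a large coincident response while the reverse direction does not. The parameterisation below is illustrative, not the model's:

    ```python
    import numpy as np

    def delay_line_unit(stimulus, weights, delays):
        """Sum inputs from spatial position i taken `delays[i]` steps in the past.
        If the delays match a stimulus speed, all positions contribute at the
        same readout time and the response is large (direction selective)."""
        T, X = stimulus.shape
        resp = np.zeros(T)
        for t in range(T):
            acc = sum(w * stimulus[t - d, i]
                      for i, (w, d) in enumerate(zip(weights, delays))
                      if t - d >= 0)
            resp[t] = max(acc, 0.0)  # rectification
        return resp

    def moving_edge(speed, T=20, X=10):
        """A bright point sweeping across X positions; speed=+1 is left-to-right."""
        s = np.zeros((T, X))
        for t in range(T):
            s[t, (t * speed) % X] = 1.0
        return s

    X = 10
    weights = np.ones(X)
    delays = (X - 1) - np.arange(X)   # later-reached positions wait less
    for sp in (+1, -1):
        r = delay_line_unit(moving_edge(sp), weights, delays)
        print(f"speed {sp:+d}: peak response {r.max():.0f}")
    # Left-to-right motion (+1) aligns with the delays (peak 10);
    # right-to-left motion (-1) never sums coherently (peak <= 2).
    ```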

    (2) Necessity and Sufficiency of Model Components
    We have removed overstatements about minimality and now clarify that our model represents a functional circuit rather than the absolute minimal configuration. Additionally, we conducted new control experiments assessing the influence of different model components, further justifying key mechanisms such as spatiotemporal encoding and lateral inhibition.

    For a more detailed discussion of these revisions and improvements, please refer to our response to the Journal, above.

    Regarding the mushroom body (MB) learning model, it is strange that no reference is made to recent models closely tied to connectomic and other data in fruit flies, which suggest that separate MBONs encode positive vs. negative value; that learning is not dependent on MBON activity (so is not STDP); that feedback from MBONs to dopaminergic signalling plays an important role, etc. Possibly the MB of the bee operates in a completely different way to that of the fly, but the presented model relies on relatively old data about MB function, mostly from insects other than bees (e.g. locust), so its relationship to the increasingly comprehensive understanding emerging for the fly MB needs to be clarified. It is implied that the complex interaction of the differential effects of dopamine and octopamine, as modelled here, is required to learn the more complex visual paradigms, but it is not actually tested whether simpler rules might suffice. Also, given previous work on models of view recognition in the MB, inspired by bees and ants, it seems plausible that simply using static 25×25 medulla activity as input to produce sparse activity in the KCs would be sufficient for MBON output to discriminate the patterns used in training, including the face stimulus. Thus it is not clear whether the spatiotemporal input and the lobula encoding are necessary to solve these tasks.

    Thank you for your suggestion. The primary focus of this study was not to uncover the exact mechanisms of associative learning in the mushroom body (MB) but rather to evaluate the role of lobula output activity in active vision. The associative learning component was included as a simplified mechanism to assess how the spatiotemporal encoding in the lobula contributes to visual pattern learning.

    We conducted a detailed analysis of lobula neuron activity, focusing on sparsity, decorrelation, and selectivity to demonstrate how the visual system extracts compact yet relevant signals before reaching the learning centre (see Figure 5). Theoretical predictions based on these findings suggest that such encoding enhances pattern recognition performance. Selecting this plausible associative learning mechanism allowed us to evaluate this capability explicitly, while also facilitating comparison with previous active-vision experiments and assessment of how different components influence bee behaviour.
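
    The two properties mentioned — sparsity and decorrelation — can be pinned down with standard definitions; the sketch below uses the Vinje–Gallant lifetime sparseness and the mean pairwise response correlation (the paper's exact metrics may differ):

    ```python
    import numpy as np

    def lifetime_sparseness(r):
        """Vinje & Gallant (2000): S = (1 - (mean r)^2 / mean(r^2)) / (1 - 1/N).
        Near 0 for a flat response profile, near 1 for a response to few stimuli."""
        r = np.asarray(r, dtype=float)
        return (1 - r.mean() ** 2 / np.mean(r ** 2)) / (1 - 1 / r.size)

    def mean_pairwise_correlation(R):
        """Mean off-diagonal correlation across neurons.
        R: (n_neurons, n_stimuli) response matrix; lower -> more decorrelated."""
        C = np.corrcoef(R)
        return C[~np.eye(len(C), dtype=bool)].mean()

    rng = np.random.default_rng(3)
    dense = rng.random((16, 100))                # broadly tuned responses
    sparse = np.where(dense > 0.9, dense, 0.0)   # thresholded, sparse responses
    print(lifetime_sparseness(dense[0]), lifetime_sparseness(sparse[0]))
    print(mean_pairwise_correlation(dense), mean_pairwise_correlation(sparse))
    ```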

    We acknowledge that recent Drosophila connectomics studies suggest alternative MB architectures, including separate MBONs encoding positive vs. negative values, learning mechanisms independent of MBON activity, and feedback from MBONs to dopaminergic pathways. However, visual learning mechanisms in the MB remain poorly characterised, especially in bees, where the functional relevance of different MBON configurations is still unclear. The decision to simplify the MB learning process was intentional, allowing us to prioritise model interpretability over anatomical replication.

    These simplifications have been explicitly discussed in the revised manuscript, where we suggest future directions for integrating more biologically detailed MB models to enhance our understanding of active visual learning in insects. For a broader discussion of our rationale for prioritising computational simplifications over direct neurobiological replication, please refer to our response to the Journal, above.

    It is also difficult to interpret the range of results in fig 3. The network sometimes learns well, sometimes just adequately (perhaps comparable to bees), and sometimes fails. The presentation of these results does not seem to identify any coherent pattern underlying success or failure, other than that the ability to generalise seems limited. That is, recognition (in most cases) requires the presentation of exactly the same stimulus in exactly the same way (same scanning pattern, distance and speed). In particular, it is hard to know what to conclude when the network appears able to learn some "complex patterns" (spirals, faces) but fails to learn the apparently simple plus vs. multiplication symbol discrimination if it is trained and tested with a scan passing across the whole pattern instead of just the lower half.

    We acknowledge that the variability in the model’s performance across different tasks and conditions required a clearer explanation. In the revised manuscript, we have analysed the underlying factors influencing success and failure in greater detail and have expanded the discussion on the model’s generalisation limitations.

    To address this, we have conducted new control experiments and deeper analyses, now presented in Figure 5 and Figure 6F, which illustrate how scanning conditions impact recognition performance. Specifically, we examine why the model can successfully learn complex patterns (e.g., spirals, faces) but struggles with apparently simpler tasks, such as distinguishing between a plus and a multiplication symbol when scanning the entire pattern rather than just the lower half. Our results suggest that spatially constrained scanning enhances discriminability, while whole-pattern scanning reduces selectivity due to weaker and less sparse feature encoding in lobula neurons.

    We have also clarified in the Discussion section that while the model demonstrates robust pattern learning under specific conditions, its ability to generalise remains limited when tested with complex patterns (Figure 6F). Further investigation is needed to explore how adaptive scanning strategies or hierarchical processing might improve generalisation.

    In summary, although it is certainly interesting to explore how active vision (scanning a visual pattern) might affect the encoding of stimuli and the ability to learn to discriminate rewarding stimuli, some claims in the paper need to be tempered or better supported by the demonstration that alternative, equally plausible, models of the visual and mushroom body circuits are not sufficient to solve the given tasks.

    There is limited knowledge in the literature regarding the neural correlates of visual-related plasticity in the mushroom body (MB). The majority of our current understanding of the MB is derived from studies on olfactory learning, particularly in Drosophila, which does not provide sufficient data to directly implement or comprehensively compare alternative models for visual learning.

    However, the primary focus of our study is on active vision and how spatiotemporal signals are encoded in the insect visual system. Rather than aiming to replicate a detailed biological model of MB function, we intentionally employed a simplified associative learning network to investigate how neural activity emerging from our visual processing model can support pattern recognition. This approach also allows us to compare model performance with bee behaviour, drawing on insights from previous experimental work on active vision in bees.

    We now discuss the limitations of our approach and the rationale for selectively incorporating specific neural network components in lines 652-677. Additionally, we have provided further justification (see responses above) for prioritising a simplified model, rather than attempting to mimic a highly detailed, yet currently unverified, alternative learning circuit. These clarifications help ensure that our claims are appropriately tempered while still demonstrating the functional relevance of our model.

    Reviewer #3 (Public Review):

    In this manuscript, the authors use the data collected and observations made on bees' scanning behaviour during visual learning to design a bio-inspired artificial neural network. The network follows the architecture of the bee's visual system, in which photoreceptors project to the lamina and then the medulla; medulla neurons connect to a set of spiking neurons in the lobula. Lobula neurons project to Kenyon cells and then to an MBON, which controls reward and punishment. The authors then test the performance of the network in comparison with real bee data, finding it to perform well in all tasks. The paper attempts to reproduce a living organism's network with a practical application in mind, and it is quite impressive! I appreciate both the potential implications for the understanding of biological systems and the applications in the development of autonomous agents, making the paper absolutely worth reading.

    Thank you for your positive feedback and appreciation of our work.

    However, I believe that the current version is somewhat lacking in clarity regarding the methodology and some of the key terms used to describe the model.

    Definitions:
    Throughout the manuscript, the authors use some key terminology that I believe would benefit from some clarification.
    The generated model is described in the title and once in the introduction as "neuromorphic". The model is definitely bio-inspired, but at least in some layers of the neural network it is built very differently from actual brain connectivity. Generally, when we use the term neuromorphic we imply many advantages of neural tissue, such as energy efficiency, that I am not sure the current model achieves. I absolutely see how this work is going in that direction, and I also fundamentally agree with the choice of terminology, but this should be clearly explained so as not to risk over-interpretation.

    We appreciate the reviewer’s feedback and acknowledge the importance of clarifying key terminology in our manuscript. As outlined in our response to the Journal, we intentionally simplified the model to focus on understanding the core computational processes involved in active vision rather than precisely replicating the full complexity of insect neural circuits (see other reasons for simplification in the manuscript). This simplification allows us to systematically analyse the influence of specific components underlying active vision mechanisms.

    Despite these simplifications, our model incorporates key neuromorphic principles, including the use of a recurrent neural network architecture and a spiking neuron model at multiple processing levels. These elements enable biologically inspired information processing, aligning with the fundamental characteristics of neuromorphic computing, even if the model does not explicitly focus on hardware efficiency or energy constraints.
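
    For concreteness, the kind of spiking unit meant here — a leaky integrate-and-fire neuron — can be written in a few lines; the time constant, threshold, and drive values below are generic textbook choices, not the model's parameters:

    ```python
    import numpy as np

    def lif_neuron(input_current, dt=1.0, tau=20.0, v_rest=0.0,
                   v_thresh=1.0, v_reset=0.0):
        """Leaky integrate-and-fire: tau dV/dt = -(V - v_rest) + I(t).
        Emits a spike and resets whenever V crosses threshold."""
        v, spikes = v_rest, []
        for t, i_t in enumerate(input_current):
            v += dt / tau * (-(v - v_rest) + i_t)  # Euler integration
            if v >= v_thresh:
                spikes.append(t)
                v = v_reset
        return spikes

    # A step current drives regular firing; the firing rate grows with drive.
    for amp in (1.2, 2.0, 4.0):
        n = len(lif_neuron([amp] * 500))
        print(f"drive {amp}: {n} spikes in 500 steps")
    ```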

    The authors describe this as a model of "active vision". This is done in the title of the article, and in the many paragraph headings (methods, results). In the introduction, however, the term active vision is reserved to the description of bees' behavior. Indeed, the developed model is not a model of active vision, as this would require for the model to control the movement of the "camera". Here instead the stimuli display is given to the model in a fixed progression. What I suspect is that the authors' aim is to describe a model that supports the bees' active vision, not a model of active vision. I believe this should be very clear from the paper, and it may be appropriate to remove the term from the title.

    While our model does not actively control camera movement in the environment, it does simulate the effects of active vision by incorporating scanning dynamics. Our results demonstrate that model responses change significantly with variations in scanning speed and restrictions of the scanning area, highlighting the importance of movement in shaping visual encoding. However, we acknowledge that true active vision would involve adaptive, real-time control of gaze or trajectory; this is the next step beyond the current implementation towards a more realistic model of active vision. To address your concern, we have discussed the potential for incorporating dynamic flight behaviours in future studies, allowing the model to actively adjust its scanning strategy based on learned visual cues.
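
    The scanning dynamics described here amount to sliding a viewing window across a stimulus at a chosen speed and within a chosen region; a sketch of how such input sequences can be generated (window size, speed, and region are illustrative parameters, not the paper's):

    ```python
    import numpy as np

    def scan_sequence(image, window=(25, 25), speed=2, row_band=None):
        """Generate frames by sliding a window left-to-right across `image`.
        `speed` is pixels advanced per frame; `row_band` fixes the top row of
        the window, restricting the scan to a horizontal strip of the pattern."""
        h, w = window
        top = row_band if row_band is not None else (image.shape[0] - h) // 2
        frames = [image[top:top + h, left:left + w]
                  for left in range(0, image.shape[1] - w + 1, speed)]
        return np.stack(frames)

    img = np.random.default_rng(4).random((50, 100))
    mid = scan_sequence(img, speed=2)                 # default: central band
    lower = scan_sequence(img, speed=2, row_band=25)  # lower-half scan
    fast = scan_sequence(img, speed=5)                # fewer frames at higher speed
    print(mid.shape, lower.shape, fast.shape)
    ```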

    The short title states that this network is minimal. This is then characterized in the introduction as the minimal network capable of enabling active vision in bees. The authors, however, in their experiment only vary the number of lobula neurons, without changing other parts of the architecture. Given this, we can only say that 16 lobula neurons is the minimal number required to solve the experimental task with the given model. I don't believe that this is generalizable to bees, nor that this network is minimal, as there may be different architectures (especially for the other layers) that require fewer neurons overall. Moreover, the tasks attempted in the minimal-network experiment did not include any of the complex stimuli presented in figure 3, like faces. It may be that 16 lobula neurons are sufficient for the X vs + and clockwise vs counter-clockwise spirals, but we do not know whether increasing stimulus complexity would result in a failure of the model with 16 neurons.

    We agree that analysing only the number of lobula neurons is not sufficient to establish a truly minimal model for active vision. To address this, we conducted further control experiments to evaluate the influence of other key components, including non-associative learning, scanning behaviour, and lateral connectivity, on model performance. Our results suggest that the proposed model represents a computationally minimal network capable of implementing a basic active vision process, but a more complex model would be required for higher-order visual tasks.

    However, to avoid potential misinterpretation, we have revised the short title and updated the manuscript to clarify that our model identifies a possible minimal functional circuit rather than the absolute minimal network for active vision. Additionally, we have added further discussion on the simplifications made in the model and emphasised the need for future studies to explore alternative architectures and assess their relevance for understanding active vision in insects.

    Methodology:

    The explanation of the model is currently somewhat lacking in clarity and detail. This risks negatively impacting the relevance of the whole work, which is interesting and worth reading! This issue also affects the interpretation of the results, as it is not clear to what extent each part of the network could affect the results shown. This is especially the case when the network under-performs with respect to the best-performing scenario (e.g., when varying the speed and the part of the pattern that is observed, as in Fig 2C). Adding a detailed technical scheme/drawing specific to the network architecture would significantly increase the clarity of the Methods section and the interpretation of the results.
    On a similar note, the authors make some comparisons between the model and real bees. However, it remains unclear whether these similarities are actually indicative of an optimality in the bees' visual scanning strategy, or just derive from the authors' design. This is for me particularly important in the experiments aimed at finding the best scanning procedure. If the initial model training on natural images is performed by presenting frames moving left to right, the higher efficiency of lower-half scanning may be due to how the weights in the initial layers are structured and to low generalizability of the model, rather than to the optimality of the strategy.

    We appreciate the reviewer’s constructive feedback and have taken steps to enhance the clarity, interpretability, and transparency of our model description and results. Below, we address the concerns regarding model explanation, performance interpretation, and the comparison with real bee behaviour.

    (1) Improved Model Explanation and Network Clarity: We apologise that the previous version of the manuscript did not fully detail the architecture and functioning of the model. To address this, we have expanded the Methods section with a more detailed breakdown of the network components, their roles, and their contribution to active vision processing. Additionally, we have summarised the network architecture and its implementation for visual learning tasks at the beginning of the Results section, providing a clearer overview of the information flow from visual input to associative learning. Furthermore, we have explicitly analysed and discussed the role of key model components, including scanning strategies, lateral connectivity, and non-associative learning mechanisms, clarifying how each contributes to the observed results.

    (2) Interpretation of Model Performance Variability: Understanding the factors influencing performance variability is crucial, and to improve clarity, we have conducted further analysis of model performance across different conditions, particularly examining the effects of scanning speed, spatial constraints, and feature encoding (see Figure 2C). Additionally, we have expanded the discussion on how scanning conditions impact performance, providing explanations for why some conditions lead to higher or lower discrimination success. Furthermore, we have clarified why certain stimuli present greater challenges for the model, linking these difficulties to receptive field properties and scanning dynamics.

    (3) Comparison Between Model Behaviour and Real Bees: To address your concern regarding the link between scanning preferences and true biological optimality, we have included further analysis discussing the influence of training conditions on the model’s learned behaviours. Additionally, we propose future experiments to test alternative scanning strategies, including adaptive scanning mechanisms that adjust based on visual task demands. Furthermore, we have expanded the discussion on the simplifications made in this study, explicitly stating the limitations of the model and emphasising the need for future research to explore more flexible and biologically plausible scanning mechanisms.

    We believe these revisions significantly enhance the clarity and interpretability of the study, ensuring that the model’s findings are well contextualised within both computational and biological frameworks.