Computer code comprehension shares neural resources with formal logical inference in the fronto-parietal network

Abstract

Despite the importance of programming to modern society, the cognitive and neural bases of code comprehension are largely unknown. Programming languages might ‘recycle’ neurocognitive mechanisms originally used for natural languages. Alternatively, comprehension of code could depend on fronto-parietal networks shared with other culturally derived symbol systems, such as formal logic and math. Expert programmers (average of 11 years of programming experience) performed code comprehension and memory control tasks while undergoing fMRI. The same participants also performed language, math, formal logic, and executive control localizer tasks. A left-lateralized fronto-parietal network was recruited for code comprehension. Patterns of activity within this network distinguished between “for” loop and “if” conditional code functions. Code comprehension overlapped extensively with the neural basis of formal logic and, to a lesser degree, with that of math. Overlap with simpler executive processes and with language was low, but the laterality of language and code covaried across individuals. Cultural symbol systems, including code, depend on a distinctive fronto-parietal cortical network.
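
For readers unfamiliar with the stimulus categories named in the abstract, the sketch below shows what a short "for" loop function and an "if" conditional function can look like in Python. These are hypothetical illustrations of the two function types, not the study's actual stimuli.

```python
# Illustrative Python functions of the two kinds named in the abstract.
# These are hypothetical examples, not the study's actual stimuli.

def sum_list(numbers):
    """A 'for' loop function: iterate over a list and accumulate a total."""
    total = 0
    for n in numbers:
        total += n
    return total

def sign_label(x):
    """An 'if' conditional function: branch on the value of the input."""
    if x > 0:
        return "positive"
    elif x < 0:
        return "negative"
    return "zero"

print(sum_list([1, 2, 3]))  # prints 6
print(sign_label(-4))       # prints negative
```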

Article activity feed

  1. ###Reviewer #3:

    This fMRI study examines an interesting question, namely how computer code - as a "cognitive/cultural invention" - is processed by the human brain. However, I have a number of concerns with regard to how this question was examined in terms of experimental design, including the choice of control condition (fake code) and the way in which the localiser tasks were utilised. In addition, the sample size is very small (n=15) and there appear to be large inter-individual differences in coding performance (in spite of the recruitment of expert programmers). In summary, while the study is promising in its aims, its conclusions are weakened by these issues of design and execution.

    1. The control condition

    The experiment contrasted real Python code with fake code in the form of "incomprehensible scrambled Python functions". Real and fake code also differed with regard to the task performed (code comprehension versus memory) and were distinguished via colour coding. There is a lot to unpack here in regard to how processing might differ between the two conditions. For example, the real code blocks required code comprehension as well as computational problem solving (which does not necessarily require the use of code), while the control task required neither. As a result of the colour coding, it also appears likely that participants approached the fake code blocks with a completely different processing strategy from the one they used for the real code blocks. These are just a few of the obvious differences between the conditions, and there are likely many more given how different they are. This, in my view, makes it difficult to interpret the basic contrast between real and fake code.

    2. Use of localiser tasks

    A concern similar to that raised under point 1 applies to the localiser tasks, which were used to examine anatomical overlap (or lack thereof) between code comprehension and language, maths, logical problem solving, and multiple-demand executive control. I am generally somewhat sceptical about the use of functional localisers in view of the assumptions that necessarily enter into the definition of a localiser task. This concern is exacerbated by the way in which localisers were employed in the present study. Firstly, in addition to defining the localiser tasks themselves, the study used localiser contrasts to define networks of interest. For example, the contrast language localiser > maths localiser served to define the "language network". Thus, assumptions about the nature of the localiser itself are compounded with those regarding the nature of the contrast. Secondly, particularly with regard to language, the localiser task was very high level: participants had to judge whether an active and a passive sentence had the same meaning (with both statements remaining on the screen at the same time). While of course requiring language processing, this task is arguably also a problem-solving task of sorts. It is certainly more complex than a typical task designed to probe fast and automatic aspects of natural language processing.

    In addition, given that reading is also a cultural invention, is it really fair to say that coding is being compared to the "language network" here rather than to the "reading network" (in view of the visual presentation of the language task)? The possible implications of this for the interpretation of the data should be considered.

    More generally, while an anatomical overlap between networks active during code comprehension and networks recruited during other cognitive tasks may shed some initial light on how the brain processes code, it doesn't support any particularly strong conclusions about the neural mechanisms of code processing in my view. While code comprehension may overlap anatomically with regions involved in executive control and logic, this doesn't mean that the same neuronal populations are recruited in each task nor that the processing mechanisms are comparable between tasks.

    3. Sample size and individual differences

    At n=15, the sample size of this study is quite small, even for a neuroimaging study. This again limits the conclusions that can be drawn from the study results.

    Moreover, the results of the behavioural pre-test - which was commendably included - suggest that participants differed considerably with regard to their Python expertise. For the more difficult exercise in this pre-test, the mean accuracy score was 64.6% with a range from 37.5% to 93.75%. These substantial differences in proficiency weren't taken into account in the analysis of the fMRI data and, indeed, it appears difficult to meaningfully do so in view of the sample size.

  2. ###Reviewer #2:

    The goal of this fMRI study was to determine which brain systems support coding, by examining the extent to which univariate activation maps for code comprehension overlap with those from localizer tasks for language, logic, math, and executive functions. The basic conclusion is one we could have anticipated: coding engages a widespread frontoparietal network, with stronger involvement of the left hemisphere. It overlaps with all of the other tasks, but most with the map for logic. This doesn't seem too surprising, but the authors argue convincingly that others wouldn't have predicted it.

    It's unfortunate that there are differences in task difficulty among the tasks - in particular, that the logic task was the most difficult of all (both in terms of accuracy and response times), since that happens to be the one that had the largest number of overlapping voxels with the coding task. We can't know whether coding and language task voxels would have overlapped more if the language task had been more difficult.

    It seems a shame to present data only from highly experienced coders (11+ years of experience); I can imagine that the investigators are planning to write up another study examining effects of expertise, in comparison with less experienced coders. This seems like an initial paper that's laying the groundwork for a more groundbreaking one.

  3. ###Reviewer #1:

    This manuscript is clearly written and the methods appear to be rigorous, although the number of subjects (15) is a bit low; however, this does not appear to critically limit interpretation of the results. I appreciated the focus on expert coders, which allows a clear comparison with language. I also thought that the inclusion of multiple domains for comparison (logic, math, executive function, and language) was quite informative. The laterality covariance between code and language was also quite interesting. I do have some concerns with the literature review and the discussion of present and previous results.

    1. My main concern with this paper is that it does not clearly review previous fMRI studies on code processing. How do the present results compare with previous studies (e.g. Castelhano et al., 2019; Floyd et al., 2017; Huang et al., 2019; Krueger et al., 2020; Siegmund et al., 2017, 2014)? It seems that the localization/lateralization obtained in the present study is largely similar to that of these previous studies (e.g. Siegmund et al., 2017). If so, this should be discussed: convergence across multiple methods/authors is useful to know about. Any discrepancies are also useful to know about. The authors suggest that "Moreover, no prior study has directly compared the neural basis of code to other cognitive domains." However, Krueger et al. (2020) and Huang et al. (2019) appear to have done this.

    2. The authors should point out and discuss the difficulty of understanding the psychological and neural structure of coding in the absence of a clear theory of coding, of the kind that exists for language (e.g. Chomsky, 1965; Levelt, 1989; Lewis & Vasishth, 2005). On this point, I appreciate the reference to Fitch et al. (2005) regarding recursion in coding, but I think it would be most helpful to have a clear example of recursion in Python code (a minimal illustration is sketched after this list). However, the authors at least focus their results on neural underpinnings without attempting to make strong claims about cognitive underpinnings.

    3. The authors report overlap between code comprehension and language in the posterior MTG and IFG. They note that these activations were somewhat inconsistent; yet, they did observe this significant overlap. However, the paper discusses the results as if this overlap did not occur, e.g. "We find that the perisylvian fronto-temporal network that is selectively responsive to language, relative to math, does not overlap with the neural network involved in code comprehension." This is not accurate, as there indeed was overlap. It is important to point out that, among language-related regions, these two regions are the most strongly associated with abstract syntax (Friederici, 2017; Hagoort, 2005; Tyler & Marslen-Wilson, 2008; Pallier et al., 2011; Bornkessel-Schlesewsky & Schlesewsky, 2013; Matchin & Hickok, 2019), which could very well be a point of shared resources between code and language (as discussed in Fitch, 2005).
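
    In the spirit of the suggestion in point 2 above, a clear example of recursion in Python might look like the following minimal sketch (a hypothetical illustration, not drawn from the manuscript or its stimuli).

    ```python
    # A minimal, hypothetical illustration of recursion in Python
    # (not taken from the manuscript or its stimuli): the function
    # calls itself on a smaller input until it reaches a base case.

    def factorial(n):
        """Return n! computed recursively."""
        if n <= 1:                   # base case: stop recursing
            return 1
        return n * factorial(n - 1)  # recursive case: shrink the problem

    print(factorial(5))  # prints 120
    ```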

  4. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.

    This was co-submitted with the following manuscript: https://www.biorxiv.org/content/10.1101/2020.04.16.045732v1

    ###Summary:

    The remit of the co-submission format is to ask whether the scientific community is enriched more by the data in the co-submitted manuscripts presented together than it would be by the papers presented apart, or by only one of them. In other words, are the conclusions that can be drawn stronger or clearer when the manuscripts are considered together rather than separately? We felt that, despite significant concerns with each paper individually, especially regarding the theoretical structures within which the experimental results could be interpreted, this was the case.

    We want to be very clear that, in a non-co-submission case, we would have substantial and serious concerns about the interpretability and robustness of the Liu et al. submission given its small sample size. Furthermore, the reviewers' concerns about the suitability of the control task differed substantially between the manuscripts. We share these concerns. However, despite these differences in control task and sample size, the Liu et al. and Ivanova et al. submissions nonetheless replicated each other: the language network was not implicated in processing programming code. This replication substantially mitigates the concerns shared by us and the reviewers about sample size and control tasks. The fact that different control tasks and sample sizes did not change the overall pattern of results is, in our view, an affirmation of the robustness of the findings and of the value that both submissions, presented together, can offer the literature.

    In sum, there were concerns that both submissions were exploratory in nature, lacked a strong theoretical focus, and relied on functional localizers on novel tasks. However, these concerns were mitigated by the following strengths. Both papers ask a clear and interesting question, and their results replicate each other despite task differences. In this way, the two papers strengthen each other: the major concerns for each paper individually are ameliorated when the two are considered together.

    The concerns of the reviewers need to be addressed, including, specifically, the limits on the interpretation of your results imposed by the choice of control task and the discussion of the relevant literature mentioned by the reviewers; most crucially, please contextualize your results with regard to the other submission's results.