Predicting causal citations without full text

Abstract

Insights from biomedical citation networks can be used to identify promising avenues for accelerating research and its downstream bench-to-bedside translation. Citation analysis generally assumes that each citation documents causal knowledge transfer that informed the conception, design, or execution of the main experiments; however, citations may exist for other reasons. In this paper we identify a subset of citations that are unlikely to represent causal knowledge flow. Using a large, comprehensive feature set of open access data, we train a predictive model to identify such citations. The model relies only on the title, abstract, and reference set, not the full text or future citation patterns, making it suitable for publications as soon as they are released, or those behind a paywall (the vast majority). We find that the model assigns high prediction scores to citations that were likely added during the peer review process and, conversely, low prediction scores to citations that are known to represent causal knowledge transfer. Using the model, we find that federally funded biomedical research publications account for 30% of the estimated causal knowledge transfer from basic studies to clinical research, even though they comprise only 10% of the literature: a three-fold overrepresentation in this important type of knowledge transfer. This finding underscores the importance of federal funding as a policy lever to improve human health.
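To make the setup described in the abstract concrete, here is a minimal sketch of a metadata-only citation classifier. This is not the authors' implementation: the specific features (title/abstract similarity and shared references) and the gradient-boosted classifier are assumptions chosen for illustration. The only point being illustrated is that every input is available at publication time, with no full text and no future citation counts.

```python
# Hypothetical sketch (not the authors' code): score citation pairs using only
# title, abstract, and reference-set metadata.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def pair_features(citing, cited, vectorizer):
    """Features for one citation pair; `citing` and `cited` are dicts with
    'title', 'abstract', and 'references' (a set of reference identifiers)."""
    tfidf = vectorizer.transform([
        citing["title"] + " " + citing["abstract"],
        cited["title"] + " " + cited["abstract"],
    ])
    text_similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    shared_references = len(citing["references"] & cited["references"])
    return [text_similarity, shared_references, len(cited["references"])]

def train(pairs, labels):
    """`pairs` is a list of (citing, cited) dict tuples; `labels[i]` is 1 if the
    citation appeared only in the published version, 0 if it was already
    present in the preprint."""
    vectorizer = TfidfVectorizer().fit(
        [p["title"] + " " + p["abstract"] for pair in pairs for p in pair])
    X = np.array([pair_features(citing, cited, vectorizer)
                  for citing, cited in pairs])
    model = GradientBoostingClassifier().fit(X, labels)
    return vectorizer, model
```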

Significance statement

Citation networks document knowledge flow across the literature, and insights from these networks are increasingly used to inform science policy decisions. However, many citations are known not to be causally related to the inception, design, and execution of the citing study, which adds noise to the insights derived from these networks. Here, we show that it is possible to train a machine learning model to identify such citations, and that the model learns to identify known causal citations as well. We use this model to show that government funding drives a disproportionate amount of causal knowledge transfer from basic to clinical research. This result highlights a straightforward policy lever for accelerating improvements to human health: federal funding.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/8201231.

    This review reflects comments and contributions from Melissa Chim, Martyn Rittman, Gary McDowell, and Jessica Polka. Review synthesized by Jessica Polka. 

    Brief summary of the study

    • The study looks at whether references that directly informed how a study was carried out can be differentiated from those that are cited as examples or background knowledge. The strategy used was to build predictive models based on whether references in a journal publication were present in a preprint version of the same paper. Overall, the paper is well written and the process of building and running the model is well documented.
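    As a concrete illustration of the labeling strategy summarized above, here is a minimal sketch that compares a journal article's reference list against the preprint's. Matching by normalized title and the 0.9 similarity threshold are assumptions for illustration, not the authors' method; DOI matching would be preferable where identifiers exist.

    ```python
    # Hypothetical sketch: references in the journal version with no close match
    # in the preprint are treated as "added during review".
    from difflib import SequenceMatcher

    def normalize(title):
        """Lower-case and collapse whitespace so near-identical titles compare equal."""
        return " ".join(title.lower().split())

    def added_during_review(journal_refs, preprint_refs, threshold=0.9):
        """Return journal-version reference titles with no close match in the preprint."""
        preprint_norm = [normalize(t) for t in preprint_refs]
        added = []
        for ref in journal_refs:
            best = max(
                (SequenceMatcher(None, normalize(ref), p).ratio() for p in preprint_norm),
                default=0.0,
            )
            if best < threshold:
                added.append(ref)
        return added
    ```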

    Major comments

    • The paper develops a model to predict whether a reference was present in a paper before peer review or was added afterwards. The result is extended to suggest that the model predicts whether the reference was causal or not. Despite the follow-up examples given, the authors have not convinced me that the model can be applied to determine causal references. Before peer review, papers contain a mix of causal and non-causal references, and it seems important to ask whether the model can predict non-causal references in a preprint.

    • For validation, I would recommend a dataset of journal articles where the references have been manually marked up as causal or non-causal without knowledge of whether they were in the preprint. The validation cases chosen assume that the model is measuring what the authors intended to measure and do not rule out other explanations or explore references for which determining a causal relationship is difficult.

    • Given the human nature of science, it would be helpful to see some discussion of bias in citation and the likely interplay between who gets cited, who gets funded, and whose work becomes established enough to be translated into an outcome. For example, we know of pervasive and growing gender biases in citations (https://www.nature.com/articles/s41593-020-0658-y) and that these biases are driven by the citation practices of individual researchers. Could discrepancies in NIH funding and network effects in citation patterns work together to make it a foregone conclusion that clinical trials cite NIH-funded work, because that work has become endemic and the clinical trials themselves are not as distinct (in funding, personal networks, or scientists) from the NIH-funded research as this work seems to assume? Relatedly, authors may choose not to cite the work of competitors, a decision that could be influenced by their biased visibility of the competitors' work.

    • The analysis assumes that preprints are posted before the peer review process, but some bioRxiv preprints are posted after peer review, either after one or more rounds of peer review at the journal in which they are ultimately published, or after rejection at another journal. 

    • Furthermore, references added during the review process actually have the potential to have substantially changed the work, because at the preprint stage the authors were not yet aware of them.

    Minor comments

    • I couldn't help but see a parallel to the work of Scite and would like to see how this paper's perspective aligns, or does not align, with Scite's work. One limitation that may need to be mentioned is that a citation does not necessarily confirm that the resource was read.

    • I am missing the link between citation patterns and federal funding/policy creation. The data sources listed may not contain a broad range of policy documents. It is also mentioned later that leading to clinical advancements is another motivator for this research. It would be helpful to have a tighter description of the goal. 

    • I would appreciate seeing the limitations discussed more upfront as I often had to go back and forth throughout the paper to see if such limitations had been addressed. 

    • Regarding the statement: "many, if not most, of citations in the scientific literature represent transfer of information that did not directly influence the inception, design, or execution of a research study." I haven't seen extensive discussion of non-causal references causing difficulties in the awarding of grants. I would recommend that the authors tone down this statement or cite previous discussion of this as an issue. I think some work is needed in setting up the premise here. I don't understand how the points in the first paragraph relate to one another: US federal agencies and federally funded investigators advance knowledge; open datasets have allowed the investigation of linked knowledge networks; the linkages in these networks can identify promising avenues of research; but this work is frustrated by most citations apparently not being causal? These points aren't linked well here, so I don't actually agree with the premise, and the argument for why only causal citations are desirable in this case needs to be established more clearly.

    • "We confirm the inverse, that papers with high causal uncertainty have lower ranked citation rates in general" - Isn't an explanation for this that authors are more likely to be aware of highly cited papers and therefore have added them prior to peer review, regardless of whether the citation is causal or not? Authors might be more inclined to omit a citation to a less-visible paper, even if they had seen it and benefited from it.

    • "This could be interpreted as a measure of the authoritativeness of the referenced paper, since these other related works deemed it important enough to cite." How are these being dissociated from personal network effects, particularly the Matthew Effect (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4233686/)  Overall as I'm reading, a question in my mind is how the conclusion isn't just going to be explained by NIH-funded investigators being a network of people citing each other, leading to their papers becoming "endemic" (as described in the paper I linked to here) and so of course being cited and resulting in knowledge transfer to practical outcomes?

    • Figure 1B: is there an arrow missing in panel B from the blue dot to the red dot?

    • "These citations are highly suspect;" I'd suggest softening the language here. A journal on the same topic is likely to have some relevant material. While suggesting large numbers of references from the journal could be problematic, the rate of these additions should be expected to be the same as typical journal self-citations.

    • "It stands to reason that the more basic-research favored for federal funding might be less-well cited by clinical articles, due to the larger conceptual distance to applied clinical research." Are there NIH-funded researchers carrying out clinical trials, and NIH-funded clinical trials, that may be citing their own research? Does this assume that there is a group carrying out clinical trials, and a group carrying out "basic sciences" NIH-funded research, and the two are separate? Also this does not seem to be breaking out non-clinical NIH data - would there be a difference if they looked at work funded by funding opportunities that specifically exclude clinical trials?

    • "Instead, given the F1-score of 0.7, we estimate that a comparable proportion of causal citations are likely to be found in the group of citations with a low causal prediction score. We therefore refer to this group of citations as estimated causal citations." Given this definition, might another name be more appropriate?

    • "Citations in the experiment sections are mostly related to baseline comparison and contribution analysis. These kinds of citations are important to provide the readers with enough context and knowledge on the topic and bring them to the same page as the authors. However, these citations do not necessarily have intellectual impact" In general I think it is not so good to say that those citation don't (necessarily) have intellectual impact. The authors jump a bit between "causal" in a narrow and in a broad context, and between causal as defined by what one ideally would want to measure (an actual intellectual contribution) and what they actually measure (a proxy, of which we don't know how close it comes to the ideal). Especially with citations from the experimental sections as mentioned in this paragraph, the authors even use citations from methods (about fluorescent proteins) as a verification for their datasets if I am correct. IMO the discussion (and other parts where this comes up) should be very precise in the definition and delineation of what is meant by "causal" in the sense of their study as opposed to the ideal and stick to one appropriate definition. I'm very surprised to see that citations in the experimental method are being dismissed in this way, as they earlier stated that citations that would be necessary for the work to take place were causal, and by their very definition the practical techniques needed are necessary. This is introducing a new slant that is prioritizing practical necessity over intellectual necessity, and I agree that the authors need to be crystal clear about what they mean as causal.

    Comments on reporting

    • It would be helpful to see some descriptive comparisons of the citations added between preprint & final version, prior to launching into discussion of ML.

    • Do the verification datasets overlap with any of the training datasets?
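    As a concrete version of the overlap question above, a check along the following lines would address it; representing each citation as a (citing DOI, cited DOI) tuple is an assumption for illustration.

    ```python
    # Hypothetical leakage check: report citation pairs and individual papers
    # shared between the training data and a verification dataset.
    def overlap_report(train_pairs, verification_pairs):
        """Each element is a (citing_doi, cited_doi) tuple."""
        shared_pairs = set(train_pairs) & set(verification_pairs)
        train_papers = {doi for pair in train_pairs for doi in pair}
        verification_papers = {doi for pair in verification_pairs for doi in pair}
        return shared_pairs, train_papers & verification_papers
    ```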

    Suggestions for future studies

    • It would be interesting to see if in the future a similar study could be done with full text sources.

    • I think the authors don't try to extract actual information from the model about which parameters are important. If, as proposed, this model is to be used for funding decisions, there should be some clear discussion of, and warnings about, biases, and of how the model performs compared to other methods. Is it beneficial to sort out the "lower" 30% / keep the upper 30% if doing so introduces harmful biases?

    Conflicts of interest of reviewers

    • None declared

    Competing interests

    The author declares that they have no competing interests.