Pipeline for retrieval of COVID-19 immune signatures



Abstract

The accelerating pace of biomedical publication has made it a key challenge to retrieve relevant papers and extract specific, comprehensive scientific information from them. A timely example is retrieving the subset of papers that report immune signatures (coherent sets of biomarkers), which help explain the immune response mechanisms that drive differential SARS-CoV-2 infection outcomes. A systematic and scalable approach is needed to identify and extract COVID-19 immune signatures in a structured, machine-readable format.

Materials and Methods

We used SPECTER embeddings with SVM classifiers to automatically identify papers containing immune signatures. A generic web platform was used to manually screen papers and allow anonymous submission.
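
As a rough illustration of how SVM classifiers can operate on document embeddings, the sketch below chains two scikit-learn SVMs: the first decides whether a paper reports an immune signature, and the second assigns a signature type to the papers that pass. It is not the authors' implementation. The vectors are synthetic stand-ins for SPECTER embeddings, and the dimensionality, linear kernel, class names, and helper functions (`synthetic_embeddings`, `classify`) are all assumptions made for this example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM = 768  # assumed embedding size; real vectors would come from SPECTER


def synthetic_embeddings(kind, n):
    """Hypothetical helper: random vectors shifted by class (not real data)."""
    x = rng.normal(size=(n, DIM))
    if kind == "gene_expression":
        x[:, : DIM // 2] += 1.0  # shift the first half of the dimensions
    elif kind == "other":
        x[:, DIM // 2 :] += 1.0  # shift the second half of the dimensions
    return x  # kind == "none" leaves the vectors unshifted


# Synthetic training sets for each class.
x_none = synthetic_embeddings("none", 60)
x_gene = synthetic_embeddings("gene_expression", 40)
x_other = synthetic_embeddings("other", 40)

# Stage 1: does the paper report an immune signature (1) or not (0)?
stage1 = SVC(kernel="linear").fit(
    np.vstack([x_none, x_gene, x_other]),
    np.array([0] * 60 + [1] * 80),
)

# Stage 2: which type of signature, trained on signature papers only.
stage2 = SVC(kernel="linear").fit(
    np.vstack([x_gene, x_other]),
    np.array(["gene_expression"] * 40 + ["other"] * 40),
)


def classify(embeddings):
    """Run stage 1, then stage 2 only on the papers stage 1 accepted."""
    embeddings = np.asarray(embeddings)
    labels = np.full(len(embeddings), "no_signature", dtype=object)
    accepted = stage1.predict(embeddings) == 1
    if accepted.any():
        labels[accepted] = stage2.predict(embeddings[accepted])
    return labels


# Usage: classify a few unseen synthetic papers.
demo = np.vstack([
    synthetic_embeddings("none", 3),
    synthetic_embeddings("gene_expression", 3),
    synthetic_embeddings("other", 3),
])
predicted = classify(demo)
```

Running the second stage only on papers accepted by the first keeps the type classifier from having to model papers with no signature at all.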

Results

We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. This demonstrates the efficacy of using an SVM classifier with document embeddings of the abstract and title to retrieve papers with scientifically salient information, even when that information is rarely present in the abstract. Additionally, classification based on the embeddings identified the type of immune signature (e.g., gene expression vs. other types of profiling) with a positive predictive value of 74%.
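
For readers unfamiliar with the metric, positive predictive value (PPV) is the fraction of papers flagged by the classifier that truly contain a signature: TP / (TP + FP). The sketch below computes it on hypothetical toy labels. The function name and the labels are illustrative only and are not data from the study.

```python
def positive_predictive_value(y_true, y_pred):
    """Return TP / (TP + FP) for binary labels (1 = has signature)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        raise ValueError("no positive predictions; PPV is undefined")
    return tp / (tp + fp)


# Hypothetical toy labels: 4 papers flagged, 3 of them correctly.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]
ppv = positive_predictive_value(y_true, y_pred)  # 3 / (3 + 1)
```

PPV ignores missed papers (false negatives), so a retrieval pipeline can report a high PPV while still overlooking relevant publications.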

Conclusion

Coupling a classifier based on document embeddings with direct author engagement offers a promising pathway to build a semistructured representation of scientifically relevant information. Through this approach, partially automated literature mining can help rapidly create semistructured knowledge repositories for automatic analysis of emerging health threats.

Article activity feed

  1. SciScore for 10.1101/2021.12.29.474353:

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    NIH rigor criteria are not applicable to paper type.

    Table 2: Resources

    Experimental Models: Organisms/Strains
    Sentences / Resources
    SciPy 1.5.4 [20] was used for statistical tests.
    SciPy 1.5.4
    suggested: None
    Software and Algorithms
    Sentences / Resources
    Generic online platform: We developed a general-purpose online literature review, author solicitation, and information sharing platform powered by the Django web framework (djangoproject.com) for templating and user management, a MongoDB database backend (mongodb.com), Bootstrap (getbootstrap.com) for layout, and jQuery (jquery.com) for streamlined scripting (Figure 1).
    Django
    suggested: (Django, RRID:SCR_012855)
    Two-stage SVM classifier: We developed a two-stage Support Vector Machine (SVM) based classifier for the filtered CORD-19 literature, using sklearn version 0.24.2 [16] with Python 3.6.10.
    Python
    suggested: (IPython, RRID:SCR_001658)
    Data Analysis: Python Data Analysis Library pandas 0.24.2 [18] was used to manage the data from CORD-19 and the pipeline for analysis.
    Python Data Analysis Library
    suggested: None
    Data
    suggested: None

    Results from OddPub: Thank you for sharing your code.


    Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy to digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.