Computational analysis of B cell receptor repertoires in COVID-19 patients using deep embedded representations of protein sequences
This article has been reviewed by the following groups
Listed in
- Evaluated articles (ScreenIT)
Abstract
Analyzing B cell receptor (BCR) repertoires is immensely useful in evaluating one’s immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessments of clonal compositions, including V(D)J segment usage, nucleotide insertions/deletions, and amino acid distributions. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to analyze BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and computing the sum of the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector and eventually as a single data point in vector space. We demonstrate that this new approach enables us to not only accurately cluster BCR repertoires of coronavirus disease 2019 (COVID-19) patients and healthy subjects but also efficiently track minute changes in immune status over time as patients undergo treatment. Furthermore, using the distributed representations, we successfully trained an XGBoost classification model that achieved a mean accuracy rate of over 87% given a repertoire of CDR3 sequences.
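The repertoire-to-vector step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_kmer_vector` is a hypothetical stand-in for a trained ProtVec lookup table (it returns deterministic pseudo-random numbers, not learned embeddings), the `top_n` cutoff and the example CDR3 strings are invented for illustration, and summing over all overlapping 3-mers without frequency weighting is an assumption, since the abstract does not specify these details.

```python
import hashlib
import random
from collections import Counter

DIM = 100  # vector dimensionality stated in the abstract


def toy_kmer_vector(kmer, dim=DIM):
    """Hypothetical stand-in for a ProtVec lookup.

    Returns a deterministic pseudo-random vector per 3-mer so the sketch
    runs without the trained embedding table.
    """
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_sequence(seq, k=3, dim=DIM):
    """Sum the vectors of all overlapping k-mers in one amino acid sequence."""
    total = [0.0] * dim
    for i in range(len(seq) - k + 1):
        vec = toy_kmer_vector(seq[i:i + k], dim)
        total = [a + b for a, b in zip(total, vec)]
    return total


def embed_repertoire(cdr3_sequences, top_n=2, dim=DIM):
    """Sum the embeddings of the top_n most frequent sequences.

    Assumption: each selected sequence contributes once, regardless of
    how often it occurs in the repertoire.
    """
    counts = Counter(cdr3_sequences)
    total = [0.0] * dim
    for seq, _count in counts.most_common(top_n):
        vec = embed_sequence(seq, dim=dim)
        total = [a + b for a, b in zip(total, vec)]
    return total


# Invented example repertoire: three distinct CDR3-like strings.
repertoire = ["CARDYW", "CARDYW", "CARDYW", "CAKGGW", "CAKGGW", "CTRSSW"]
repertoire_vector = embed_repertoire(repertoire, top_n=2)
print(len(repertoire_vector))  # 100
```

Each repertoire then becomes one fixed-length vector, which is what allows repertoires of different sizes to be clustered or passed to a classifier as single data points.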
Article activity feed
SciScore for 10.1101/2021.08.02.454701:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
Software and Algorithms
| Sentences | Resources |
| --- | --- |
| Embedding amino acid sequences using ProtVec: ProtVec regards amino acid 3-mers as ‘biological words’ and effectively embeds each of such words (over 9,048 words in total) into a 100-dimensional vector. | ProtVec (suggested: None) |
| Training binary classification models: In training the models, we utilized the SVM implementation from scikit learn library (version 0.22.2.post1) and XGB-Classifier from the XGBoost Python package (version 0.90). | Python (suggested: IPython, RRID:SCR_001658) |
| We evaluated both models using 5-fold cross validation (StratifiedKFold implementation from scikit-learn) and ROC/AUC computation (scikit-learn). | scikit-learn (suggested: scikit-learn, RRID:SCR_002577) |
Results from OddPub: Thank you for sharing your data.
Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- No funding statement was detected.
- No protocol registration statement was detected.
Results from scite Reference Check: We found no unreliable references.
