Computational analysis of B cell receptor repertoires in COVID-19 patients using deep embedded representations of protein sequences

Abstract

Analyzing B cell receptor (BCR) repertoires is valuable for evaluating an individual's immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessments of clonal composition, including V(D)J segment usage, nucleotide insertions/deletions, and amino acid distributions. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to analyze BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and summing the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector, that is, as a single data point in vector space. We demonstrate that this approach enables us not only to accurately cluster the BCR repertoires of coronavirus disease 2019 (COVID-19) patients and healthy subjects but also to efficiently track minute changes in immune status over time as patients undergo treatment. Furthermore, using the distributed representations, we trained an XGBoost classification model that achieved a mean accuracy of over 87% given a repertoire of CDR3 sequences.
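
The repertoire-to-vector step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_kmer_vector` is a hypothetical stand-in that generates a pseudo-random vector for each 3-mer instead of loading trained embeddings, and the choices to use overlapping 3-mers, an unweighted sum, and a `top_n` cutoff of 2 are assumptions made for the example. The CDR3-like strings are invented.

```python
from collections import Counter
import random

EMBED_DIM = 100  # dimensionality stated in the abstract


def toy_kmer_vector(kmer):
    """Hypothetical stand-in for a trained 3-mer embedding lookup.

    Returns a deterministic pseudo-random vector per 3-mer; a real
    pipeline would look the 3-mer up in a pretrained embedding table.
    """
    rng = random.Random(kmer)  # string seeds are deterministic across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def embed_sequence(seq, k=3):
    """Sum the vectors of all overlapping k-mers in one sequence (assumption)."""
    vec = [0.0] * EMBED_DIM
    for i in range(len(seq) - k + 1):
        for j, x in enumerate(toy_kmer_vector(seq[i:i + k])):
            vec[j] += x
    return vec


def embed_repertoire(sequences, top_n=2):
    """Sum the sequence vectors of the top_n most frequent sequences."""
    rep = [0.0] * EMBED_DIM
    for seq, _count in Counter(sequences).most_common(top_n):
        for j, x in enumerate(embed_sequence(seq)):
            rep[j] += x
    return rep


# Invented example repertoire; "CARDYW" occurs twice, so it ranks first.
repertoire = ["CARDYW", "CASSLW", "CARDYW", "CAKGGW"]
rep_vec = embed_repertoire(repertoire, top_n=2)
print(len(rep_vec))
```

Each repertoire ends up as one fixed-length vector regardless of how many sequences it contains, which is what allows repertoires to be clustered or passed to a classifier as single data points.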

Article activity feed

SciScore for 10.1101/2021.08.02.454701

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
Embedding amino acid sequences using ProtVec: ProtVec regards amino acid 3-mers as ‘biological words’ and effectively embeds each of such words (over 9,048 words in total) into a 100-dimensional vector.	ProtVec suggested: None
Training binary classification models: In training the models, we utilized the SVM implementation from the scikit-learn library (version 0.22.2.post1) and XGB-Classifier from the XGBoost Python package (version 0.90).	Python suggested: (IPython, RRID:SCR_001658)
We evaluated both models using 5-fold cross validation (StratifiedKFold implementation from scikit-learn) and ROC/AUC computation (scikit-learn).	scikit-learn suggested: (scikit-learn, RRID:SCR_002577)
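
The evaluation procedure quoted in the table (5-fold stratified cross-validation with ROC AUC) might look roughly like the sketch below. It is only an illustration under stated assumptions: the data are synthetic Gaussian vectors standing in for 100-dimensional repertoire embeddings, the helper `cross_validated_auc` and its parameters are hypothetical names, and only the scikit-learn SVM is shown. The kernel, shuffling, and seed are arbitrary choices, not settings reported in the manuscript.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 20 vectors per class, 100 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 100)),  # class 0
    rng.normal(1.0, 1.0, size=(20, 100)),  # class 1, shifted mean
])
y = np.array([0] * 20 + [1] * 20)


def cross_validated_auc(model_factory, X, y, n_splits=5, seed=0):
    """Return the ROC AUC on each held-out fold of a stratified k-fold split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model = model_factory()  # fresh, unfitted model for every fold
        model.fit(X[train_idx], y[train_idx])
        # Continuous decision scores are used for ROC AUC, not hard labels.
        scores = model.decision_function(X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], scores))
    return aucs


aucs = cross_validated_auc(lambda: SVC(kernel="rbf"), X, y)
mean_auc = float(np.mean(aucs))
print(len(aucs), mean_auc)
```

Stratification keeps the class ratio the same in every fold, which matters when the two groups are of unequal size. A gradient-boosted classifier exposing the scikit-learn estimator interface could be evaluated the same way by scoring with `predict_proba(...)[:, 1]` instead of `decision_function`.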

Results from OddPub: Thank you for sharing your data.

Results from LimitationRecognizer: An explicit section about the limitations of the techniques employed in this study was not found. We encourage authors to address study limitations.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Related articles

Dynamics of memory cell subsets in human tonsils with age: impact on the functional reconfiguration of the organ

Massively Scalable Single-cell Multiomic Profiling of T Cell Repertoire with REFLEX

Tlr7-biallelism defines a hyperfunctional state of female B lymphocytes