Accurate and scalable multi-disease classification from adaptive immune repertoires

Natnicha Jiravejchakul
Ayan Sengupta
Songling Li
Debottam Upadhyaya
Mara A. Llamas-Covarrubias
Florian Hauer
Soichiro Haruna
Daron M. Standley

Abstract

Background

Machine learning models trained on paratope-similarity networks have shown superior accuracy compared with clonotype-based models in binary disease classification. However, the computational demands of paratope networks hinder their use on large datasets and multi-disease classification.

Methods

We reanalyzed publicly available T cell receptor (TCR) repertoire data from 1,421 donors across 15 disease groups and a large control group, encompassing approximately 81 million TCR sequences. To address computational bottlenecks, we replaced the paratope-similarity network approach (Paratope Cluster Occupancy, or PCO) with a new Fast Approximate Clustering Techniques (FACTS) pipeline, which comprises four main steps: (1) high-dimensional vector encoding of sequences; (2) efficient clustering of the resulting vectors; (3) donor-level feature construction from cluster distributions; and (4) gradient-boosted decision tree classification for multi-class disease prediction.
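As an illustration of step (3) only, the following minimal sketch turns per-sequence cluster assignments into one feature vector per donor. The abstract does not specify how features are derived from cluster distributions, so the normalized-occupancy definition used here, the function name donor_features, and the toy donor data are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def donor_features(cluster_ids, n_clusters):
    """Return the fraction of a donor's sequences that fall in each cluster.

    Assumption: features are normalized cluster occupancies. The source
    only says features are built "from cluster distributions".
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    if total == 0:
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]


# Hypothetical output of step (2): one cluster id per sequence, per donor.
assignments = {
    "donor_A": [0, 0, 2, 1],
    "donor_B": [2, 2, 2, 0],
}

# One fixed-length vector per donor, usable as classifier input in step (4).
features = {
    donor: donor_features(ids, n_clusters=3)
    for donor, ids in assignments.items()
}
```

Normalizing by repertoire size keeps donors with different sequencing depths comparable, which is one plausible reason to prefer fractions over raw counts.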

Findings

FACTS processed 10⁷ sequences in under 120 CPU hours. Using only TCR data, and evaluated with 5-fold cross-validation, it achieved a mean ROC AUC of 0.99 across 16 disease classes. Compared with the recently reported Mal-ID model, FACTS achieved higher donor-level classification accuracy for BCR (0.840 vs. 0.740), TCR (0.882 vs. 0.751), and combined BCR+TCR datasets (0.904 vs. 0.853) on the six-class Mal-ID benchmark. FACTS also preserved biologically meaningful signals, as shown by unsupervised t-SNE projections revealing distinct disease-associated and potentially age-associated clusters.
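For readers unfamiliar with a mean ROC AUC over many classes, the sketch below shows one common way to compute it: a one-vs-rest AUC per class, averaged with equal weight. The abstract does not state which averaging scheme was used, so the unweighted (macro) average, the function names, and the toy probabilities are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ovr_auc(prob_rows, true_classes, classes):
    """Unweighted mean of one-vs-rest AUCs (assumed averaging scheme)."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]      # predicted P(class k)
        labels = [t == cls for t in true_classes]   # class k vs. the rest
        aucs.append(roc_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Hypothetical predicted class probabilities for four donors.
classes = ["control", "disease_1", "disease_2"]
true_classes = ["control", "disease_1", "disease_2", "disease_1"]
prob_rows = [
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
    [0.50, 0.05, 0.45],  # a misranked disease_1 donor
]

mean_auc = mean_ovr_auc(prob_rows, true_classes, classes)
```

In this toy case the misranked donor lowers only the disease_1 AUC, which shows how a single weak class can pull down a macro average.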

Interpretation

Paratope-based encoding with FACTS-derived features provides a scalable and biologically grounded approach for adaptive immune receptor (AIR) repertoire classification. The resulting classifier achieves superior multi-disease diagnostic performance while maintaining interpretability, supporting its potential for clinical and population-scale health profiling.

Funding

This study was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI [JA23H034980], the Japan Agency for Medical Research and Development (AMED) [JP25am0101001], and the Kishimoto Foundation Fellowship.

Research in context

Evidence before this study

T and B cell receptor (TCR and BCR) repertoires encode lifelong immunological memory and antigen-specific responses, making them valuable biomarkers for disease diagnosis and prediction. Existing machine learning (ML) models for adaptive immune receptor (AIR) repertoires often rely on clonotype-based representations, which limit the detection of receptors shared between donors and thus of disease signatures that recur across individuals. Most models also lack robust multi-disease, population-scale performance. Our previous work showed that representing repertoires as paratope-similarity networks increased the fraction of shared receptors between donors and improved disease classification. However, the computational complexity of these networks has limited their scalability to the large datasets required for multi-disease classification.

Added value of this study

We introduce FACTS, a unified ML framework integrating paratope similarity with scalable sequence encoding. Applied to TCR repertoires from 1,421 donors across 15 diseases and one control group, FACTS maintained high performance while efficiently processing 81 million sequences on standard CPU infrastructure. Compared to Mal-ID, our paratope-encoded method achieved significantly higher donor-level accuracy and revealed biologically meaningful disease- and potentially age-associated patterns.

Implications of all the available evidence

FACTS offers high accuracy and interpretability for multi-disease classification, bringing AIR repertoire-based diagnostics closer to clinical translation and potentially guiding precision immunotherapy and immune-based therapeutic discovery for a wide range of diseases.

Version published to 10.1101/2025.08.12.669991 on bioRxiv
Aug 16, 2025
