Accurate and scalable multi-disease classification from adaptive immune repertoires
Abstract
Background: Machine learning models trained on paratope-similarity networks have shown superior accuracy compared with clonotype-based models in binary disease classification. However, the computational demands of paratope networks hinder their use on large datasets and in multi-disease classification.

Methods: We reanalyzed publicly available T cell receptor (TCR) repertoire data from 1,421 donors across 15 disease groups and a large control group, encompassing approximately 81 million TCR sequences. To address computational bottlenecks, we replaced the paratope-similarity network approach (Paratope Cluster Occupancy, or PCO) with a new Fast Approximate Clustering Techniques (FACTS) pipeline, which comprises four main steps: (1) high-dimensional vector encoding of sequences; (2) efficient clustering of the resulting vectors; (3) donor-level feature construction from cluster distributions; and (4) gradient-boosted decision tree classification for multi-class disease prediction.

Findings: FACTS processed 10⁷ sequences in under 120 CPU hours. Using only TCR data, and evaluated with 5-fold cross-validation, it achieved a mean ROC AUC of 0.99 across 16 disease classes. Compared with the recently reported Mal-ID model, FACTS achieved higher donor-level classification accuracy for BCR (0.840 vs. 0.740), TCR (0.882 vs. 0.751), and combined BCR+TCR datasets (0.904 vs. 0.853) on the six-class Mal-ID benchmark. FACTS also preserved biologically meaningful signals, as shown by unsupervised t-SNE projections revealing distinct disease-associated and age-associated clusters.

Interpretation: Paratope-based encoding with FACTS-derived features provides a scalable and biologically grounded approach for adaptive immune receptor (AIR) repertoire classification. The resulting classifier achieves superior multi-disease diagnostic performance while maintaining interpretability, supporting its potential for clinical and population-scale health profiling.
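Illustrative sketch (not the authors' implementation): step (3) of the pipeline, donor-level feature construction from cluster distributions, can be pictured as turning each donor's cluster assignments into a vector of cluster-occupancy fractions. The function name `donor_cluster_features`, the input layout, and the choice to normalize by the donor's sequence count are assumptions made for this example only; the abstract does not specify how FACTS builds its features.

```python
from collections import Counter


def donor_cluster_features(cluster_ids_by_donor, n_clusters):
    """Map each donor to the fraction of its sequences falling in each cluster.

    cluster_ids_by_donor: dict of donor id -> list of integer cluster labels,
        one label per sequence (assumed to come from a prior clustering step).
    n_clusters: total number of clusters, so every donor gets a vector of the
        same length.
    """
    features = {}
    for donor, cluster_ids in cluster_ids_by_donor.items():
        counts = Counter(cluster_ids)
        total = len(cluster_ids)
        # Normalize counts so repertoires of different sizes are comparable.
        features[donor] = [
            counts.get(c, 0) / total if total else 0.0
            for c in range(n_clusters)
        ]
    return features


# Toy input (hypothetical): cluster labels already assigned to each sequence.
toy = {"donor_A": [0, 0, 1, 2], "donor_B": [2, 2, 2, 1]}
features = donor_cluster_features(toy, n_clusters=3)
```

In a pipeline of this shape, the resulting fixed-length vectors, one per donor, would be the inputs to the multi-class classifier in step (4).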
Funding: This study was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI [JA23H034980], the Japan Agency for Medical Research and Development (AMED) [JP25am0101001], and the Kishimoto Foundation Fellowship.