IntegrateALL: an end-to-end RNA-seq analysis pipeline for multilevel data extraction and interpretable subtype classification in B-precursor ALL
Abstract
Transcriptome sequencing (RNA-seq) is emerging as a diagnostic standard for B-cell precursor acute lymphoblastic leukemia (B-ALL). Expression-based classifiers reach ∼95% accuracy, but reproducible end-to-end solutions that also integrate transcript-derived genomic drivers and quantitative virtual karyotyping are lacking. We developed IntegrateALL, a Snakemake pipeline that standardizes RNA-seq analysis from FASTQ to rule-based subtype assignment across 26 WHO-HAEM5/ICC entities by integrating expression-based subtype prediction, gene fusion and hotspot SNV calling, and virtual karyotyping. We introduce KaryALL, a machine-learning classifier that uses normalized expression and minor-allele-frequency features (RNASeqCNV) to distinguish near-haploid, hypodiploid and high-hyperdiploid B-ALL as well as chromosome 21 gains/iAMP21 (accuracy 0.98, F1-score 0.96 on 615 independent test samples). SNP-array concordance supported RNA-based karyotyping. Applied to 774 unselected B-ALL cases, IntegrateALL yielded unambiguous subtype assignments in 81.5%, based either on concordance of the gene expression class with a defining driver (75.3% of all cases) or, in selected cases, on high-confidence expression-based classification alone (6.2%); the remaining 18.5% were flagged for manual curation. Independent validation (3 cohorts; n=436, including pediatric cases) reproduced these distributions. Across all patients (n=1,210), 2.6% harbored two subtype-defining drivers, including hyperdiploidy in fusion-driven subtypes where it was not expected, or subtype-defining SNVs (e.g., PAX5 P80R, IKZF1 N159Y) co-occurring with BCR::ABL1-positive/-like status, KMT2A fusions or DUX4 fusions. In most dual-driver cases, one subtype gene expression signature predominated, indicating a hierarchy of oncogenic control and the value of systematic driver screening alongside expression-based calls.
IntegrateALL provides an adaptable, fully reproducible workflow for molecular B-ALL characterization by systematically integrating genomic drivers and downstream gene regulation.
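
The three-way outcome described in the abstract (driver-concordant assignment, high-confidence expression-only assignment, or flagging for manual curation) can be illustrated with a minimal Python sketch. This is not the IntegrateALL implementation: the `Sample` fields, the `assign_subtype` function, the 0.9 confidence threshold and the set of subtypes allowed to be called from expression alone are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical values, not taken from the pipeline.
HIGH_CONFIDENCE = 0.9
EXPRESSION_ONLY_SUBTYPES = {"BCR::ABL1-like"}


@dataclass
class Sample:
    expression_class: str          # subtype predicted from gene expression
    expression_confidence: float   # classifier confidence, assumed in [0, 1]
    driver_subtype: Optional[str]  # subtype implied by a detected driver, if any


def assign_subtype(sample: Sample) -> Tuple[str, Optional[str]]:
    """Return (status, subtype) under the assumed rules."""
    # Rule 1: expression class agrees with a subtype-defining driver.
    if sample.driver_subtype is not None:
        if sample.driver_subtype == sample.expression_class:
            return ("driver_concordant", sample.expression_class)
        # A driver that disagrees with the expression class is not resolved here.
        return ("manual_curation", None)

    # Rule 2: no driver found, but a high-confidence expression call is
    # accepted for the selected subtypes only.
    if (
        sample.expression_class in EXPRESSION_ONLY_SUBTYPES
        and sample.expression_confidence >= HIGH_CONFIDENCE
    ):
        return ("expression_only", sample.expression_class)

    # Rule 3: everything else is flagged for review.
    return ("manual_curation", None)
```

Under these assumptions, a sample whose expression class matches its detected driver is assigned directly, while a sample with a discordant driver or a low-confidence driverless call falls through to manual curation.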