IntegrateALL: an end-to-end RNA-seq analysis pipeline for multilevel data extraction and interpretable subtype classification in B-precursor ALL

Abstract

Transcriptome sequencing (RNA-seq) is emerging as a diagnostic standard for B-cell precursor acute lymphoblastic leukemia (B-ALL). Expression-based classifiers reach ∼95% accuracy, but reproducible end-to-end solutions that also integrate transcript-derived genomic drivers and quantitative virtual karyotyping are lacking. We developed IntegrateALL, a Snakemake pipeline that standardizes RNA-seq analysis from FASTQ to rule-based subtype assignment across 26 WHO-HAEM5/ICC entities by integrating expression-based subtype prediction, gene fusion and hotspot SNV calling, and virtual karyotyping. We introduce KaryALL, a machine-learning classifier that uses normalized expression and minor-allele-frequency features (RNASeqCNV) to distinguish near-haploid, hypodiploid and high-hyperdiploid B-ALL as well as chromosome 21 gains/iAMP21 (accuracy 0.98, F1-score 0.96 on 615 independent test samples). SNP-array concordance supported RNA-based karyotyping. Applied to 774 unselected B-ALL cases, IntegrateALL yielded unambiguous subtype assignments in 81.5% of cases, based on concordance of the gene expression class with a defining driver (75.3% of all cases) or, in selected cases, on high-confidence expression-based classification alone (6.2%); the remainder (18.5%) were flagged for manual curation. Independent validation (3 cohorts; n=436, including pediatric cases) reproduced these distributions. Across all patients (n=1,210), 2.6% harbored two subtype-defining drivers, including hyperdiploidy in fusion-driven subtypes where it was not expected, or subtype-defining SNVs (e.g., PAX5 P80R / IKZF1 N159Y) co-occurring with BCR::ABL1-positive/-like, KMT2A or DUX4 fusions. In most dual-driver cases, one subtype gene expression signature predominated, indicating a hierarchy of oncogenic control and the value of systematic driver screening alongside expression-based calls.
IntegrateALL provides an adaptable, fully reproducible workflow for molecular B-ALL characterization by systematically integrating genomic drivers and downstream gene regulation.
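
The three-way outcome described in the abstract (driver-concordant call, expression-only call, manual curation) can be illustrated with a minimal Python sketch. This is not the IntegrateALL rule set: the names `SampleEvidence`, `assign_subtype`, `HIGH_CONFIDENCE` and `EXPRESSION_ONLY_SUBTYPES`, the threshold value, the subtype label strings and the allow-list content are all hypothetical and chosen only for illustration. Dual-driver cases are not modeled separately here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical threshold for a "high-confidence" expression-based call.
HIGH_CONFIDENCE = 0.9
# Hypothetical allow-list of subtypes that may be called on expression alone.
EXPRESSION_ONLY_SUBTYPES = {"BCR::ABL1-like"}


@dataclass
class SampleEvidence:
    """Evidence collected for one sample (illustrative structure only)."""
    expression_subtype: str            # subtype predicted from gene expression
    expression_confidence: float       # classifier confidence in [0, 1]
    driver_subtypes: List[str] = field(default_factory=list)  # subtypes implied by fusions, SNVs, karyotype


def assign_subtype(ev: SampleEvidence) -> Tuple[str, Optional[str]]:
    """Return (status, subtype) following the three outcomes in the abstract."""
    # Outcome 1: the expression class agrees with a detected defining driver.
    if ev.expression_subtype in ev.driver_subtypes:
        return ("concordant_driver", ev.expression_subtype)
    # Outcome 2: no driver found, but the expression call is high-confidence
    # and the subtype is one that is allowed to be called on expression alone.
    if (
        not ev.driver_subtypes
        and ev.expression_subtype in EXPRESSION_ONLY_SUBTYPES
        and ev.expression_confidence >= HIGH_CONFIDENCE
    ):
        return ("expression_only", ev.expression_subtype)
    # Outcome 3: anything else is flagged for manual curation.
    return ("manual_curation", None)
```

For example, `assign_subtype(SampleEvidence("KMT2A", 0.7, ["KMT2A"]))` falls into the first outcome because the expression class matches a detected driver, whereas a sample whose expression class disagrees with its only detected driver is flagged for manual curation.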
