Caris-ComBat-seq: Directionally Harmonizing Large-Scale RNA-seq Datasets

Abstract

Background

Bulk RNA sequencing (RNA-seq) is an essential research and clinical diagnostics tool capable of uncovering biological insights across experimental conditions, sample types and diseases at scale. However, RNA-seq data are sensitive to batch effects introduced by technical variation. Harmonizing expression (that is, batch correction) is critical when analyzing measurements from different RNA-seq platforms. In oncology, efficient and accurate harmonization at scale is important for making use of very large datasets (hundreds of thousands of samples) of tumor molecular profiles generated on different assay platforms. We aimed to develop a method for harmonizing expression in this challenging context.

Results

Here, we extend the widely used ComBat-seq method, as implemented in the pyComBat tool, to enable three key advances: (i) directionally adjusting counts from one batch towards a reference batch, rather than towards an average expression profile, (ii) separating the training and adjusting steps so that newly profiled samples not available at the time of initial model training can be harmonized, and (iii) flexibly handling outliers to improve the quality of harmonized counts. The resulting model correctly learns gene-specific differences between assay platforms and can near-instantaneously harmonize individual samples. We validated the use of the Caris-ComBat-seq tool to harmonize RNA-seq measurements on a benchmarking dataset of ∼10,000 TCGA tumor samples. Finally, we demonstrated its strength for very large datasets by harmonizing RNA-seq data from nearly half a million tumor samples profiled by two different next-generation sequencing assays in the Caris Life Sciences clinical laboratory. Source code, tutorials and manuscript data are available at: https://github.com/Caris-Life-Sciences/Caris-ComBat-seq and https://doi.org/10.5281/zenodo.17154014.
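
The three ideas above (a reference batch as the target, a training step separate from the adjusting step, and outlier handling) can be illustrated with a deliberately simplified sketch. It is not the Caris-ComBat-seq implementation and does not use the negative binomial model of ComBat-seq; it substitutes a per-gene shift on log counts, with winsorizing as a stand-in for outlier handling. The function names `fit_reference_shift` and `adjust_counts`, and the `pseudocount` and `trim` parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_reference_shift(ref_counts, other_counts, pseudocount=1.0, trim=0.1):
    """Training step (illustrative): learn one log-scale offset per gene
    that moves the non-reference batch towards the reference batch.

    Both inputs are arrays of shape (samples, genes).
    """
    def robust_mean_log(counts):
        logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
        # Winsorize each gene at the given quantiles so that a few extreme
        # samples do not dominate the estimated batch difference.
        low, high = np.quantile(logged, [trim, 1.0 - trim], axis=0)
        return np.clip(logged, low, high).mean(axis=0)

    # Directional: the reference batch is the target and is never modified.
    return robust_mean_log(ref_counts) - robust_mean_log(other_counts)


def adjust_counts(counts, shift, pseudocount=1.0):
    """Adjusting step (illustrative): apply a previously learned shift to
    samples from the non-reference batch, including samples that were not
    available during training. Returns non-negative integer counts.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    adjusted = np.exp(logged + shift) - pseudocount
    return np.clip(np.rint(adjusted), 0, None).astype(np.int64)


# Toy data: 4 samples x 2 genes per batch (made-up numbers).
reference_batch = np.array([[100, 10]] * 4)
other_batch = np.array([[49, 21]] * 4)

shift = fit_reference_shift(reference_batch, other_batch)

# A new sample from the non-reference platform, harmonized on its own.
new_sample = np.array([[99, 43]])
harmonized = adjust_counts(new_sample, shift)
```

Because the learned shift is a small per-gene vector, it can be stored once and reused, which is what allows a single new sample to be adjusted without refitting on the full training set.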

Conclusions

We present Caris-ComBat-seq, a new variant of the ComBat-seq algorithm designed to harmonize count-based expression data in the context of high-throughput profiling laboratories. It offers the same ability to ameliorate batch effects and retain biological signal as the original ComBat-seq, with additional operational flexibility and scalability that benefit high-throughput applications.
