Multi-modal data integration for machine learning applications
Abstract
The integration of multi-modal genomic data, encompassing sequences, annotations, and coverage tracks, remains a major bottleneck in bioinformatics, both for exploratory data analysis and for machine learning applications. Current approaches rely on several specialized tools for different data modalities, leading to inefficient workflows and computational overhead. Here, we present momics, a unified framework to consolidate multi-omics data in a single repository and interrogate it with a high-performance query engine. Unlike existing tools, momics ingests genomic sequences, feature annotations, and any number of coverage tracks into TileDB-backed repositories, and provides a scalable query engine for concurrent multi-modal queries across millions of genomic loci. Our benchmarks demonstrate up to 20-fold better data compression and up to 100-fold speed improvements over standard tools like pyBigWig, with a sublinear time complexity well suited to large-scale queries. Momics provides a Python library optimized for exploratory data analysis and machine learning workflows, natively supporting current state-of-the-art bioinformatic ecosystems and cloud storage systems. We demonstrate momics’ utility through two real-world applications: (1) multi-modal data integration of hundreds of ChIP-seq datasets together with genomic sequence, and (2) multi-modal deep learning for chromatin accessibility prediction. By eliminating the need for multiple data parsing tools and providing a unified interface for all genomic data types, momics represents a paradigm shift in how large-scale multi-omics data can be managed and analyzed.
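To make the idea of a concurrent multi-modal query concrete, the following is a minimal sketch that slices a sequence and several coverage tracks over the same set of loci in one call. It does not use the momics API: the `repo` dictionary, the `query` function, the track names, and the chromosome name are all hypothetical stand-ins, and the data are random values held in memory rather than in a TileDB-backed repository.

```python
import random

# Hypothetical in-memory stand-in for a multi-modal repository.
# This is NOT the momics API; names and layout are illustrative only.
random.seed(0)
CHROM_LEN = 1_000
repo = {
    "sequence": {"chrI": "".join(random.choice("ACGT") for _ in range(CHROM_LEN))},
    "tracks": {
        "mnase": {"chrI": [random.random() for _ in range(CHROM_LEN)]},
        "atac": {"chrI": [random.random() for _ in range(CHROM_LEN)]},
    },
}


def query(repo, loci, track_names):
    """Slice the sequence and every requested track over the same loci.

    `loci` is a list of (chrom, start, end) tuples with 0-based,
    end-exclusive coordinates (an assumption of this sketch).
    """
    result = {"sequence": [], **{name: [] for name in track_names}}
    for chrom, start, end in loci:
        result["sequence"].append(repo["sequence"][chrom][start:end])
        for name in track_names:
            result[name].append(repo["tracks"][name][chrom][start:end])
    return result


loci = [("chrI", 0, 100), ("chrI", 250, 350), ("chrI", 900, 1000)]
batch = query(repo, loci, ["mnase", "atac"])

# One entry per locus for every modality, all aligned on the same coordinates.
print(len(batch["sequence"]), len(batch["mnase"][0]))
```

The point of the sketch is the interface shape: one set of loci goes in, and aligned slices of every modality come out, so downstream code does not need a separate parser per file type.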
Key points
- momics is a unified framework to consolidate sequences, annotations, and coverage tracks into a single queryable repository, addressing the critical bottleneck in genomic data analysis where researchers must juggle multiple specialized tools for different data modalities.
- We show that momics can achieve up to 20-fold better data compression and up to 100-fold speed improvements over standard tools, with sublinear time complexity when querying millions of genomic positions simultaneously.
- We use momics to formally demonstrate that multi-modal deep learning models can outperform single-modality approaches in predicting chromatin accessibility, achieving a correlation of 0.84 when training with a combination of genomic sequence and MNase data.
- Our results establish a new paradigm for reproducible multi-omics modeling, where entire multi-omics analysis workflows, from data storage to machine learning model training, can be replicated.
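The correlation figure quoted in the key points compares predicted and observed accessibility signal. The text here does not say which correlation coefficient was used, so the sketch below assumes Pearson's r; the function name `pearson_r` and the example values are hypothetical and are not taken from the study.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_p * var_o)


# Made-up accessibility values at five positions (illustrative only).
observed = [0.1, 0.4, 0.9, 0.7, 0.2]
predicted = [0.2, 0.5, 0.8, 0.6, 0.1]
r = pearson_r(predicted, observed)
print(f"Pearson r = {r:.3f}")
```

Evaluating a single-modality model and a multi-modal model with the same function on the same held-out loci gives directly comparable scores.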