Multi-modal data integration for machine learning applications

Jacques Serizay
Romain Koszul

Abstract

The integration of multi-modal genomic data, encompassing sequences, annotations, and coverage tracks, remains a major bottleneck in bioinformatics, both for exploratory data analysis and for machine learning applications. Current approaches rely on several specialized tools for different data modalities, leading to inefficient workflows and computational overhead. Here, we present momics, a unified framework to consolidate multi-omics data in a single repository and interrogate it with a high-performance query engine. Unlike existing tools, momics ingests genomic sequences, feature annotations, and an unlimited number of coverage tracks into TileDB-backed repositories, and provides a scalable query engine for concurrent multi-modal queries across millions of genomic loci. Our benchmarks demonstrate up to 20-fold better data compression and up to 100-fold speed improvements over standard tools like pyBigWig, with sublinear time complexity well suited to large-scale queries. Momics provides a Python library optimized for exploratory data analysis and machine learning workflows, with native support for current state-of-the-art bioinformatics ecosystems and cloud storage systems. We demonstrate momics’ utility through two real-world applications: (1) multi-modal data integration of hundreds of ChIP-seq datasets together with genomic sequence, and (2) multi-modal deep learning for chromatin accessibility prediction. By eliminating the need for multiple data parsing tools and providing a unified interface for all genomic data types, momics represents a paradigm shift in how large-scale multi-omics data can be managed and analyzed.
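To make the idea of a single multi-modal query concrete, here is a minimal in-memory sketch. It is not the momics API: the `ToyRepository` class, its field names, and the `query` method are hypothetical, and plain Python containers stand in for the TileDB-backed storage described in the abstract. It only illustrates the access pattern in which one call returns sequence, coverage, and overlapping annotations for a batch of loci.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """Hypothetical in-memory stand-in for a multi-modal genomic store."""

    # chromosome name -> nucleotide string
    sequences: dict = field(default_factory=dict)
    # track name -> chromosome name -> per-base signal values
    tracks: dict = field(default_factory=dict)
    # annotation records as (chrom, start, end, name)
    features: list = field(default_factory=list)

    def query(self, loci):
        """Return every modality for each locus in a single pass.

        Loci are (chrom, start, end) tuples using 0-based, half-open
        coordinates (an assumption made for this sketch).
        """
        results = []
        for chrom, start, end in loci:
            results.append({
                "locus": (chrom, start, end),
                "sequence": self.sequences[chrom][start:end],
                "coverage": {
                    name: by_chrom[chrom][start:end]
                    for name, by_chrom in self.tracks.items()
                },
                # A feature overlaps the locus if the two intervals intersect.
                "features": [
                    name
                    for f_chrom, f_start, f_end, name in self.features
                    if f_chrom == chrom and f_start < end and f_end > start
                ],
            })
        return results


repo = ToyRepository(
    sequences={"chrI": "ACGTACGTAC"},
    tracks={"atac": {"chrI": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0]}},
    features=[("chrI", 2, 6, "geneA"), ("chrI", 8, 10, "geneB")],
)
hits = repo.query([("chrI", 1, 5), ("chrI", 6, 9)])
```

In a workflow built from separate tools, each of the three lookups would typically go through a different parser and file format; the sketch collapses them into one call to show what a unified interface buys the caller.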

Key points

momics is a unified framework to consolidate sequences, annotations, and coverage tracks into a single queryable repository, addressing the critical bottleneck in genomic data analysis where researchers must juggle multiple specialized tools for different data modalities.
We show that momics can achieve up to 20-fold better data compression and up to 100-fold speed improvements over standard tools, with sublinear time complexity when querying millions of genomic positions simultaneously.
We use momics to formally demonstrate that multi-modal deep learning models can outperform single-modality approaches in predicting chromatin accessibility, achieving a correlation of 0.84 when trained on a combination of genomic sequence and MNase data.
Our results establish a new paradigm for reproducible multi-omics modeling, where entire multi-omics analysis workflows from data storage to machine learning model training can be replicated.
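As an illustration of what combining genomic sequence with a coverage signal can look like as model input, the sketch below one-hot encodes a sequence, appends a per-base signal as a fifth channel, and scores predictions with a correlation coefficient. Everything here is an assumption made for illustration: the function names are hypothetical, the five-channel layout is one common encoding rather than the authors' architecture, and Pearson correlation is assumed because the page does not state which correlation measure was reported.

```python
import math

BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element vector; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def stack_modalities(sequence, signal):
    """Append a per-base signal value to each one-hot row, giving 5 channels per position."""
    if len(sequence) != len(signal):
        raise ValueError("sequence and signal must cover the same number of positions")
    return [row + [float(value)] for row, value in zip(one_hot(sequence), signal)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy example: four positions of sequence plus a made-up signal track.
features = stack_modalities("ACGN", [0.5, 1.0, 1.5, 2.0])

# Toy evaluation: compare made-up predictions against made-up observations.
score = pearson([0.1, 0.4, 0.35, 0.8], [0.0, 0.5, 0.3, 0.9])
```

A single-modality baseline would use only the first four channels; the multi-modal variant differs solely in the extra signal channel, which keeps the comparison between the two inputs straightforward.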

Version published to 10.1101/2025.10.10.681692 on bioRxiv, Oct 13, 2025

Related articles

ChatMDV: Democratising Bioinformatics Analysis Using Large Language Models

This article has 9 authors:
1. Maria Kiourlappou
2. Peter Todd
3. Yaxuan Kong
4. Jayesh Hire
5. Sibgathullah Furquan Nawab Mohammed
6. Martin Sergeant
7. Stefan Zohren
8. Jim Hughes
9. Stephen Taylor
This article has no evaluations. Latest version: Aug 27, 2025

Query Augmented Generation (QAG) from the Genomic Data Commons for Accurate Variant Statistics

This article has 7 authors:
1. Aarti Venkat
2. William P. Wysocki
3. Michael Lukowski
4. Steven Song
5. Anirudh Subramanyam
6. Zhenyu Zhang
7. Robert L. Grossman
This article has no evaluations. Latest version: Sep 7, 2025

Evaluating Multiomics Integration Architectures for Training With Structured Missingness

This article has 9 authors:
1. Simon Fisher
2. Jacob Bradley
3. George Lansdown
4. Owen Anderson
5. Russell Hung
6. James Lesh
7. Murray Cutforth
8. Ian Poole
9. Jeremy P. Voisey
This article has no evaluations. Latest version: Sep 12, 2025
