GLM-Prior: a genomic language model for transferable sequence-derived priors in gene regulatory network inference
Abstract
Gene regulatory network (GRN) inference depends on high-quality prior knowledge, yet curated priors are often incomplete or unavailable across species and cell types. We present GLM-Prior, a genomic language model fine-tuned to predict transcription factor (TF)–target gene interactions directly from nucleotide sequence. We integrate GLM-Prior with PMF-GRN, a probabilistic matrix factorization model, to create a dual-stage pipeline that combines sequence-derived priors with single-cell gene expression data for GRN inference. Across six human, mouse, and yeast cell lines, GLM-Prior performance scales with positive label abundance and diverse transcription factor coverage, achieving strong accuracy in well-annotated mammalian contexts. We evaluate single-species, species-transfer, and multi-species training paradigms and show that GLM-Prior generalizes to held-out gene and TF sequences, enabling experiment-agnostic prior construction in previously unprofiled contexts. Furthermore, comparisons to accessibility-based priors across multiple GRN inference methods show that GLM-Prior provides the most robust priors in mammalian cell lines. Together, our results demonstrate that prior construction, rather than the choice of GRN inference algorithm, is the primary determinant of GRN inference performance, and establish GLM-Prior as a framework for building high-quality, experiment-agnostic priors that can be deployed even in understudied or experimentally inaccessible systems.