Sample-specific haplotype-resolved protein isoform characterization via long-read RNA-seq-based proteogenomics

Abstract

Protein isoform inference from bottom-up mass spectrometry (MS) relies on database search strategies that assume the reference protein database accurately reflects the full repertoire of genetic and transcriptomic states present in the sample being analyzed. Long-read RNA sequencing (lrRNA-seq) now enables simultaneous recovery of complete transcript structures and the genetic variants present on each molecule, offering a direct route to allele-specific isoforms, yet this capability has not been fully leveraged to improve MS-based proteogenomics workflows. Here, we develop an end-to-end workflow for constructing and searching haplotype-resolved, sample-specific proteomes using matched lrRNA-seq and MS data. We benchmark multiple phasing algorithms on PacBio lrRNA-seq from Genome-in-a-Bottle samples and identify methods that achieve high phasing accuracy and completeness on transcriptomic reads. Our open-source, modular Snakemake pipeline performs variant calling, read-based phasing, isoform discovery, haplotype-resolved proteome construction, MS search, and downstream annotation. To demonstrate its utility, we apply the workflow to an induced pluripotent stem cell line (WTC11) and to an osteoblast differentiation time course, showing that haplotype-resolved databases enable detection of variant and splice peptides, allele-specific protein isoforms, and linked variants not detectable with reference-only proteomes. Together, our results demonstrate that lrRNA-seq-based phasing is feasible and effective for proteogenomics and provide a practical framework for allele-resolved proteome characterization in dynamic or disease-relevant settings.
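As an illustration of the haplotype-resolved proteome construction step described above, the following is a minimal sketch and not the authors' pipeline code. It assumes a coding sequence already extracted from a discovered isoform, and single-nucleotide variants with phased genotypes given in transcript coordinates. The function names (`translate`, `apply_haplotype`, `haplotype_proteins`), the variant tuple layout, and the toy sequence are all hypothetical, chosen only for this example.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_haplotype(cds, variants, hap):
    """Return the CDS carrying the alleles of one haplotype.

    variants: list of (pos, ref, alt, genotype), where pos is a 0-based
    offset into the CDS and genotype is a phased pair such as (0, 1).
    hap: 0 or 1, selecting the first or second haplotype.
    (Hypothetical layout; SNVs only, so coordinates never shift.)
    """
    seq = list(cds)
    for pos, ref, alt, genotype in variants:
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        if genotype[hap] == 1:
            seq[pos] = alt
    return "".join(seq)


def haplotype_proteins(cds, variants):
    """Translate both haplotypes of one isoform into protein sequences."""
    return {
        "hap1": translate(apply_haplotype(cds, variants, 0)),
        "hap2": translate(apply_haplotype(cds, variants, 1)),
    }


# Toy isoform: ATG GCT GAA AAA TAA, which translates to MAEK.
toy_cds = "ATGGCTGAAAAATAA"
toy_variants = [
    (4, "C", "T", (0, 1)),  # GCT -> GTT (A -> V) on the second haplotype
    (7, "A", "G", (1, 0)),  # GAA -> GGA (E -> G) on the first haplotype
]
proteins = haplotype_proteins(toy_cds, toy_variants)
```

In this toy case the two haplotypes yield distinct protein sequences, neither of which matches the reference translation, which is the situation a reference-only database search would miss. A real workflow would also need to handle indels, variants in untranslated regions, and unphased sites, none of which this sketch attempts.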
