MassNet: billion-scale AI-friendly mass spectral corpus enables robust de novo peptide sequencing

A Jun
Xiang Zhang
Xiaofan Zhang
Jiaqi Wei
Te Zhang
Yamin Deng
Pu Liu
Zongxiang Nie
Yi Chen
Nanqing Dong
Zhiqiang Gao
Siqi Sun
Tiannan Guo

Abstract

Breakthroughs in artificial intelligence (AI) for natural language processing and computer vision have been largely driven by high-quality, large-scale datasets such as OpenWebText and ImageNet. Inspired by this, we present MassNet, a foundational resource for proteomics designed to accelerate deep learning applications. MassNet is the largest known corpus of data-dependent acquisition (DDA) mass spectrometry (MS) data, derived from ~30 TB of raw files and comprising 1.54 billion MS/MS spectra, which yield 558 million peptide-spectrum matches (PSMs) across 35 species, including animals, plants, and microbes. Within the human subset, MassNet includes more than 1.7 million precursors and 19,966 proteins, covering 98% of annotated human proteins. To enable efficient AI training, we developed the Mass Spectrometry Data Tensor (MSDT), a structured format based on Parquet that enables standardized, high-performance batch access and seamless integration with GPU and TPU platforms for distributed training. We further extended MassNet to support de novo peptide sequencing, which infers peptide sequences directly from MS/MS spectra without reference databases and is critical for discovering novel proteins, characterizing non-model organisms, and identifying post-translational modifications (PTMs). We introduce XuanjiNovo, a non-autoregressive Transformer model that leverages a curriculum learning strategy to enhance training stability. By dynamically adjusting learning difficulty based on model performance, XuanjiNovo achieves smooth convergence on complex, multi-distributional data without manual hyperparameter tuning. Trained on 100 million PSMs from MassNet, it consistently outperforms state-of-the-art methods across diverse benchmarking tasks. Peptide recall exceeds 0.8 on the Bacteroides thetaiotaomicron and Zea mays datasets. On human data acquired using the Orbitrap Astral platform, XuanjiNovo achieves a 38.8% to 144.3% improvement over existing models.
MassNet represents the first large-scale, standardized foundational dataset in proteomics, marking a critical milestone in the integration of artificial intelligence into proteomics research.
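
The abstract reports peptide recall and a percentage improvement over existing models without defining either quantity. The sketch below shows one common way such numbers are computed in de novo sequencing benchmarks. It rests on assumptions that are not confirmed by the abstract: a prediction counts as correct only when the full sequence matches the reference, isoleucine and leucine are treated as equivalent because they have identical mass, and the improvement is taken to be relative to the baseline score. The function names and the toy sequences are hypothetical and carry no data from the preprint.

```python
from typing import Sequence


def _normalize(seq: str) -> str:
    # Assumption: I and L are isobaric, so they are treated as the same residue.
    return seq.replace("I", "L")


def peptide_recall(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of spectra whose predicted peptide matches the reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one spectrum is required")
    # Assumption: whole-sequence match; partial matches earn no credit.
    hits = sum(_normalize(p) == _normalize(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline` (assumed to be a relative gain)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Toy data for illustration only; these are not results from the preprint.
toy_predicted = ["PEPTLDE", "ACDK", "GGR"]
toy_reference = ["PEPTIDE", "ACDK", "GGK"]
toy_recall = peptide_recall(toy_predicted, toy_reference)  # 2 of 3 match
toy_gain = relative_improvement(0.5, 0.25)  # doubling the score is a 100% gain
```

Under the relative reading, an improvement of more than 100% means the score more than doubled, which is only possible when the baseline recall is below 0.5.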

Version published to 10.1101/2025.06.20.660691 on bioRxiv
Jun 26, 2025

Related articles

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025

qcMol: a large-scale dataset of 1.2 million molecules with high-quality quantum chemical annotations for molecular representation learning

This article has 3 authors:
1. Haipeng Gong
2. Haoyu Wang
3. Ziyan Zhang
This article has no evaluations. Latest version: Dec 29, 2025

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluations. Latest version: Dec 23, 2025
