Confronting spurious evaluations of computational methods in small molecule mass spectrometry

Vishu Gupta
Michael A. Skinnider

Abstract

Mass spectrometry-based metabolomics detects thousands of small molecule-associated signals in biological samples, but the vast majority cannot be structurally identified. Mounting interest in this metabolomic “dark matter” has spurred the development of dozens of machine-learning models for structural annotation of small molecules from their MS/MS spectra. Here, we expose a fundamental flaw in the longstanding paradigm by which these models have been evaluated. We show that a trivial machine-learning model can achieve strong performance on existing benchmarks despite wholly discarding the information contained within MS/MS spectra themselves, and without using any other auxiliary information. This performance arises because compounds with reference MS/MS spectra are structurally distinct from those found in generic chemical databases, and machine-learning models can exploit this dissimilarity by learning to predict whether a compound is likely to have been measured by MS/MS. However, we show that this confound can be overcome by using a generative model to sample decoy structures that are chemically indistinguishable from those found in reference MS/MS libraries. The resulting benchmark cannot be solved without attending to MS/MS spectra, and therefore provides an epistemologically valid framework to evaluate computational methods for the annotation of MS/MS spectra from small molecules.
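The confound described in the abstract can be illustrated with a toy example. The sketch below is not the authors' model: it swaps in a naive Bayes classifier over SMILES character bigrams as a stand-in for "a trivial machine-learning model". The class `LibraryLikenessScorer`, the function `rank_candidates`, and the handful of SMILES strings are all hypothetical and chosen only for illustration. The scorer learns to tell "library-like" structures from "generic database" structures, then ranks candidates for a query without ever reading the spectrum.

```python
import math
from collections import Counter


def char_bigrams(smiles):
    """Character bigrams of a SMILES string, padded with start/end markers."""
    padded = f"^{smiles}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class LibraryLikenessScorer:
    """Naive Bayes stand-in: does a structure look like it has a reference spectrum?"""

    def fit(self, library_smiles, database_smiles):
        # Label 1 = structures with reference MS/MS spectra, 0 = generic database.
        self.counts = {1: Counter(), 0: Counter()}
        for label, group in ((1, library_smiles), (0, database_smiles)):
            for smiles in group:
                self.counts[label].update(char_bigrams(smiles))
        self.totals = {k: sum(c.values()) for k, c in self.counts.items()}
        self.vocab_size = len(set(self.counts[0]) | set(self.counts[1])) + 1
        return self

    def score(self, smiles):
        """Laplace-smoothed log-odds that the structure resembles the library."""
        log_odds = 0.0
        for gram in char_bigrams(smiles):
            p_lib = (self.counts[1][gram] + 1) / (self.totals[1] + self.vocab_size)
            p_db = (self.counts[0][gram] + 1) / (self.totals[0] + self.vocab_size)
            log_odds += math.log(p_lib / p_db)
        return log_odds


def rank_candidates(scorer, spectrum, candidates):
    """Rank candidate structures for a query; the spectrum is deliberately ignored."""
    del spectrum  # the whole point: no spectral information is used
    return sorted(candidates, key=scorer.score, reverse=True)


# Toy, hand-picked data (assumption): oxygen-rich "library" vs. halogenated "database".
library = ["CC(=O)O", "OCC(O)CO", "NCC(=O)O", "CC(O)C(=O)O"]
database = ["ClCCBr", "FC(F)(F)CCl", "BrCCCBr", "ClC(Cl)CF"]
scorer = LibraryLikenessScorer().fit(library, database)

# The true structure is library-like; the decoy comes from the generic database.
candidates = ["ClCC(F)Br", "CC(N)C(=O)O"]
ranked = rank_candidates(scorer, spectrum=None, candidates=candidates)
print(ranked[0])
```

In this toy setting the library-like candidate is ranked first even though `spectrum` is `None`. If the decoys were instead drawn from the same chemical distribution as the library, which is what the abstract proposes to achieve with a generative model, a scorer of this kind would have nothing to exploit.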

Version published to 10.64898/2026.05.03.722532 on bioRxiv
May 6, 2026

Related articles

Predicting Discrete Structural Transformations in Small Molecules from Tandem Mass Spectrometry

This article has 10 authors:
1. Xianghu Wang
2. Gwendolyn Kiler
3. Daniela Herrera-Rosero
4. Mohammed Reza Shahneh
5. Michael Strobel
6. Christian Geibel
7. Yasin El Abiead
8. Vanessa V. Phelan
9. Daniel Petras
10. Mingxun Wang
This article has no evaluations. Latest version: May 11, 2026

Reference-free compound identification using computational prediction of molecular properties and multi-dimensional spectrometric measurements: a fentanyl case study

This article has 17 authors:
1. Christopher P. Harrilal
2. Adam L. Hollerbach
3. Danielle Ciesielski
4. Katherine J. Schultz
5. Richard Overstreet
6. Peter S. Rice
7. Ethan King
8. Julia Nguyen
9. Dylan H. Ross
10. Vivian S. Lin
11. Grace Y. Deng
12. Eva Brayfindley
13. Bobbie-Jo M. Webb-Robertson
14. Simone Raugei
15. Yehia M. Ibrahim
16. Robert G. Ewing
17. Thomas O. Metz
This article has no evaluations. Latest version: Apr 27, 2026

Assessing the Generalizability of Machine Learning and Physics-Based Methods with DNA-Encoded Libraries

This article has 7 authors:
1. Marissa Dolorfino
2. Daniel Santos Perez
3. Yao Fu
4. Shu-Hang Lin
5. Sean McCarty
6. Matthew J. O’Meara
7. Terra Sztain
This article has no evaluations. Latest version: Apr 19, 2026
