ChemEmbed: A deep learning framework for metabolite identification using enhanced MS/MS data and multidimensional molecular embeddings
Abstract
Machine learning tools have become essential for annotating the vast number of unidentified MS/MS spectra in metabolomics, addressing the limitations of current reference spectral libraries. However, these tools often struggle with the high dimensionality and sparsity of MS/MS spectra and metabolite structures. ChemEmbed introduces a novel approach by combining multidimensional and continuous vector representations of chemical structures with enhanced MS/MS spectra. This enhancement is achieved by merging spectra from multiple collision energies and incorporating calculated neutral losses from 38,472 distinct compounds, providing richer input for a convolutional neural network (CNN). ChemEmbed achieves top-ranked candidate annotations in over 42% of cases and identifies the correct compound within the top five in more than 76% of cases in a test dataset. Against external benchmarks such as CASMI 2016 and 2022, ChemEmbed outperforms SIRIUS, the current state of the art in computational metabolomics. In a validation experiment with the Annotated Recurrent Unidentified Spectra (ARUS) dataset, which includes over 25,000 spectra from human plasma and 68,000 from urine, ChemEmbed successfully identified 24 previously unannotated compounds. By aligning with the advanced capabilities of modern mass spectrometry instrumentation, ChemEmbed balances accuracy, computational efficiency, and scalability, making it a powerful solution for high-throughput metabolomics applications.
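The spectral enhancement described in the abstract (merging spectra from several collision energies, then adding calculated neutral losses) can be illustrated with a minimal Python sketch. The abstract does not specify how peaks are merged or how losses are filtered, so everything here is an assumption made for illustration: the function names `merge_spectra` and `neutral_losses`, the m/z bin width, the rule of keeping the most intense peak per bin, the base-peak normalization, and the minimum-loss cutoff. It is not ChemEmbed's actual implementation.

```python
def merge_spectra(spectra, bin_width=0.01):
    """Merge peak lists recorded at different collision energies.

    Assumption: peaks falling in the same m/z bin (width `bin_width`)
    are treated as the same fragment, and the most intense one is kept.
    Each spectrum is a list of (mz, intensity) tuples.
    """
    best = {}
    for peaks in spectra:
        for mz, intensity in peaks:
            key = round(mz / bin_width)
            if key not in best or intensity > best[key][1]:
                best[key] = (mz, intensity)
    merged = sorted(best.values())
    if not merged:
        return []
    # Normalize to the base peak so merged intensities lie in (0, 1].
    top = max(intensity for _, intensity in merged)
    return [(mz, intensity / top) for mz, intensity in merged]


def neutral_losses(peaks, precursor_mz, min_loss=0.5):
    """Compute neutral losses as precursor m/z minus fragment m/z.

    Assumption: losses smaller than `min_loss` (essentially the
    precursor itself) are dropped.
    """
    return [
        (precursor_mz - mz, intensity)
        for mz, intensity in peaks
        if precursor_mz - mz >= min_loss
    ]


# Two hypothetical spectra of one compound at low and high collision energy.
low_energy = [(50.0, 10.0), (100.0, 100.0)]
high_energy = [(50.0, 40.0), (75.0, 20.0)]

merged = merge_spectra([low_energy, high_energy])
losses = neutral_losses(merged, precursor_mz=120.0)
```

In this sketch the merged fragments and their neutral losses would together form the enriched input; how ChemEmbed encodes them for the CNN is not stated in the abstract.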