Evaluating generalizability of artificial intelligence models for molecular datasets

Abstract

Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity-based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce SPECTRA, a spectral framework for comprehensive model evaluation. For a given model and input data, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply SPECTRA to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. SPECTRA paves the way toward a better understanding of how foundation models generalize in biology.
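
The summary measure described in the abstract, the area under the curve of model performance versus decreasing cross-split overlap, can be illustrated with a minimal sketch. This is not the SPECTRA implementation: the function name `area_under_spectral_curve`, the `spectral_parameter` axis (assumed here to run from 0 for maximal overlap to 1 for minimal overlap), and the example numbers are all hypothetical, and the trapezoidal rule is an assumed choice of integration.

```python
from typing import Sequence


def area_under_spectral_curve(
    spectral_parameter: Sequence[float],
    performance: Sequence[float],
) -> float:
    """Trapezoidal area under a performance-vs-overlap curve.

    spectral_parameter: hypothetical x-axis; larger values are assumed to
        mean less cross-split overlap between train and test.
    performance: model metric measured at each x-axis value.
    """
    if len(spectral_parameter) != len(performance):
        raise ValueError("inputs must have the same length")
    if len(spectral_parameter) < 2:
        raise ValueError("need at least two points to integrate")

    # Sort points by the x-axis so the integration runs left to right.
    points = sorted(zip(spectral_parameter, performance))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of one trapezoid between consecutive curve points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up values for illustration only (not results from the paper):
# performance drops as cross-split overlap decreases.
example_x = [0.0, 0.25, 0.5, 0.75, 1.0]
example_y = [0.90, 0.85, 0.70, 0.60, 0.55]
print(area_under_spectral_curve(example_x, example_y))
```

Under these assumptions, a model whose performance stays flat as overlap decreases gets a larger area than one whose performance falls off quickly, which is the sense in which the area summarizes generalizability.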

Article activity feed

  1. All data is also available on the project Github at https://github.com/mims-harvard/SPECTRA and on Harvard Dataverse at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN.

    It is great that all your data is available! It would be helpful to add a LICENSE to the repo so others know the terms of reuse, and to improve the documentation on exactly how to use SPECTRA for different cases. For example, some of the rationale for the SP decisions is given here in the discussion and could accompany examples in the repo as well.

  2. We define a spectral property (SP) as a MSP expected to affect model generalizability for a specific task (e.g. 3D protein structure for protein binding prediction). The definition of the spectral property is task-specific and, together with the molecular sequence dataset and model, are the only inputs to SPECTRA

    I think this should be earlier in the introduction

  3. Main

    Overall, this is a really well-written introduction that can be understood by a general audience! I learned a lot and am also looking forward to digging into some of the cited references.

  4. generating a spectral performance curve (SPC). We propose the area under this curve (AUSPC)

    It's pretty early in the paper and it's already acronym-heavy. Terms like spectral performance curve and area under the curve might not need to be abbreviated, since the reader has to think back to what each acronym means every time, and there are already MB and SB.

  5. metadata-based (MB) or similarity-based (SB)

    Just a small note: in the abstract SB is defined as "sequence-similarity based", but here it is just "similarity-based". It would be good to be consistent.