Evaluating generalizability of artificial intelligence models for molecular datasets

Yasha Ektefaie
Andrew Shen
Daria Bykova
Maximillian Marin
Marinka Zitnik
Maha Farhat

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata-based (MB) or sequence-similarity based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce SPECTRA, a spectral framework for comprehensive model evaluation. For a given model and input data, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply SPECTRA to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With SPECTRA, we find that as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. SPECTRA paves the way toward a better understanding of how foundation models generalize in biology.
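The "area under this curve" summary in the abstract can be illustrated with a minimal sketch. It assumes the curve is given as a handful of (cross-split overlap, performance) points and integrates them with the trapezoidal rule. The function name `area_under_spectral_curve`, the choice of integrating over overlap directly, and the numbers below are illustrative assumptions, not the SPECTRA implementation or results from the paper.

```python
from typing import Sequence


def area_under_spectral_curve(overlaps: Sequence[float], scores: Sequence[float]) -> float:
    """Trapezoidal area under a performance-vs-overlap curve (hypothetical helper)."""
    if len(overlaps) != len(scores) or len(overlaps) < 2:
        raise ValueError("need at least two (overlap, score) points of equal length")
    # Sort by overlap so the order in which splits were evaluated does not matter.
    points = sorted(zip(overlaps, scores))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of one trapezoid between two neighbouring overlap levels.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up numbers for illustration only: performance drops as overlap decreases.
overlaps = [1.0, 0.75, 0.5, 0.25, 0.0]
scores = [0.9, 0.8, 0.7, 0.5, 0.4]
auspc = area_under_spectral_curve(overlaps, scores)
print(f"area under the curve: {auspc:.4f}")
```

Under this sketch, a model whose performance stays flat as overlap shrinks keeps a larger area than one that degrades quickly, which is the intuition behind using the area as a generalizability measure.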

Arcadia Science
Apr 12, 2024

All data is also available on the project Github at https://github.com/mims-harvard/SPECTRA and on Harvard Dataverse at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN.

It's great that all your data is available! It would be helpful to provide a LICENSE in the repo so others know the terms of reuse, along with improved documentation on exactly how to use SPECTRA for different cases. For instance, some of the rationale for the SP decisions is given here in the discussion and could be paired with examples in the repo as well.

Arcadia Science
Apr 12, 2024

We define a spectral property (SP) as an MSP expected to affect model generalizability for a specific task (e.g., 3D protein structure for protein binding prediction). The definition of the spectral property is task-specific, and it, the molecular sequence dataset, and the model are the only inputs to SPECTRA.

I think this should be earlier in the introduction
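To make the quoted definition more concrete, here is a minimal sketch of how a spectral property and the resulting cross-split overlap might be expressed in code. Everything in it is an assumption for illustration: the property `shares_property` (two equal-length sequences differing at no more than one position), the helper `cross_split_overlap`, and the choice to measure overlap as the fraction of test samples that share the property with at least one train sample. None of these names come from the SPECTRA repository.

```python
from typing import Callable, Sequence


def shares_property(a: str, b: str, max_mismatches: int = 1) -> bool:
    """Hypothetical spectral property: equal-length sequences that differ
    at no more than max_mismatches positions are treated as similar."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def cross_split_overlap(
    train: Sequence[str],
    test: Sequence[str],
    prop: Callable[[str, str], bool],
) -> float:
    """Assumed overlap measure: fraction of test samples that share the
    property with at least one train sample."""
    if not test:
        raise ValueError("test split is empty")
    hits = sum(any(prop(t, s) for s in train) for t in test)
    return hits / len(test)


# Toy sequences for illustration only.
train = ["ACGT", "TTTT"]
test = ["ACGA", "GGGG", "TTTA"]
overlap = cross_split_overlap(train, test, shares_property)
print(f"cross-split overlap: {overlap:.2f}")
```

A task-specific property (for example, one based on structure rather than sequence mismatches) would replace `shares_property` while the overlap computation stays the same.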

Arcadia Science
Apr 12, 2024

Main

Overall, this is a really well-written introduction that can be understood by a general audience! I learned a lot and am also looking forward to digging into some of the cited references.

Arcadia Science
Apr 12, 2024

a spectral property definition

I'm still confused about what this is supposed to be, even after having finished reading this paragraph.

Arcadia Science
Apr 12, 2024

generating a spectral performance curve (SPC). We propose the area under this curve (AUSPC)

It's pretty early in the paper and it's already pretty acronym-heavy. I think some of these terms, like spectral performance curve and area under the curve, might not need to be abbreviated, since the reader will have to think back to what these terms mean each time, and there are already MB and SB.

Arcadia Science
Apr 12, 2024

metadata-based (MB) or similarity-based (SB)

Just a small note: in the abstract, SB is referred to as "sequence-similarity based", and here as just "similarity-based". It would be good to be consistent.

Version published to 10.1101/2024.02.25.581982v1 on bioRxiv
Feb 28, 2024

Related articles

VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction

This article has 5 authors:
1. Céline Marquet
2. Julius Schlensok
3. Marina Abakarova
4. Burkhard Rost
5. Elodie Laine
This article has no evaluations
Latest version Apr 28, 2024

BetaAlign: a deep learning approach for multiple sequence alignment

This article has 7 authors:
1. Edo Dotan
2. Elya Wygoda
3. Noa Ecker
4. Michael Alburquerque
5. Oren Avram
6. Yonatan Belinkov
7. Tal Pupko
This article has no evaluations
Latest version Apr 3, 2024

Generative Models for Prediction of Non-B DNA Structures

This article has 2 authors:
1. Oleksandr Cherednichenko
2. Maria Poptsova
This article has no evaluations
Latest version Mar 28, 2024
