Limitations of current machine learning models in predicting enzymatic functions for uncharacterized proteins

Valérie de Crécy-Lagard
Raquel Dias
Nick Sexson
Iddo Friedberg
Yifeng Yuan
Manal A Swairjo

Listed in

Evaluated articles (Arcadia Science)

Abstract

Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein “unknome.” This large knowledge shortfall is one of the final frontiers of biology. Machine learning (ML) approaches are enticing, with early successes demonstrating the ability to propagate functional knowledge from experimentally characterized proteins. An open question is whether ML approaches can predict enzymatic functions unseen in the training sets. By integrating literature and a combination of bioinformatic approaches, we individually evaluated Enzyme Commission number predictions for over 450 Escherichia coli unknowns made using state-of-the-art ML approaches. We found that current ML methods not only mostly fail to make novel predictions but also make basic logic errors in their predictions that human annotators avoid by leveraging the available knowledge base. This underscores the need to include assessments of prediction uncertainty in model output and to test for “hallucinations” (logic failures) as a part of model evaluation. Explainable artificial intelligence analysis can be used to identify indicators of prediction errors, potentially identifying the most relevant data to include in the next generation of computational models.

Version published to 10.1093/g3journal/jkaf169
Jul 24, 2025
Arcadia Science
Jul 12, 2024

Computational models could help propagate the experimentally validated functional annotations to the correct portion of the protein space

I've wondered whether there might be interesting signatures that could differentiate between 1) inappropriate transfer of functional annotations to seemingly similar proteins and 2) incomplete annotations, i.e. where the other protein(s) may indeed have the originally hypothesized function AND one or more additional functions on top of this that confuse interpretation. Do you know of any work or models attempting to address this?

Arcadia Science
Jul 12, 2024

very few of the proteins in UniprotKB [54], the most widely used protein function database [55], have been linked to experimental data

Curious if you might have a ballpark number in terms of % of entries for which there is direct experimental data? I've been trying to get a sense of this and agree that it's low, but haven't been able to track down a number.

Version published to 10.1101/2024.07.01.601547 on bioRxiv
Jul 3, 2024

Related articles

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluations. Latest version: Dec 23, 2025

The Evolution of the AlphaFold Architecture

This article has 1 author:
1. Y.C.B.J. Dissanayaka
This article has no evaluations. Latest version: Jan 9, 2026

Feature-Optimized Machine Learning Benchmarking for Protein Interface Prediction in Permanent Homodimer Complexes with Distinct Structural Features

This article has 4 authors:
1. Tayyip Topuz
2. Zeki Erdem
3. Halil Bisgin
4. E. Demet Akten
This article has no evaluations. Latest version: Feb 2, 2026
