A longitudinal analysis of function annotations of the human proteome reveals consistently high biases

An Phan
Parnal Joshi
Claus Kadelka
Iddo Friedberg

Abstract

The resources available to study gene function are limited, especially given the number of genes in the human genome and the complexity of their functions. Genes are therefore prioritized for experimental study based on many considerations, including, but not limited to, perceived biomedical importance, as with disease-associated genes, and relevance to understanding biological processes, such as cell signalling pathways. At the same time, most genes are unstudied or under-characterized, which hampers our understanding of their function and their potential effects on human health and wellness. Understanding the disparity in function annotation is a necessary first step toward assessing how much functional knowledge has been gained from the human genome, and toward guidelines for targeting future studies of human genes more effectively. Here, we present a comprehensive longitudinal analysis of the human proteome using data analysis tools from economics and information theory. Specifically, we view the human proteome as a population of proteins within a knowledge economy: we treat the quantified knowledge of a protein's function as the analogue of wealth, and we examine the distribution of information across the proteins in the proteome in the same manner that the distribution of wealth is studied in societies. Our results show that the distribution of information about human proteins has been highly skewed over the last decade, and that inequality in the annotations given to proteins remains high. Additionally, we examine the correlation between the knowledge about protein function as captured in databases and the interest in proteins as reflected by mentions in the scientific literature. We show a large gap between knowledge and interest and dissect the factors leading to this gap. In conclusion, our study shows that research efforts should be redirected to less-studied proteins to mitigate the disparity among human proteins in both databases and the literature.
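The wealth analogy in the abstract can be made concrete with a small sketch. The abstract does not name the specific measures the authors used, so the Gini coefficient below is an assumption: it is one standard inequality measure from economics, chosen here only for illustration. The function name `gini` and the per-protein annotation counts are hypothetical and are not data from the study.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0 means every protein has the same amount of annotation;
    values approaching 1 mean annotation is concentrated in a few proteins.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n,
    # where i = 1..n is the rank of x_i in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical annotation counts per protein (illustrative only).
equal_counts = [5, 5, 5, 5]    # every protein equally annotated
skewed_counts = [0, 0, 0, 20]  # one protein holds all annotations

print(gini(equal_counts))   # 0.0
print(gini(skewed_counts))  # 0.75
```

Computing such a coefficient once per database release and tracking it over time would be one way to frame the longitudinal comparison the abstract describes. In practice the input would likely be an information-based score per protein rather than raw counts, which is again an assumption about the method rather than a statement of it.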

Version published to 10.1093/database/baaf036
Jan 1, 2025
Version published to 10.1101/2024.10.18.619148 on bioRxiv
Oct 22, 2024
