Large Numbers of New Human Paralogs Discovered

BK Pradeep
Weixia Deng
Mahsa Askary Hemmat
Issak Daniels
Robert L. Jernigan

Abstract

The identification of paralogs is critical for understanding protein evolution and function and for drug design, yet many human proteins remain unannotated or poorly classified. Sequence-based homology detection alone often fails to detect distant paralogs, especially at sequence identities in the “twilight zone” or below. Here we present an integrated homolog detection framework that combines results from BLASTp, MMseqs2, Foldseek, and PROST, a tool based on a large protein language model (LPLM), followed by validation through structure comparison and, for enzymes, comparison of the specific structures of the catalytic residues. Using exhaustive all-versus-all comparisons across the 20,647 human proteins, we systematically identify novel paralogs and assess their catalytic residues for two serine protease clans. We discovered 14 previously uncharacterized human serine carboxypeptidases, validated against experimentally determined PDB structures, 11 of which display conserved catalytic triads. We further identify 203 new paralogs of human kinases, 163 of which fall in the major clusters that represent previously uncharacterized kinase subtypes, and 30 putative novel human transcription factors. Across both serine protease subtypes, structural alignments enable prediction of the previously unknown catalytic residues for proteins lacking UniProt active-site annotations. By integrating sequence-, structure-, and LPLM embedding-based approaches, the framework enables the discovery of surprisingly large numbers of previously unknown paralogs, permits the definition of their catalytic residues, and expands the understanding of protein functional landscapes. These findings provide a foundation for many future functional, evolutionary, and therapeutic investigations.
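
As a rough illustration of what combining results from several detectors can look like, the sketch below merges per-tool hit lists into one table of protein pairs and picks out the pairs that no sequence-only tool reported. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how the four result sets are combined, and the function names, the `SEQUENCE_TOOLS` grouping, and the toy identifiers `P1` to `P4` are all hypothetical.

```python
from collections import defaultdict

# Assumption: BLASTp and MMseqs2 are treated as the sequence-only detectors;
# Foldseek (structure) and PROST (language-model embeddings) are the others.
SEQUENCE_TOOLS = {"blastp", "mmseqs2"}


def merge_hits(hits_by_tool):
    """Map each unordered protein pair to the set of tools that reported it."""
    support = defaultdict(set)
    for tool, pairs in hits_by_tool.items():
        for a, b in pairs:
            if a == b:
                continue  # drop self-hits produced by all-versus-all runs
            support[frozenset((a, b))].add(tool)
    return dict(support)


def beyond_sequence(support):
    """Return pairs that no sequence-only tool reported, sorted for stable output."""
    return sorted(
        tuple(sorted(pair))
        for pair, tools in support.items()
        if not (tools & SEQUENCE_TOOLS)
    )


# Toy input with hypothetical identifiers; real input would be parsed tool output.
hits = {
    "blastp": [("P1", "P2")],
    "mmseqs2": [("P2", "P1")],
    "foldseek": [("P1", "P2"), ("P1", "P3")],
    "prost": [("P3", "P1"), ("P4", "P4")],
}
support = merge_hits(hits)
candidates = beyond_sequence(support)
```

Here the pair `P1`/`P3` is the only candidate, because it is supported by the structure- and embedding-based tools but by neither sequence tool. Any such candidate would still need the structural and catalytic-residue validation the abstract describes.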

Version published to 10.1101/2025.10.15.680306 on bioRxiv
Oct 15, 2025

