Large Numbers of New Human Paralogs Discovered
Abstract
The identification of paralogs is critical for understanding protein evolution and function and for drug design, yet many human proteins remain unannotated and poorly classified. Sequence-based homology detection alone often fails to detect distant paralogs, especially at sequence identities in the “twilight zone” or below. Here we present an integrated homolog detection framework that combines results from BLASTp, MMseqs2, Foldseek, and PROST, a tool based on a large protein language model (LPLM), followed by validation through structure comparison and, for enzymes, comparison of the specific structures of the catalytic residues. Using exhaustive all-versus-all comparisons across the 20,647 human proteins, we systematically identify novel paralogs and assess their catalytic residues for two serine protease clans. We discovered 14 previously uncharacterized human serine carboxypeptidases, validated against experimentally determined PDB structures, 11 of which display conserved catalytic triads. We further identify 203 new paralogs of human kinases, 163 of which fall in the major clusters that represent previously uncharacterized kinase subtypes, as well as 30 putative novel human transcription factors. Across both serine protease subtypes, structural alignments enable prediction of the previously unknown catalytic residues for proteins lacking UniProt active-site annotations. By integrating sequence, structure, and LPLM embedding-based approaches, the framework enables the discovery of surprisingly large numbers of unknown paralogs, permits the definition of catalytic residues, and expands the understanding of protein functional landscapes. These findings provide the foundation for a large number of future functional, evolutionary, and therapeutic investigations.
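As an illustration of what combining results from several detectors can look like, the sketch below pools candidate pairs from multiple tools and records which tools support each pair. It is a minimal sketch, not the authors' implementation: the abstract does not say whether hits are combined by union, intersection, or voting, so the union rule here is an assumption, and the function names and the protein identifiers in the usage lines are hypothetical. Only the tool names and the count of 20,647 proteins come from the abstract.

```python
from collections import defaultdict


def merge_hits(hits_by_tool):
    """Pool candidate pairs across tools (assumed union rule).

    hits_by_tool maps a tool name to an iterable of (query, target) pairs.
    Returns a dict mapping each unordered pair to the set of supporting tools.
    """
    support = defaultdict(set)
    for tool, pairs in hits_by_tool.items():
        for query, target in pairs:
            if query == target:
                continue  # drop self-hits, which all-versus-all searches produce
            # frozenset makes (A, B) and (B, A) the same candidate pair
            support[frozenset((query, target))].add(tool)
    return dict(support)


def all_vs_all_pair_count(n_proteins):
    """Number of distinct unordered pairs in an all-versus-all comparison."""
    return n_proteins * (n_proteins - 1) // 2


# Hypothetical identifiers, for illustration only.
example_hits = {
    "BLASTp": [("P1", "P2"), ("P1", "P1")],
    "Foldseek": [("P2", "P1"), ("P3", "P4")],
}
merged = merge_hits(example_hits)
for pair, tools in merged.items():
    print(sorted(pair), sorted(tools))

print(all_vs_all_pair_count(20647))
```

Keeping the set of supporting tools per pair, rather than a single yes/no flag, lets a later validation step prioritise pairs found only by the structure-based or embedding-based tools, which is where distant paralogs missed by sequence search would be expected to appear.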