Cataloging cysteines in ECOD domains using a protein language model
Abstract
Cysteine is among the most chemically versatile residues in the proteome and occurs in three competing functional states: metal coordination, covalent disulfide bonding, and the bioactive free thiol. Although these states can be readily assigned from experimentally determined protein structures using simple geometric criteria, accurately annotating them from predicted structures remains challenging. To bridge the gap between predicted structures and functional interpretation, we developed TriCyP (Tri-state Cysteine Predictor), an efficient two-layer neural network built on ESM-2 protein language model embeddings. On an independent benchmark set, TriCyP achieves near-perfect discrimination (AUROC = 0.99) and outperforms existing approaches for predicting both disulfide bonding and metal coordination. We applied TriCyP to classify 2.7 million cysteine residues across 0.9 million ECOD F70 representative domains. The resulting proteome-scale landscape recapitulates established biological patterns. Cysteines are enriched in eukaryotes; disulfide-bonded cysteines are concentrated in extracellular proteins, and metal-coordinating cysteines peak in nuclear proteins owing to the abundance of zinc-finger transcription factors. We further demonstrate the utility of cysteine-state annotation through two pilot studies. First, predicted disulfide-forming cysteines that lack a corresponding structural partner in AlphaFold models may identify either regions of elevated structural uncertainty or latent inter-protein disulfide bonds that stabilize protein-protein interactions. Second, systematic analysis of known and predicted metal-coordinating cysteines across ECOD homologous groups uncovers previously unrecognized metal-binding protein families. This proteome-wide catalog of cysteine states is available as a community resource (http://prodata.swmed.edu/tricyp) and will be integrated into future ECOD releases.
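The abstract does not spell out the "simple geometric criteria" used to assign cysteine states from experimental structures. As an illustration only, the following sketch shows one way such a rule could look. It assumes distance cutoffs of 2.5 Å between sulfur (SG) atoms for a disulfide and 3.0 Å between SG and a metal ion for coordination, and it gives metal coordination priority when both apply. The cutoffs, the priority order, and the function name `classify_cysteines` are hypothetical and are not taken from the paper.

```python
import math

# Assumed cutoffs, chosen for illustration; not the values used by the authors.
DISULFIDE_MAX = 2.5  # Å, SG-SG distance (a typical S-S bond is about 2.05 Å)
METAL_MAX = 3.0      # Å, SG-metal distance


def classify_cysteines(sg_coords, metal_coords):
    """Label each cysteine SG atom as 'metal', 'disulfide', or 'free'.

    sg_coords    -- list of (x, y, z) tuples, one per cysteine SG atom
    metal_coords -- list of (x, y, z) tuples, one per metal ion
    """
    labels = []
    for i, sg in enumerate(sg_coords):
        # Is any metal ion within the assumed coordination distance?
        near_metal = any(math.dist(sg, m) <= METAL_MAX for m in metal_coords)
        # Is any *other* SG atom within the assumed disulfide distance?
        near_sg = any(
            j != i and math.dist(sg, other) <= DISULFIDE_MAX
            for j, other in enumerate(sg_coords)
        )
        if near_metal:          # assumed priority: metal over disulfide
            labels.append("metal")
        elif near_sg:
            labels.append("disulfide")
        else:
            labels.append("free")
    return labels


# Toy coordinates: two bonded cysteines, one near a metal ion, one isolated.
sg_atoms = [(0.0, 0.0, 0.0), (2.05, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
metals = [(12.3, 0.0, 0.0)]
labels = classify_cysteines(sg_atoms, metals)
```

A rule of this kind depends on accurate side-chain and ligand coordinates, which is consistent with the abstract's point that predicted structures are harder to annotate: they typically lack bound metal ions, so a sequence-based predictor offers an alternative route.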