Structure-Aware Mapping of Disease-Relevant Missense Variation: A Case Study in Three Nuclear Pore Complex Genes

Abstract

Missense variation in the nuclear pore complex (NPC) remains difficult to interpret because sequence change, structural context, and sparse clinical labels all interact in nontrivial ways. We study three functionally distinct nucleoporins (GLE1, NUP214, and NUP62) and build a reproducible pipeline that binds variants to canonical UniProt coordinates, overlays AlphaFold2 per-residue confidence, and assigns domain/feature labels from UniProtKB/Pfam. Primary inferences rely strictly on curated ClinVar assertions. A separate high-confidence pseudo-labeled cohort is created for sensitivity analyses only, using a guarded weak-supervision scheme: a centroid-cosine scorer over handcrafted sequence-structural features is ensembled with a positive-unlabeled classifier, and only variants passing conservative probability gates are promoted. Across genes, curated data reveal coherent structure-function signals: pathogenic substitutions concentrate in specific domains and structurally ordered regions. The pseudo-labeled cohort preserves these trends at a larger sample size but never enters hypothesis tests. The result is a transparent workflow that cleanly separates ground truth from weak supervision, avoids label leakage, and produces interpretable, domain-level effect estimates. We argue that this combination of principled labeling, structural context, and simple, auditable models offers a practical path for variant interpretation in nucleoporins and, more broadly, in proteins rich in intrinsically disordered and repeat-containing regions.
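The guarded weak-supervision step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rescaling of the cosine difference to [0, 1], and the 0.9/0.1 gates are all assumptions made for illustration, and the positive-unlabeled classifier is represented only by a probability passed in from elsewhere. The idea shown is that a variant is promoted to the pseudo-labeled cohort only when both scorers agree at a conservative threshold; everything else stays unlabeled.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def centroid_score(x, pos_centroid, neg_centroid):
    """Map (cos to pathogenic centroid - cos to benign centroid) to [0, 1].

    The rescaling is an illustrative assumption, not taken from the paper.
    """
    diff = cosine(x, pos_centroid) - cosine(x, neg_centroid)
    return min(1.0, max(0.0, 0.5 * (1.0 + diff)))

def gate(x, pos_centroid, neg_centroid, pu_prob, hi=0.9, lo=0.1):
    """Promote a variant only if both scorers pass the same gate.

    pu_prob is the pathogenic probability from a positive-unlabeled
    classifier, supplied by the caller (hypothetical here).
    hi and lo are assumed thresholds.
    Returns "pathogenic", "benign", or None (left unlabeled).
    """
    score = centroid_score(x, pos_centroid, neg_centroid)
    if score >= hi and pu_prob >= hi:
        return "pathogenic"
    if score <= lo and pu_prob <= lo:
        return "benign"
    return None

# Toy two-feature examples standing in for curated variants.
curated_pathogenic = [[1.0, 0.0], [0.9, 0.1]]
curated_benign = [[0.0, 1.0], [0.1, 0.9]]
pos_c = centroid(curated_pathogenic)
neg_c = centroid(curated_benign)
```

Calling `gate([1.0, 1.0], pos_c, neg_c, pu_prob=0.95)` returns `None` in this toy setup, because the centroid scorer is undecided even though the classifier is confident. Requiring agreement keeps the pseudo-labeled cohort small but high-confidence, which matches its stated role in sensitivity analyses rather than in hypothesis tests.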
