GhostFold: Accurate protein structure prediction using structure-constrained synthetic coevolutionary signals
Abstract
The accuracy of protein structure prediction models such as AlphaFold2 is tightly coupled to the depth and quality of multiple sequence alignments (MSAs), posing a persistent challenge for proteins with few or no identifiable homologs. We present GhostFold, a method for conjuring structure-constrained synthetic MSAs from a single amino acid sequence, bypassing the need for traditional homology searches. Leveraging the ProstT5 protein language model and the 3Di structural alphabet, GhostFold projects a query sequence into a tokenized structural representation and iteratively back-translates to generate an ensemble of diverse, fold-consistent sequences. These synthetic alignments (pseudoMSAs) encode emergent coevolutionary constraints that are sufficient for high-accuracy structure prediction of difficult targets such as orphan proteins and hypervariable antibody loops. GhostFold consistently matches or exceeds the performance of MSA-based and language model-based structure predictors while being computationally lightweight and independent of large sequence databases. Notably, we observe a decoupling of confidence metrics (e.g., pLDDT) from prediction accuracy when using pseudoMSAs, suggesting that AlphaFold2's internal confidence calibration is strongly influenced by the statistical properties of natural sequence alignments. These results establish that structure-guided synthetic MSAs can functionally substitute for evolutionary data, offering a scalable and generalizable solution to one of the central limitations in computational structural biology. GhostFold represents a shift from passive data mining to intelligent sequence synthesis, redefining how structural priors are encoded in deep learning-based protein folding.
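
The generation loop described above (project the query into structural tokens, back-translate repeatedly, and collect the resulting sequences as a pseudoMSA) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the GhostFold implementation: `build_pseudo_msa`, `to_a3m`, the round and sample counts, and the A3M output format are hypothetical choices, and the two toy translators stand in for ProstT5 by using a three-letter residue grouping rather than the real 3Di alphabet.

```python
import random
from typing import Callable, List

# Toy stand-in for a structural alphabet: three coarse residue groups.
# This is NOT 3Di and NOT ProstT5; it only makes the loop runnable.
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGPH", "c": "DEKR"}
TOKEN_OF = {aa: tok for tok, aas in GROUPS.items() for aa in aas}


def toy_to_tokens(seq: str) -> str:
    """Hypothetical 'sequence -> structural tokens' step."""
    return "".join(TOKEN_OF[aa] for aa in seq)


def toy_from_tokens(tokens: str, rng: random.Random) -> str:
    """Hypothetical 'structural tokens -> sequence' step (stochastic)."""
    return "".join(rng.choice(GROUPS[tok]) for tok in tokens)


def build_pseudo_msa(
    query: str,
    to_tokens: Callable[[str], str],
    from_tokens: Callable[[str], str],
    n_rounds: int = 3,
    per_round: int = 4,
) -> List[str]:
    """Round-trip sequences through the token space and keep distinct,
    same-length results. The query is always the first row."""
    msa, seen, pool = [query], {query}, [query]
    for _ in range(n_rounds):
        next_pool = []
        for seq in pool:
            tokens = to_tokens(seq)
            for _ in range(per_round):
                cand = from_tokens(tokens)
                # Keep only gap-free candidates that align trivially to the query.
                if len(cand) == len(query) and cand not in seen:
                    seen.add(cand)
                    msa.append(cand)
                    next_pool.append(cand)
        pool = next_pool or pool  # reuse the old pool if nothing new appeared
    return msa


def to_a3m(msa: List[str]) -> str:
    """Serialize equal-length sequences as a simple A3M/FASTA-style alignment."""
    lines = []
    for i, seq in enumerate(msa):
        name = "query" if i == 0 else "synthetic_" + str(i)
        lines.append(">" + name)
        lines.append(seq)
    return "\n".join(lines) + "\n"


rng = random.Random(0)
msa = build_pseudo_msa("MKTLLVAGDE", toy_to_tokens, lambda t: toy_from_tokens(t, rng))
print(to_a3m(msa))
```

In a real pipeline the two callables would presumably wrap a sequence-to-3Di and a 3Di-to-sequence translation model, and the resulting alignment would be passed to the structure predictor in place of a homology-search MSA. Keeping the translators as injected functions leaves the loop independent of any particular model API.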