Assessing the potential of ancient protein sequences in the study of hominid evolution
This article has been reviewed by the following groups
- Evaluated articles (Peer Community in Evolutionary Biology)
Abstract
Palaeoproteomic data can provide invaluable insights into hominid evolution over long timescales. Yet, the potential and limitations of ancient protein sequences to resolve evolutionary relations between species remain largely unexplored. In this study, we aim to quantify how much information about these relations can be obtained from limited ancient protein data, at the scale that is currently available or will be available in the near future. We harness sequence alignments of 12 enamel and collagen proteins that have been previously reported in fossil material that is at least 1 million years old. We utilise in silico translations of hominid DNA sequences of these proteins and highlight their differential sequence conservation, indicating that some of them contain much larger amounts of information than others. We also evaluate the extent to which topologies inferred from protein data differ from topologies inferred from the more informationally dense DNA data. We show that the former may sometimes lead to inferences of the wrong tree topology due to the informational loss that comes with working with peptide data. Additionally, we determine the number of concatenated proteins necessary to confidently reconstruct the population/species tree summarizing the relations between humans, chimpanzees and gorillas, as well as those between modern humans, Neanderthals and Denisovans. As expected, increasing the number of proteins in a concatenation enhances resolution, but we note that trees inferred from the full set of collagen and enamel proteins do not necessarily correspond to population trees inferred from genome-wide data. We show this is especially the case in the closely related groups of our recent ancestors. We further demonstrate that while a number of proteins fall within archaic introgressed haplotypes of present-day humans, ancient admixture is not the main source of the observed tree incongruence.
Our study underscores the potential and limitations of utilising palaeoproteomic data in deep time phylogenetic reconstructions, indicating that these will be aided not only by increased recovery of proteins in the future, but also by more careful modeling of evolutionary relations across the genome, beyond simply building single phylogenetic trees.
Article activity feed
The first molecular sequences used in the study of evolution were those of proteins. These were supplanted in the 1980s and 1990s by DNA sequence data, which are more informative because there are three nucleotides for every amino acid, because the genome contains vast stretches of DNA that don’t encode protein at all, and because much of this non-coding DNA evolves rapidly. Nonetheless, protein sequences are making a comeback because proteins are less fragile than DNA. It is often possible to recover protein sequences from fossil organisms that yield no DNA, and this allows us to reach back farther into the past. It also gives new life to an old question: how much phylogenetic information is there in a limited sample of proteins? This is the question that Patramanis et al. [1] address. In this effort, they have an advantage that was not available to their predecessors of the 1960s and 1970s: having access to entire genome sequences, they can estimate the phylogenetic tree very accurately. Given this “true” tree, they then ask how often the protein data lead us astray.
I will highlight two findings. The first has to do with the loss of information (which they measure as *entropy*) as one first strips the introns out of genes and then translates the codons that remain into amino acids. Patramanis et al. show that much information is lost in the first of these steps but only a little in the second. This suggests that the lower phylogenetic resolution of proteins results mainly from the absence of introns, not from translating codons into amino acids.
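To make the entropy measure concrete, the following is a minimal sketch of how Shannon entropy could be summed over the columns of an alignment, once for a coding DNA alignment and once for its translation. The toy sequences, the truncated codon table, and the function names are all assumptions made for illustration; this is not the pipeline used by Patramanis et al., and the numbers it produces carry no biological meaning.

```python
import math
from collections import Counter

# Partial codon table covering only the codons in the toy alignment below
# (an assumption for brevity; a real analysis needs the full genetic code).
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCC": "A",
    "AAA": "K", "AAG": "K", "GAT": "D", "GAA": "E",
}

def translate(seq):
    """Translate an in-frame coding sequence codon by codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))

def column_entropy(column):
    """Shannon entropy (bits) of the character frequencies in one column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def total_entropy(seqs):
    """Sum of per-column entropies over an alignment of equal-length rows."""
    return sum(column_entropy(col) for col in zip(*seqs))

# Hypothetical four-sequence coding alignment (not real hominid data).
dna = [
    "ATGGCTAAAGAT",
    "ATGGCCAAAGAT",
    "ATGGCTAAGGAA",
    "ATGGCCAAGGAA",
]
protein = [translate(s) for s in dna]

print("DNA entropy (bits):    ", total_entropy(dna))
print("Protein entropy (bits):", total_entropy(protein))
```

In this toy case two of the three variable DNA columns are synonymous, so they vanish on translation. The sketch only shows how the two quantities are computed; it does not reproduce the finding that removing introns costs more information than translating codons.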
The second finding is fascinating, because it presents us with a puzzle. Patramanis and his colleagues study phylogenetic problems at two different time scales: the phylogeny of the great apes and humans, which has a time depth of about 6 Ma, and that of modern humans, Neanderthals, and Denisovans, which has a depth of about 0.6 Ma. At the deeper time scale, their results are unsurprising: the more proteins one uses, the greater the chance of getting the right answer. Not so, however, for the shallower time scale. Four proteins are better than one, but subsequent proteins yield no improvement. Indeed, each additional protein increased the support for one particular incorrect tree, in which moderns and Neanderthals are sister taxa, and Denisovans are distant relatives. I will call this the "((M,N),D) tree."
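One simple way to see how few sites bear on a three-taxon question is to count, in a concatenated alignment, the positions where exactly two of the three ingroup sequences share a state that differs from an outgroup. The sketch below does this with a parsimony-style site count. It is not the tree inference method of Patramanis et al.; the sequences are invented, and the function name and the use of a single outgroup sequence are assumptions.

```python
def count_topology_support(m, n, d, outgroup):
    """Count sites supporting each rooted three-taxon topology.

    A site supports a pairing when the two paired taxa share a state,
    while the third taxon matches the outgroup in a different state.
    Singleton changes and invariant sites are uninformative and skipped.
    """
    support = {"((M,N),D)": 0, "((M,D),N)": 0, "((N,D),M)": 0}
    for a, b, c, o in zip(m, n, d, outgroup):
        if a == b and a != c and c == o:
            support["((M,N),D)"] += 1
        elif a == c and a != b and b == o:
            support["((M,D),N)"] += 1
        elif b == c and b != a and a == o:
            support["((N,D),M)"] += 1
    return support

# Invented amino acid sequences (hypothetical, not real protein data).
outgroup    = "MKTLAGSPQE"
modern      = "MRTLAGTPQE"
neanderthal = "MRTLVGSPQE"
denisovan   = "MKTLVGSPQD"

support = count_topology_support(modern, neanderthal, denisovan, outgroup)
for topology, sites in support.items():
    print(topology, sites)
```

With only one site favouring each of two competing trees, the toy alignment cannot discriminate between them, which mirrors the general point that short, conserved proteins leave very few informative positions at shallow time depths.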
This raises the question of admixture: perhaps the copies of these genes carried by modern humans are enriched with DNA derived from admixture with Neanderthals. Patramanis et al. show that a Neanderthal haplotype in one of these genes is at elevated frequency in some populations. To control for this, they tried restricting the modern human sample to Africans, who show less evidence of archaic admixture. This did improve things a bit, but the protein data continued to support the ((M,N),D) tree.
The authors suggest that at this time scale, protein sequences are simply not very informative, and I’m sure this is true. Yet it remains puzzling that each additional protein adds support for a single incorrect tree. This suggests that some other factor may also be at work, and I will suggest one possibility. Current methods for localizing admixture within the genome work by searching for intact haplotypes derived from other populations. This works well as long as the haplotypes remain intact. But over time, recombination breaks haplotypes into smaller and smaller fragments, which eventually become undetectable. This problem is acute in regions of high recombination and when the episode of admixture is ancient. As the authors observe, two sorts of admixture could generate the ((M,N),D) tree: moderns could carry Neanderthal DNA, or Denisovans could carry DNA from a distantly-related "superarchaic" population [2–7]. The authors control for the first of these by restricting the modern human sample to Africans, but there is no obvious way to control for the second. It will be interesting, in future research, to find out whether these genes are enriched for either form of admixture, with admixed haplotypes that are too small for current methods to detect.
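The shrinking of admixed haplotypes over time can be put in rough numbers. Under a common simplification, the expected length of an unbroken introgressed tract after t generations is about 1/(r t), where r is the recombination rate per base pair per generation. The sketch below applies that approximation. The rate of 1e-8 and the generation counts are illustrative assumptions rather than values from the article, and the formula ignores admixture proportion, selection, and variation in recombination rate along the genome.

```python
def expected_tract_length_bp(generations, recomb_rate_per_bp=1e-8):
    """Approximate mean length (bp) of an unbroken admixed tract.

    Uses the simplification 1 / (r * t); both arguments must be positive.
    """
    if generations <= 0 or recomb_rate_per_bp <= 0:
        raise ValueError("generations and recombination rate must be positive")
    return 1.0 / (recomb_rate_per_bp * generations)

# Hypothetical admixture ages in generations, chosen only to show the trend.
for t in (2_000, 20_000, 50_000):
    length_kb = expected_tract_length_bp(t) / 1_000
    print(f"{t:>6} generations: about {length_kb:.1f} kb")
```

Because the expected length falls in inverse proportion to the age of the admixture event, tracts from a very old episode are expected to be many times shorter than those from a recent one, which is why haplotype-based detection loses power for ancient gene flow.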
References
[1] Ioannis Patramanis et al. "Assessing the potential of ancient protein sequences in the study of hominid evolution". bioRxiv (2025), ver. 3 peer-reviewed and recommended by PCI Evolutionary Biology. https://doi.org/10.1101/2025.04.08.647730.
[2] Martin Kuhlwilm et al. "Ancient gene flow from early modern humans into Eastern Neanderthals". Nature 530.7591 (2016), pp. 429–433. https://doi.org/10.1038/nature16544.
[3] Kay Prüfer et al. "The complete genome sequence of a Neanderthal from the Altai Mountains". Nature 505.7481 (2014), pp. 43–49. https://doi.org/10.1038/nature12886.
[4] Kay Prüfer et al. "A high-coverage Neandertal genome from Vindija Cave in Croatia". Science 358.6363 (2017), pp. 655–658. https://doi.org/10.1126/science.aao1887.
[5] Alan R. Rogers, Nathan S. Harris, and Alan A. Achenbach. "Neanderthal-Denisovan ancestors interbred with a distantly-related hominin". Science Advances 6.8 (2020), eaay5483. https://doi.org/10.1126/sciadv.aay5483.
[6] Peter J. Waddell. "Happy New Year Homo erectus? More Evidence for Interbreeding with Archaics Predating the Modern Human/Neanderthal Split". ArXiv 1312.7749 (2013). https://doi.org/10.48550/arXiv.1312.7749.
[7] Peter J. Waddell, Jorge Ramos, and Xi Tan. "Homo denisova, correspondence spectral analysis, finite sites reticulate hierarchical coalescent models and the Ron Jeremy hypothesis". ArXiv 1112.6424 (2011). https://doi.org/10.48550/arXiv.1112.6424.
