Unifying the known and unknown microbial coding sequence space

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This paper tackles perhaps THE central question in metagenomics: what are all these unknown genes and genomes doing!? The authors use recent advances in high-throughput sequencing clustering and homology detection algorithms to systematically integrate unannotated genes into discovery workflows. The paper's exploration results in a wide array of highly informative summative statistics, together with a simple example of how powerful the provided resource can be in generating hypotheses about the function of unknown genes.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)

This article has been Reviewed by the following groups

Abstract

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.

Article activity feed

  1. Author Response:

    Reviewer #2:

    Weaknesses:

    • The priority given to metagenomic protein sequences over reference genome sequences in the clustering pipeline is not sufficiently justified. Indeed, the metagenomic coding sequences are notably more likely to be fragmented due to challenges in assembly. A combined clustering of both would present a conceptually simpler and potentially less biased workflow. Likewise, the conceptual division between genomic and environmental genes belies their mutual importance in discovering unknown functions.

    We have clarified in the text the rationale for the decisions we took. Briefly, by using metagenomic data instead of reference genomes as the initial data, we can show the robustness of AGNOSTOS when dealing with noisy and incomplete data. Most of the studies that will use our methods will use data derived from metagenomes (contigs or MAGs), and it is crucial to show that our validation and refinement steps perform as expected. Later, we added the GTDB sequences to show the capability of AGNOSTOS to enrich already processed data. The results of clustering both data sets together or updating the existing gene clusters will be almost identical, but by doing it in two steps, one can track the dynamics of the singletons, the stability of the gene clusters and many other interesting processes that can provide a better understanding of the data. One can also “paint” the gene clusters by integrating other sources of data, such as enzymatic sequences, as we did in Dittmar et al., 2021, where we integrated CAZymes, KEGG and other data sources into the seed database used in this manuscript.

    • The authors do not compare their methods to other possible ways to identify the unknown fraction. It is therefore unclear how much better than a naive approach it might be. Likewise, it is worthwhile to question the sensitivity of their results to analysis parameters. As a suggestive example, in the one case where they did compare possible parameter values (the systematic selection of the inflation parameter for MCL clustering of gene clusters into super-clusters; Supplemental Figure 7-1), the selected values resulted in distinctly different super-cluster properties compared to all other assessed parameter values. The manuscript would be strengthened by highlighting how the chosen parameters maximize sensitivity to remote homology.

    We moved our comparison against FunkFams out of the supplementary material and expanded it, showing that many of the FunkFam families belong to the known coding sequence space. In addition, we expanded the section where we reanalyze the data from Salazar et al., 2019, in which eggNOG was used to explore the known and unknown fractions of the OM-RGCv2. Here we show that a large proportion of the genes classified under category [S] by eggNOG-mapper correspond to our known fraction. For the remote homology searches we used the cut-offs recommended by the authors of HHblits (https://github.com/soedinglab/hh-suite/wiki#how-can-i-verify-if-a-database-match-is-homologous).

    • It is not clear why super-clusters ("cluster communities") are identified within each of the cluster classifications (Known / Genomic Unknown / etc.) instead of across all four groups. Intuitively, this would present the opportunity to detect distant homology between clusters with known and unknown function.

    We improved the part where we explain why we identify the gene cluster communities by category rather than across all gene clusters combined. Briefly, as we are dealing with the unknown, we need a reference to evaluate the quality of the MCL clustering. The reference is the known fraction, as we can exploit the information related to the domain architectures to fine-tune the parameter selection and avoid over-splitting or lumping gene clusters together. Then we use the learnt parameters to partition the other categories, where we lack the domain architecture information. One can identify the relationships between known and unknown gene clusters a posteriori by combining the sequences of a gene cluster community and creating new HMM profiles that can be used to link knowns with unknowns.
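
    The idea of tuning the clustering parameter on the known fraction can be illustrated with a minimal sketch. It assumes that MCL has already been run at several inflation values and that, for each value, the domain architectures of the gene clusters in each community are available. The penalty function, the architecture labels and the community ids below are hypothetical and are not taken from AGNOSTOS; they only show how lumping and over-splitting could be scored against domain architectures.

```python
def partition_penalty(communities):
    """Penalty for one partition of known gene clusters (hypothetical score).

    `communities` maps a community id to the domain architectures of its
    member gene clusters. Lumping counts extra architectures per community;
    splitting counts extra communities per architecture.
    """
    lumping = sum(len(set(archs)) - 1 for archs in communities.values())
    spans = {}
    for community, archs in communities.items():
        for arch in set(archs):
            spans.setdefault(arch, set()).add(community)
    splitting = sum(len(members) - 1 for members in spans.values())
    return lumping + splitting


def select_inflation(partitions):
    """Return the inflation value whose partition has the lowest penalty."""
    return min(partitions, key=lambda inflation: partition_penalty(partitions[inflation]))


# Hypothetical toy partitions of the known fraction, keyed by inflation value.
partitions = {
    1.4: {"c1": ["A|B", "A|B", "C"]},                  # lumped
    2.0: {"c1": ["A|B", "A|B"], "c2": ["C"]},          # one architecture per community
    3.0: {"c1": ["A|B"], "c2": ["A|B"], "c3": ["C"]},  # over-split
}
best = select_inflation(partitions)
```

    Under these assumptions, the selected value would then be reused to partition the categories that lack domain architecture information.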

    • It is not clear why small clusters and those with many fragmented members are removed entirely from downstream analyses, given that the inclusion of additional sequences in later steps would presumably improve the quality of these clusters by adding new representatives.

    We included parts of Supp. Note 11 in the main text to improve the explanation of how we handle singletons. Singletons and gene clusters that did not pass the validation process are not removed but flagged, so the user is aware that these gene clusters might be problematic. In the manuscript, we keep or remove them depending on the downstream analysis and the question we want to answer; the user can decide what to do. For example, for the collector curves shown in Figure 3, we use a subset of the singletons. These singletons are selected based on the information we gathered by integrating GTDB into the metagenomic dataset, combined with the inferred gene abundances from the metagenomes. As explained in the methods section, to minimize the impact of potentially spurious singletons, we removed from the metagenomic dataset the singletons with an abundance smaller than the modal abundance of the singletons that were reclassified as good-quality clusters after integrating the GTDB data. We clarified this point in the main text to avoid confusion.
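
    The modal-abundance filter described above can be sketched as follows. This is a minimal illustration, not the AGNOSTOS implementation: it assumes abundances have been rounded to integers so that a mode is well defined, and the singleton ids, abundance values and function names are hypothetical.

```python
from collections import Counter


def modal_abundance(abundances):
    """Return the most frequent abundance value (smallest value wins ties)."""
    counts = Counter(abundances)
    top = max(counts.values())
    return min(value for value, n in counts.items() if n == top)


def filter_singletons(singleton_abundance, reclassified_ids):
    """Keep singletons whose abundance is at least the modal abundance of
    the singletons that were reclassified as good-quality clusters."""
    reference = [singleton_abundance[s] for s in reclassified_ids]
    threshold = modal_abundance(reference)
    kept = {s: a for s, a in singleton_abundance.items() if a >= threshold}
    return threshold, kept


# Hypothetical toy data: singleton id -> rounded abundance.
singleton_abundance = {"s1": 1, "s2": 3, "s3": 3, "s4": 2, "s5": 7, "s6": 1}
# Hypothetical singletons that joined good-quality clusters after adding GTDB.
reclassified_ids = ["s2", "s3", "s5"]

threshold, kept = filter_singletons(singleton_abundance, reclassified_ids)
```

    In this toy example the reclassified singletons set the threshold, and every singleton below it is dropped from the collector-curve input while remaining flagged in the full dataset.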

    • While maximizing sensitivity to remote homology is appropriate for the overarching goal of characterizing entirely unknown protein clusters, the likely decrease in specificity means that the accuracy of functional annotations and the shared function of all sequences in a cluster are suspect (as the authors are aware). It would have been interesting and valuable to extend the hierarchical clustering framework, already partially developed here, to enable both sensitive and specific annotations.

    We now state explicitly in the main text that we do not perform any functional annotation besides Pfam, as this is not our purpose. We believe that predicting function with sequence similarity methods is not a trivial task and that, in many cases, the prediction might be wrong. Regarding the reviewer's last comment, as shown in Figure 1C, this is possible with the current implementation of AGNOSTOS, as one can use the most adequate level of resolution. Our gene clustering and gene cluster community inference are highly constrained to preserve domain architectures. With this approach we have a good balance between sensitivity and specificity, although some noise is expected, as we show in Figure 2C-D. To support this, we included a new table and figure in the supplementary material, where we evaluated the entropy of the eggNOG annotations at the gene cluster and gene cluster community level. In both cases the entropy values are very low.
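
    The kind of entropy check mentioned above can be sketched with Shannon entropy over the annotation labels of each gene cluster. This is an illustrative sketch only: the response does not specify the exact entropy formulation used, and the cluster ids and annotation labels below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (bits) of a list of labels; 0.0 means all labels agree."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical annotations of the member genes of two gene clusters.
annotations = {
    "gc_1": ["annot_A", "annot_A", "annot_A", "annot_A"],  # homogeneous
    "gc_2": ["annot_A", "annot_A", "annot_B", "annot_B"],  # evenly mixed
}
entropy_by_cluster = {gc: shannon_entropy(labels) for gc, labels in annotations.items()}
```

    A value near zero indicates that the members of a cluster share the same annotation; the same function can be applied to the pooled annotations of a gene cluster community.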

  3. Reviewer #1 (Public Review):

    The main question being tackled in this paper is, how do you include the unknown genes from metagenomes in analytical workflows?

    To that end, the authors quantify the unknown fraction of genes in both genomes and metagenomes, and compare and contrast them across human-associated and marine environments.

    The framing of the problem in the introduction, the discussion of the results, and the thinking about next steps, are particularly well done!

    The methodology employed to generate the results, and the specific results, are high quality; and the implications for the field of both the workflows and the resulting database are immense (and clearly well understood by the authors). This is likely to spur many in-depth explorations that make use of the hypotheses that can now casually be generated.

    Where I think the paper needs the most work is in connecting the results to the discussion. I believe all the pieces are there, but it is hard to sort through the (many) fascinating observations made by the authors and connect them clearly to the discussion.

  4. Reviewer #2 (Public Review):

    Vanni and colleagues set out to catalog the sequence diversity and distribution of proteins identified in metagenomic data where standard methods are unable to assign functional annotations. The authors perform homology-based clustering on a large collection of putative protein coding genes from metagenomic assemblies, with a focus on the HMP and TARA Oceans Survey datasets. By taking a very high-sensitivity, multi-method approach to annotating gene clusters, only clusters without detectable homology are annotated as "unknown". Their pipeline, which is built using Snakemake, involves domain annotation with Pfam/HMMER3, clustering of sequences with MMseqs, remote homology detection using HHblits, and further grouping sequence clusters into super-clusters using MCL. The authors find that, in metagenomic assemblies, the contribution of the unknown fraction to the pool of all genes is smaller than one might have expected, and is dependent on the source of the environmental sample. Nonetheless, the ad hoc clustering of sequences into (operational) protein families shows that the unknown fraction has a very large number of potential functions, and that still more will be discovered with additional samples. Based on an analysis of taxonomic distribution, they find that the unknown fraction is largely composed of gene families that are clade specific, especially at the level of species.

    By de novo clustering putative coding sequences, with a particular eye to identifying truly unknown protein families, the authors demonstrate the value of recently developed, scalable computational methods paired with the explosion of metagenomic data towards increasing the pace of microbial functional gene discovery.

    Strengths:

    - The authors take a systematic and reproducible approach to integrating data from a large corpus of metagenomic libraries and reference databases. By de novo clustering the authors are able to improve the sensitivity of their homology detection, while providing an extendable database of sequence diversity.

    - This manuscript explores some interesting ideas about how we might structure a database of both near and remote sequence homology, specifically the use of super-clusters ("communities of gene clusters").

    - Uniquely, analysis parameters were chosen conservatively to minimize false negatives in homology detection. As a result, their unknown fraction is a convincing representation of the huge diversity of protein families for which functions have confidently *not* been characterized.

    Weaknesses:

    - The priority given to metagenomic protein sequences over reference genome sequences in the clustering pipeline is not sufficiently justified. Indeed, the metagenomic coding sequences are notably more likely to be fragmented due to challenges in assembly. A combined clustering of both would present a conceptually simpler and potentially less biased workflow. Likewise, the conceptual division between genomic and environmental genes belies their mutual importance in discovering unknown functions.

    - The authors do not compare their methods to other possible ways to identify the unknown fraction. It is therefore unclear how much better than a naive approach it might be. Likewise, it is worthwhile to question the sensitivity of their results to analysis parameters. As a suggestive example, in the one case where they did compare possible parameter values (the systematic selection of the inflation parameter for MCL clustering of gene clusters into super-clusters; Supplemental Figure 7-1), the selected values resulted in distinctly different super-cluster properties compared to all other assessed parameter values. The manuscript would be strengthened by highlighting how the chosen parameters maximize sensitivity to remote homology.

    - It is not clear why super-clusters ("cluster communities") are identified within each of the cluster classifications (Known / Genomic Unknown / etc.) instead of across all four groups. Intuitively, this would present the opportunity to detect distant homology between clusters with known and unknown function.

    - It is not clear why small clusters and those with many fragmented members are removed entirely from downstream analyses, given that the inclusion of additional sequences in later steps would presumably improve the quality of these clusters by adding new representatives.

    - While maximizing sensitivity to remote homology is appropriate for the overarching goal of characterizing entirely unknown protein clusters, the likely decrease in specificity means that the accuracy of functional annotations and the shared function of all sequences in a cluster are suspect (as the authors are aware). It would have been interesting and valuable to extend the hierarchical clustering framework, already partially developed here, to enable both sensitive and specific annotations.