Annotating Metagenomically Assembled Bacteriophage from a Unique Ecological System using Protein Structure Prediction and Structure Homology Search

Abstract

Emergent long-read sequencing technologies such as the Oxford Nanopore platform are invaluable for constructing high-quality, complete genomes from a metagenome, and are needed to investigate unique ecosystems at the genetic level. However, generating informative functional annotations from sequences that are highly divergent from existing nucleotide and protein sequence databases is a major challenge. In this study, we present wet- and dry-lab techniques that allowed us to generate 5432 high-quality, sub-genomic-sized metagenomic circular contigs from 10 samples of microbial communities. This unique ecological system exists in an environment enriched with naphthenic acid (NA), which is a major toxic byproduct of crude oil refining and the major carbon source for this community. Annotation by sequence homology alone was insufficient to characterize the community, so as a proof of principle we took a subset of 227 putative bacteriophage and greatly improved our existing annotations by predicting the structures of hypothetical proteins with ColabFold and searching for structural homologs with Foldseek. The proportion of proteins for each bacteriophage that were highly similar to known proteins increased from approximately 10% to about 50%, while the proportion of annotations with KEGG or GO terms increased from essentially 0% to 15%. Therefore, protein structure prediction and structural homology searches can produce more informative annotations for microbes in unique ecological systems. The characterization of novel microbial ecosystems involved in the bioremediation of crude-oil-process-affected wastewater can be greatly improved, and this method opens the door to the discovery of novel NA-degrading pathways.

IMPORTANCE

Functional annotation of metagenome-assembled sequences from novel or unique microbial communities is challenging when the sequences are highly dissimilar to organisms or proteins in the known databases. This is a major obstacle for researchers attempting to characterize the functional capabilities of unique ecosystems. In this study, we demonstrate that including methods based on protein structure prediction and structural homology search vastly improves the annotation of predicted genes identified in novel putative bacteriophage in a bacterial community that degrades naphthenic acids, the major toxic component of oil refinery wastewater. This method can be extended to similar genomics studies of unique, uncharacterized ecosystems to improve their annotations.

Article activity feed

  1. Data and Code Availability

    It would be great if you could make the polished assemblies or assembled contigs analyzed in this study available, since it takes quite a bit of work to get to that point.

  2. querying only the best Foldseek hits, which are filtered for an e-value greater than 1e-10,

    Did you take into account other filtering criteria, such as TM-score? Or analyze how the E-value cutoff corresponded to TM-score?

  3. After the initial assembly, additional assemblies were yielded using a secondary assembly pipeline. Briefly, reads for a given sample were aligned to uncircularized contigs obtained from the same sample with Minimap2 v2.24 (21) and were binned using MetaBAT2 v2.12.1

    So this was done prior to polishing?

  4. Near perfect TM scores within most clusters show that the same putative best structural homolog was often seen in samples widely separated by time,

    Here the TM-score is used to compare protein structures seen in multiple samples?

  5. This threshold could be important, since the phage query protein is probably going to be a lot shorter than the target protein if, for example, it is structurally homologous to an annotated bacterial protein.

  6. This result allowed us to identify a number of functions and pathways present in putative bacteriophage ACCs in this sample,

    For the structural approach, did any protein with a Foldseek structural hit carrying a functional annotation count, or was there a threshold (such as a TM-score cutoff) that had to be met before a protein was considered to have a good hit?

  7. structural homology vs. the entire universe of known and predicted protein structures using Foldseek (9).

    Was this using the Foldseek server? Or what databases did you compare against to consider for functional information?

  8. For this, we collected circular MAGs of > 1 Mbp

    Could you be throwing out candidate phyla that have small genomes but are likely circular with this filter? For example, I think Patescibacteria are smaller than 1 Mbp and usually fall somewhere around 80% complete with CheckM, but end up as circular contigs.

  9. INHERIT package (11) which assigns scores based on the inferred likelihood of being bacteriophage; of these, 227 bacteriophage were predicted.

    Did you try other viral/phage predicting software such as VIBRANT etc.?

  10. The biofilm initially aids in the remediation of wastewater from NAs, but ultimately overgrowth fouls the GAC beds necessitating frequent and regular exchange. As such, the samples collected in this study presents a unique opportunity to investigate and annotate NA degrading bacteria as it serves as a natural experiment.

    From the title of the manuscript and the abstract/importance sections, I'm somewhat confused about whether this manuscript will focus on structure-based annotation of bacteriophage sequences only, or of bacteria as well.

  11. Although many tools and techniques

    For clarity, I would start the introduction with this paragraph, since the manuscript focuses on annotating sequences by structure rather than on long-read sequencing technology itself.

  12. In this study, we present wet and dry lab techniques which allowed us to generate 5432 high quality sub-genomic sized metagenomic circular contigs from 10 samples of microbial communities. This unique ecological system exists in an environment enriched with naphthenic acid (NA), which is a major toxic byproduct in crude oil refining and the major carbon source to this community. Annotation by sequence homology alone was insufficient to characterize the community,

    Are these sentences referring to circular contigs that are proposed to be phage, or just circular contigs in general? From the title I infer that you are only focused on phage sequences, but these sentences make it seem as if you are trying to annotate everything through a structural approach.
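
Comments 2, 5 and 6 above all ask how Foldseek hits were, or could be, filtered. As a minimal sketch of what a combined filter might look like, the Python below keeps hits that pass an E-value cutoff, a TM-score cutoff and a query-coverage cutoff, then picks the best remaining hit per query. Everything in it is an assumption for illustration rather than the authors' method: the column names (`query`, `target`, `evalue`, `alntmscore`, `qcov`, `tcov`) assume a tab-separated hit table with a header row, the example rows are invented, and the thresholds are placeholders (a TM-score of 0.5 is a commonly cited rule of thumb for a shared fold, not a value taken from the manuscript).

```python
import csv
import io

# Hypothetical Foldseek-style hit table (tab-separated). All rows are invented
# for illustration; they are not results from the manuscript.
EXAMPLE_TSV = (
    "query\ttarget\tevalue\talntmscore\tqcov\ttcov\n"
    "phage1_gene3\tAF-A\t1e-25\t0.82\t0.95\t0.40\n"
    "phage1_gene3\tAF-B\t1e-12\t0.45\t0.90\t0.85\n"
    "phage1_gene7\tAF-C\t1e-4\t0.71\t0.88\t0.90\n"
    "phage2_gene1\tAF-D\t1e-15\t0.66\t0.30\t0.95\n"
)


def filter_hits(rows, max_evalue=1e-10, min_tmscore=0.5, min_qcov=0.8):
    """Keep hits passing all three cutoffs (placeholder thresholds).

    Target coverage (tcov) is deliberately not constrained: a short phage
    query may align to only part of a longer annotated bacterial target.
    """
    kept = []
    for row in rows:
        if (
            float(row["evalue"]) <= max_evalue
            and float(row["alntmscore"]) >= min_tmscore
            and float(row["qcov"]) >= min_qcov
        ):
            kept.append(row)
    return kept


def best_hit_per_query(rows):
    """Return the lowest-E-value hit for each query, keyed by query name."""
    best = {}
    for row in rows:
        query = row["query"]
        if query not in best or float(row["evalue"]) < float(best[query]["evalue"]):
            best[query] = row
    return best


rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
best = best_hit_per_query(filter_hits(rows))
for query, hit in best.items():
    print(query, hit["target"], hit["evalue"], hit["alntmscore"])
```

Running the same table with `min_tmscore=0.0` and `min_qcov=0.0` reproduces an E-value-only filter, so comparing the two outputs is one simple way to see how many E-value-passing hits would also survive a TM-score or coverage requirement.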