Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants


Abstract

Background

Genome-wide gene function annotations are useful for hypothesis generation and for prioritizing candidate genes potentially responsible for phenotypes of interest. We functionally annotated the genes of 18 crop plant genomes across 14 species using the GOMAP pipeline.

Results

By comparison to existing GO annotation datasets, GOMAP-generated datasets cover more genes, contain more GO terms, and are similar in quality (based on precision and recall metrics using existing gold standards as the basis for comparison). From there, we sought to determine whether the datasets across multiple species could be used together to carry out comparative functional genomics analyses in plants. To test the idea and as a proof of concept, we created dendrograms of functional relatedness based on terms assigned for all 18 genomes. These dendrograms were compared to well-established species-level evolutionary phylogenies to determine whether trees derived were in agreement with known evolutionary relationships, which they largely are. Where discrepancies were observed, we determined branch support based on jackknifing then removed individual annotation sets by genome to identify the annotation sets causing unexpected relationships.
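To illustrate the precision and recall comparison mentioned above, here is a minimal sketch of one simple, set-based way to score a predicted annotation set against a gold standard. The function name and the gene and term labels are made up for illustration. The manuscript may well use hierarchy-aware variants of these metrics, so this should not be read as the authors' exact evaluation.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (gene, term) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: identifiers are hypothetical, not real genes or GO terms.
gold = {("gene1", "term_a"), ("gene1", "term_b"), ("gene2", "term_c")}
predicted = {
    ("gene1", "term_a"),
    ("gene2", "term_c"),
    ("gene2", "term_d"),
    ("gene3", "term_e"),
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the predicted set covers more genes and terms than the gold standard, which lowers precision even though most gold-standard pairs are recovered.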

Conclusions

GOMAP-derived functional annotations used together across multiple species generally retain sufficient biological signal to recover known phylogenetic relationships based on genome-wide functional similarities, indicating that comparative functional genomics across species based on GO data holds promise for generating novel hypotheses about comparative gene function and traits.

Article activity feed

  1. Background

    **Reviewer 2. Alexandre R. Paschoal**

    The authors present "Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants". It appears to be an application of the GOMAP pipeline to 14 species, which sounds interesting. However, it lacks polish, which is why I list some suggestions to help the authors improve it.

    • Major:
      1-) The state of the art for this topic is not clear. It is not detailed in the introduction, which is a serious gap in this work, and tools or methods with a similar or identical purpose are not covered. Please compare against tools from the literature that address the same problem. I am not an expert in the GO topic, but as far as I know there is Blast2GO, and I found others: https://www.mdpi.com/1999-5903/13/7/172 https://academic.oup.com/nar/article/49/D1/D394/6027812?login=true https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13285 https://david.ncifcrf.gov/ Blast2GO, etc.
      1.2.) Please compare against these tools, or those that make sense. If not, why not? Please make this clear. PS: Rewrite the introduction to address these points.
      2-) The idea is to do a large-scale analysis of several plant species (in this case 14) using the GOMAP pipeline. Is that right?
      2.1-) Please state clearly, in the abstract and introduction, how many contributions there are and what each one is.
      2.2-) There are now more than 70 plant species in Ensembl Plants, for example. Why not use as many of them as possible for a truly large-scale analysis?
      3-) If I understood correctly, the GitHub repository (https://github.com/Dill-PICL/GOMAP-Paper-2019.1) and https://dill-picl.org/projects/gomap/gomap-datasets/ contain all the data and results from this report. Is that right?
      3.1.) Both (and mainly GitHub) are far from user-friendly. It seems the authors put the information there and that is it. Please be clearer about what information is there, how and why it is there, and how to use it. Also, are all the commands used clearly documented (maybe provide a manual describing what you have done from a technical standpoint)?
      3.2.) Is there any visualization table, I mean an easy-to-read output produced by this analysis? If I want to use this data for my new genome in a real case, how do I use it? How do I compare, and where? By sequence, by GO terms, etc.? Please clarify this. PS: Imagine a biologist who wants to use your approach.
      4-) It is not clear to me why Ensembl Plants was not also included in this analysis, and only Phytozome and Gramene were. Please include and compare all these databases.
      5-) The authors mention that they will make the final results available in Zenodo after this revision. Please make all data, FASTA files, trees, etc. available.
    • Minor:
    • How often do you expect to update this tool? Please make this point clear.
    • Could you clarify all the differences between your work and the work of Zhu et al.?
    • Do you expect to report any significance (bootstrap) values on the tree figures?
    • Pages 5/6: there are no blank lines in section D, and there are some unresolved "??" references to figures and citations. Please correct this.
  2. Abstract

    This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.65), and the reviews have been published under the same license. These are as follows.

    Reviewer 1. Leonore Reiser

    Reviewer Comments to Author: The authors present a detailed assessment and creative analysis of computationally predicted functional annotations for 18 plant genomes. First they applied their GOMAP pipeline to annotate the genomes, and compared those outputs against a 'Gold Standard' of Gramene annotations (minus those inferred from Electronic Annotation) and electronically inferred annotations from Gramene or Phytozome. They then used the GOMAP annotation set in an interesting way to perform a sort of phylogenetic reconstruction.

    First, I applaud the authors for presenting a manuscript that is a paragon of data FAIRness. The data is findable, accessible, well annotated with metadata and certainly looks reusable (what a pleasure to have the option to download as a CSV.) Bravo! Brava!

    The idea of recapitulating phylogenetic relationships based on GO annotations is an interesting one, and while the authors do a good job of addressing some of the caveats and limitations of their analysis, I do wonder if there are other things that they may want to consider. For example, a lot of plant annotations are based on Arabidopsis experimental annotations, which means that some aspects of plant biology that are unique to specific clades may not be well represented in the ontology, because those processes have not been annotated or the terms may not even exist in the ontologies yet. Also, at least for Arabidopsis, many of the included annotations come from PAINT, which is a phylogeny-based annotation method (the IBA annotations), so transferring IBA annotations from Arabidopsis to other plant species might add a certain phylogenetic flavor to the GOMAP results.

    Specific comments on the text.

    1. Please clarify what sets of terms were used and what is meant by ancestors. The MS states that granular terms were mapped to higher-order terms used for comparison; how were those ancestor terms selected? Is there a list of the common (S) terms that were used to generate the trees available somewhere? If so, that subset should be made available (or maybe it is, but I could not tell). I think this selection of terms for use in the analysis is really important, but I could not find any data for it; if the data is available, it is not obvious.

    2. Annotations with modifiers were removed: can you clarify what is meant by that? Are those 'Not' annotations?

    3. One expects a high level of granularity (that is, very specific terms) for manually curated gene functions. How are annotations harmonized across the different prediction methods used for GOMAP, since presumably some of the methods employed provide less specificity in their annotations?

    4. For the comparison, was there any manual inspection of the presence or absence of terms? Was there any correspondence with anything known biologically? That is, for certain term character states, were any unexpected or inconsistent with biology?

    5. The phylogenetic analysis seems to factor in all 3 GO aspects. Have the authors compared results using just a single aspect (process, function, or component)? Process is notoriously noisy, and its annotations can be subject to a lot of interpretation. It is also probably the most incomplete data set.
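As an illustration of the first comment above (how granular terms relate to their ancestors), the following minimal sketch expands a set of terms with every ancestor reachable through parent links, and then intersects the expanded sets of two genomes. The toy ontology, its labels, and the function name are hypothetical; they are not real GO identifiers and not the authors' implementation.

```python
# Toy ontology: each term maps to its direct parents (hypothetical labels).
PARENTS = {
    "leaf_x": ["mid_1"],
    "leaf_y": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
    "root": [],
}


def with_ancestors(terms, parents):
    """Return the input terms plus every term reachable via parent links."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already expanded through another path
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


# Two genomes annotated with different granular terms.
genome_a = with_ancestors({"leaf_x"}, PARENTS)
genome_b = with_ancestors({"leaf_y"}, PARENTS)

# Terms shared after expansion could serve as common characters for a tree.
shared = genome_a & genome_b
print(sorted(shared))
```

The two genomes share no granular term here, yet they overlap once ancestors are added, which is why the choice of ancestor terms matters for the resulting trees.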

    Specific comments on the figures.

    1. Panel b. "Gramene -IEA" is confusing here in the figure and when described elsewhere. I suggest using less confusing nomenclature in the figure and the text, such as Gramene (IEA only), and Gramene (no IEA) for the gold standard. I read "Gramene-IEA" as Gramene minus IEA annotations, not as Gramene's IEA annotations only.

    2. Supplementary Figure S1. I wonder if there is a more effective way to visualize this data. I think there is a lot of interesting information here, but it is hard to follow, especially the third graph. Another improvement to readability would be to make the text font darker (not sure why it is light grey).
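To make the naming distinction raised in the first figure comment concrete, here is a small sketch that splits one annotation set into an "IEA only" subset and a "no IEA" subset by evidence code. The `Annotation` record, the function name, and the sample rows are assumptions for illustration; only the evidence codes themselves (IEA, IDA, IMP) are standard GO codes.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str  # GO evidence code, e.g. "IEA", "IDA", "IMP"


def split_by_iea(annotations):
    """Split annotations into (IEA only, no IEA) lists by evidence code."""
    iea_only = [a for a in annotations if a.evidence == "IEA"]
    no_iea = [a for a in annotations if a.evidence != "IEA"]
    return iea_only, no_iea


# Hypothetical rows; gene and term labels are placeholders.
rows = [
    Annotation("gene1", "term_a", "IEA"),
    Annotation("gene1", "term_b", "IDA"),
    Annotation("gene2", "term_c", "IMP"),
    Annotation("gene3", "term_d", "IEA"),
]

iea_only, no_iea = split_by_iea(rows)
print(len(iea_only), "IEA only;", len(no_iea), "no IEA")
```

Under this reading, the "no IEA" subset is what would serve as the gold standard, and the "IEA only" subset is the electronically inferred comparison set.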