Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families

Abstract

Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most information published on members of a protein family in the least amount of time. To complement this workflow, web-based platforms allowing for the exploration of protein family members across sequenced genomes or for the analysis of gene neighbourhood information were reviewed for their versatility and ease of use. Recommendations that can be used for experimentalist users, as well as educators, are provided and integrated within a customized, publicly accessible Wiki.

Article activity feed

  1. that many researchers continue to resort to “reinventing the wheel” instead of developing tools to bridge the gaps between different resources or, alternatively, work with existing resources to improve their interoperability

    I think this is a really good point. I suspect this is partly because bridging resources is not as flashy as building a 'new' tool. It would be great if there were a way to incentivize work that improves interoperability. I wonder if some kind of 'hackathon' with a focused goal of bridging a specific gap between resources could encourage this.

  2. Figure 4.

    It might be worth just combining d and e (for example, putting e as an inset in d). The way it is set up now is a little weird, and it took me a bit to notice that e was physically associated with d. For c/f, you could consider adding a small legend in the figure, like "# of hits" or something, so it is interpretable without the figure legend. It is also not immediately clear from looking at the figure what is going into the Venn diagram in f.

  3. Query yield distributions per search tool

    I've never heard of WorldWideScience.org or ScienceResearch.org, but both seem to produce a lot of hits compared to the other tools. Can you comment on whether you think these hits are informative or erroneous?

  4. choice of search engine used for text-based queries is ultimately up to the user and examples of the commonly used platforms or “engines” include but are not limited to PubMed, Google Scholar, and Europe PMC.

    I would love to see the results from searching on bioRxiv!

  5. Protein Family Case Study and Literature Review, Curation

    Having a more complete methods section would be really beneficial here so others could recreate your search steps for your example protein, or for another protein of interest. It might be useful to frame this more as a how-to guide or operating manual for learning about proteins. Right now there are a lot of interesting comparisons of the efficacy of different tools, but I think there is also an opportunity here to help onboard people onto these tools by making your workflow and methodology more explicit.

  6. The first step in any protein family analysis requires the gathering of input data (e.g., a sequence or an identifier) that will be used as seed information for queries (Fig. 1 and Fig. S1-2) . This process generates two master lists: 1) a list of identifiers, gene/protein names; and 2) a list of representative sequences. Protein family databases such as Pfam [25], InterPro [26], CDD [27], EggNOG [28] are essential tools in generating these two lists.

    It would be helpful here to be clearer about the methods and the order of operations actually used to retrieve this information. For instance, is a gene/protein family name the initial input for the search in these 4 databases? How is the input name selected? How is the output data downloaded and processed? What are the sanity checks to make sure your search is generating useful and on-target information?

  7. Creating a Wiki compiling a non-exhaustive list of web-based resources organized into pedagogical modules for microbiologist

    This is such a great initiative! I'd like to highlight some of my fave tools here in case it's useful:

    clinker is my go-to tool for generating gene neighborhood comparisons and figures (https://github.com/gamcil/clinker). It is command line, but it also looks like it can run through the CAGECAT webserver, though I have not tried that yet.

    ViPTree is a really useful webserver application for comparing viral genomes (I've only tried it with phages) and making trees: https://www.genome.jp/viptree/

    ProteinCartography is our in-house tool (https://research.arcadiascience.com/pub/resource-protein-cartography/release/7) for pulling protein sequence and structural homologs and visually exploring the data. Please check it out if it seems relevant, and tell us whether or not it is useful for you!

  8. https://vdclab-wiki.herokuapp.com/

    This wiki is awesome! As someone who did have to discover these resources individually over time, it is so nice to have all of this in one place that is easy to navigate! I especially like the 'tools by objective' organization framework.

    I also appreciate the 'ease of use' section that some tools have. Nice to know what you are getting into.
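
As a concrete illustration of the order-of-operations question in comment 6 and the hit-quality question in comment 3, here is a minimal Python sketch of what an explicit workflow step could look like. It is not the authors' method. The function names (`build_query`, `score_hits`), the boolean OR query syntax, and the placeholder reference IDs are all assumptions made for illustration. The idea is simply to turn the master list of identifiers into one reproducible query string, then compare each tool's hits against the curated reference set as a sanity check.

```python
from typing import Dict, Iterable


def build_query(identifiers: Iterable[str]) -> str:
    """Join seed identifiers into one boolean OR query (hypothetical syntax).

    Terms are stripped, de-duplicated, sorted, and quoted so the same
    master list always yields the same query string.
    """
    terms = sorted({name.strip() for name in identifiers if name.strip()})
    return " OR ".join(f'"{term}"' for term in terms)


def score_hits(hits: Iterable[str], curated: Iterable[str]) -> Dict[str, object]:
    """Compare one tool's hits against a curated reference set.

    recall     = fraction of curated references the tool found
    precision  = fraction of the tool's hits that are in the curated set
    unreviewed = hits absent from the curated set; these still need manual
                 review, since they may be new relevant papers or noise
    """
    hit_set, gold = set(hits), set(curated)
    on_target = hit_set & gold
    return {
        "recall": len(on_target) / len(gold) if gold else 0.0,
        "precision": len(on_target) / len(hit_set) if hit_set else 0.0,
        "unreviewed": sorted(hit_set - gold),
    }


if __name__ == "__main__":
    # Family names taken from the abstract; the reference IDs are placeholders.
    seed_names = ["DUF34", "NIF3", "Ngg1-interacting factor 3", "NIF3 "]
    print(build_query(seed_names))

    curated_refs = ["ref1", "ref2", "ref3", "ref4"]
    tool_hits = ["ref1", "ref2", "ref9"]
    print(score_hits(tool_hits, curated_refs))
```

A tool that returns many hits but scores low precision against the curated set is not necessarily wrong, because the "unreviewed" hits may include relevant papers missing from the curated list. The sketch only flags them for manual inspection.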