Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families

Colbie J. Reed
Rémi Denise
Jacob Hourihan
Jill Babor
Marshall Jaroch
Maria Martinelli
Geoffrey Hutinet
Valérie de Crécy-Lagard

Abstract

Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most information published on members of a protein family in the least amount of time. To complement this workflow, web-based platforms allowing for the exploration of protein family members across sequenced genomes or for the analysis of gene neighbourhood information were reviewed for their versatility and ease of use. Recommendations that can be used for experimentalist users, as well as educators, are provided and integrated within a customized, publicly accessible Wiki.

Version published to 10.1099/mgen.0.001183 on Access Microbiology
Feb 7, 2024
Arcadia Science
Jan 5, 2024

that many researchers continue to resort to “reinventing the wheel” instead of developing tools to bridge the gaps between different resources or, alternatively, work with existing resources to improve their interoperability

I think this is a really good point. I suspect this is partly because bridging resources is not as flashy as a 'new' tool. It would be great if there were a way to incentivize work to improve interoperability. I wonder if some kind of 'hackathon' with a focused goal of bridging a specific gap between resources could be a way to encourage this.

Arcadia Science
Jan 5, 2024

Figure 4.

It might be worth combining d and e (for example, putting e as an inset in d). The current layout is a little confusing, and it took me a while to notice that e was physically associated with d. For c/f, you could consider adding a small legend in the figure, such as "# of hits", so that it is interpretable without the figure legend. It is also not immediately clear from the figure what is going into the Venn diagram in f.

Arcadia Science
Jan 5, 2024

Query yield distributions per search tool

I've never heard of WorldWideScience.org or ScienceResearch.org, but they seem to both produce a lot of hits compared to the other tools. Can you comment on whether you think these hits are informative or erroneous?
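One way to put a number on "informative or erroneous" is to score each tool's hits against the curated reference set mentioned in the abstract. The sketch below is only illustrative: the tool names, reference IDs, and the `score_tool` helper are hypothetical, not taken from the preprint, and a hit missing from the curated set is not necessarily wrong, since the curated set may itself be incomplete.

```python
def score_tool(hits, curated):
    """Score one search tool's hits against a curated reference set.

    recall    = fraction of curated references the tool recovered
    precision = fraction of the tool's hits that are in the curated set
    """
    hits, curated = set(hits), set(curated)
    true_hits = hits & curated
    recall = len(true_hits) / len(curated) if curated else 0.0
    precision = len(true_hits) / len(hits) if hits else 0.0
    return {"n_hits": len(hits), "recall": recall, "precision": precision}


# Hypothetical data: reference IDs and tool names are placeholders.
curated_refs = {"ref1", "ref2", "ref3", "ref4"}
tool_hits = {
    "tool_A": {"ref1", "ref2"},
    "tool_B": {"ref1", "ref2", "ref3", "x1", "x2", "x3", "x4", "x5"},
}

scores = {name: score_tool(hits, curated_refs) for name, hits in tool_hits.items()}
for name, score in scores.items():
    print(name, score)
```

In this toy case, the high-yield tool recovers more of the curated set but with a much lower share of on-target hits, which is the kind of trade-off the raw hit counts alone cannot show.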

Arcadia Science
Jan 5, 2024

choice of search engine used for text-based queries is ultimately up to the user and examples of the commonly used platforms or “engines” include but are not limited to PubMed, Google Scholar, and Europe PMC.

I would love to see the results from searching on bioRxiv!

Arcadia Science
Jan 5, 2024

Protein Family Case Study and Literature Review, Curation

A more complete methods section would be really beneficial here, so that others could recreate your search steps for your example protein or for another protein of interest. It might be useful to frame this more as a how-to guide or operating manual for learning about proteins. Right now there are a lot of interesting comparisons of the efficacy of different tools, but I think there is also an opportunity to help onboard people onto these tools by making your workflow and methodology more explicit.

Arcadia Science
Jan 5, 2024

The first step in any protein family analysis requires the gathering of input data (e.g., a sequence or an identifier) that will be used as seed information for queries (Fig. 1 and Fig. S1-2). This process generates two master lists: 1) a list of identifiers, gene/protein names; and 2) a list of representative sequences. Protein family databases such as Pfam [25], InterPro [26], CDD [27], EggNOG [28] are essential tools in generating these two lists.

It would be helpful here to be clearer about the methods and the order of operations used to retrieve this information. For instance, is a gene/protein family name the initial input for the search in these 4 databases? How is the input name selected? How is the output data downloaded and processed? What are the sanity checks to make sure your search is generating useful and on-target information?
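As an illustration of what an explicit order of operations could look like, here is a minimal sketch that merges seed records into the two master lists described in the quoted passage. Everything in it is an assumption for illustration: the record layout, the accessions, the sequence, and the `MasterLists` and `build_master_lists` names are hypothetical and are not the authors' method or any database's export format.

```python
from dataclasses import dataclass, field


@dataclass
class MasterLists:
    """The two master lists: identifiers/names and representative sequences."""
    identifiers: set = field(default_factory=set)  # identifiers and gene/protein names
    sequences: dict = field(default_factory=dict)  # accession -> representative sequence

    def add_record(self, accession, names, sequence=None):
        # List 1: keep the accession plus every non-empty, whitespace-trimmed name.
        self.identifiers.add(accession)
        self.identifiers.update(n.strip() for n in names if n.strip())
        # List 2: simple sanity check, keep only one sequence per accession.
        if sequence and accession not in self.sequences:
            self.sequences[accession] = sequence.upper()


def build_master_lists(records):
    """Merge seed records gathered from several family databases."""
    lists = MasterLists()
    for rec in records:
        lists.add_record(rec["accession"], rec.get("names", []), rec.get("sequence"))
    return lists


# Hypothetical, hand-made records; accessions and the sequence are placeholders.
seed_records = [
    {"source": "Pfam", "accession": "FAM_0001", "names": ["DUF34", "NIF3"]},
    {"source": "InterPro", "accession": "PROT_A", "names": ["NIF3 "], "sequence": "mkte"},
    {"source": "EggNOG", "accession": "PROT_A", "names": ["Ngg1-interacting factor 3"], "sequence": "MKTE"},
]

master = build_master_lists(seed_records)
print(sorted(master.identifiers))
print(master.sequences)
```

Recording the source database with each record, as in the placeholder data, would also make it possible to report which database contributed which identifier.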

Arcadia Science
Jan 5, 2024

gist

Another tool that I have found useful is geNomad, which finds mobile genetic elements in genomes, metagenomes, or metatranscriptomes: https://portal.nersc.gov/genomad/ and https://www.nature.com/articles/s41587-023-01953-y

Arcadia Science
Jan 5, 2024

Figure 2.

This may just be bioRxiv, but the text in the figures is pretty low-resolution and fuzzy.

Arcadia Science
Jan 5, 2024

Creating a Wiki compiling a non-exhaustive list of web-based resources organized into pedagogical modules for microbiologists

This is such a great initiative! I'd like to highlight some of my favorite tools here in case it's useful:

clinker is my go-to tool for generating gene neighborhood comparisons and figures (https://github.com/gamcil/clinker). It's a command-line tool, but it looks like it can also be run through the CAGECAT webserver, though I have not tried that yet.

viptree is a really useful webserver application for comparing viral genomes (I've only tried it with phages) and making trees. https://www.genome.jp/viptree/

ProteinCartography is our in-house tool (https://research.arcadiascience.com/pub/resource-protein-cartography/release/7) for pulling protein sequence and structural homologs, and visually exploring the data. Please check it out if it seems relevant, and tell us if it is or is not useful for you!

Arcadia Science
Jan 5, 2024

https://vdclab-wiki.herokuapp.com/

This wiki is awesome! As someone who did have to discover these resources individually over time, it is so nice to have all of this in one place that is easy to navigate! I especially like the 'tools by objective' organization framework.

I also appreciate the 'ease of use' section that some tools have. Nice to know what you are getting into.

Arcadia Science
Jan 5, 2024

Family-level bioinformatic tools

We recently published a tool that takes proteomes from diverse organisms and infers orthology, gene family trees, species trees, and gene family evolutionary dynamics.

Might be worth checking out and adding to the wiki if you like it (or to uncurated finds)!

https://research.arcadiascience.com/pub/resource-noveltree/release/3

Version published to 10.1101/2023.05.03.539116 on bioRxiv
May 3, 2023

