The Taxonomy Dictionary: a resource for correct spelling of taxa

Abstract

This article describes ‘The Taxonomy Dictionary’, a resource that extends the spelling engine of a text editor such as Word so that it recognises the correct spelling of every taxon described and listed in the largest taxonomy databases. It contains around 1.4 million unique words. Once it is installed, the spelling engine marks an incorrectly spelled taxon and suggests possible correct spellings. Installation instructions for Firefox, LibreOffice and Microsoft Word can be found in the GitHub repository. The software is licensed under the GPL3 licence.

Article activity feed

  1. This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. Thank you for your submission; we are pleased to accept the revised manuscript. Thank you for the effort taken in addressing the reviewers' comments, especially the additional work carried out on the Python scripts and the general automation of the process. Congratulations, and we encourage submissions to ACMI in the future.

  2. This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. The reviewers have highlighted major concerns with the work presented. Please ensure that you address their comments. Please provide more detail in the Methods section and ensure that software is consistently cited, with its version and parameters included. The reviewers believe the results shown in the manuscript do not support the conclusions presented. Dear Kristian Bagge, Thank you for your submission. The reviewers have raised concerns on a few fronts while highlighting that they support the work in general. Please consider the reviewers' comments, especially those concerning the methods for building the resource and for addressing the database issues. Best wishes, John.

  3. Comments to Author

    This manuscript describes a helper tool for scientists, in the form of a dictionary of recognised taxonomic terms that can be used to populate spellcheck software, and a script that was used to compile that dictionary. As taxonomic terms are often arcane or cryptic, and not usually present in word processing software or other spellcheckers, such a dictionary could be of widespread convenience across a range of fields. Microbiology has a relative advantage in this over some other fields in having a set of agreed high-quality taxonomic resources that can be mined for such terms. The author has acquired multiple such directories of taxonomy, and claims to have compiled a wordlist, with one taxonomic term per line - the "digital dictionary" referred to in the manuscript. The main contribution of the manuscript appears to be to advertise the existence of a text file that can be used as a dictionary file and incorporated into a user's own current spellchecker (presuming the dictionary format is accepted). The "tool" is therefore not a spellchecker in its own right, and I would suggest it might be better described as a "resource" rather than a "tool", as it is an input to a tool, and not an active piece of software. I found the compilation of the dictionary to be incompletely described in the manuscript (e.g. by reference to the script in the project repository). To be fair, the GitHub repository at https://github.com/kbagge/Taxonomy_dictionary/tree/v1.0 does contain a script that appears to have been used to generate the dictionary, and the Zenodo record for this is linked from the paper, but I would still expect to see an outline of the process used to convert the input directories into the final dictionary - this would be expected of a standard methodological description for a bioinformatics paper. 
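The kind of outline asked for here could be as short as the following minimal sketch. It is not the author's script: it assumes a taxonomy source exported as a CSV table, and the column name `scientific_name`, the function `words_from_table` and the sample rows are all hypothetical. It splits each name into words, removes duplicates and emits one word per line.

```python
import csv
import io


def words_from_table(handle, name_column):
    """Collect the unique words found in one column of a CSV table.

    `name_column` is a hypothetical column holding scientific names;
    real taxonomy exports will use their own column names.
    """
    words = set()
    for row in csv.DictReader(handle):
        # "Escherichia coli" contributes two dictionary words.
        words.update(row[name_column].split())
    return sorted(words)


# Illustrative in-memory table standing in for a downloaded export.
sample = io.StringIO(
    "scientific_name,rank\n"
    "Escherichia coli,species\n"
    "Escherichia,genus\n"
    "Bacillus subtilis,species\n"
)
wordlist = words_from_table(sample, "scientific_name")

# One word per line, as in a plain word-list dictionary file.
print("\n".join(wordlist))
```

A description at this level of detail in the Methods section, naming each source and the column or field the names were taken from, would let a reader follow the compilation without opening the repository.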
The repository contains a shell script that provides instructions to the user explaining how the original taxonomy database files were obtained, but does not itself download them. This is a minimal level of reproducibility, but does not automate or make more user-friendly the process of acquiring the input data. For a resource like this I would expect a (relatively) easy to use automated tool to compile the list from the named sources. The impression I gained from the manuscript was that the process of (re)generating the dictionary would be automated but, as the GitHub repository notes: "The repository contains a script that was used to generate the dictionary. You can reproduce it yourself on your machine or get inspired and make your own dictionary for another topic. Please be aware that the script contains some manual steps that must be done before the rest can run. This was unavoidable since some of the databases needs to be downloaded manually others have to be exported from excel format." My view, as a bioinformatician, is that the manual steps are avoidable: downloads and Excel parsing can be automated, and libraries exist in most common programming languages to make, for instance, automated interaction with Excel files possible. I would be sympathetic to overlooking the need for manual downloads if the word list was useful as it stood. However, the word list appears to contain non-taxonomic terms and so has not been compiled cleanly (see https://raw.githubusercontent.com/kbagge/Taxonomy_dictionary/v1.0/taxonomy.dic - commit 97a0350). For example, these terms appear: 01-FULL-49-22b, 01-FULL-54-110, 02-12-FULL-59-9, 02-FULL-45-10c, 02-FULL-45-11b, 02-FULL-45-17b, 0507KN21, 100268sal2, 10-dentatus, 10-fasciata, 10-fasciatum, 10-guttata, 10-guttatus, and I do not think they are all valid, recognised taxonomic terms.
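One way such entries could be caught is a simple plausibility filter. The sketch below is a hypothetical heuristic, not part of the repository: the pattern `PLAUSIBLE_WORD` and the function `split_plausible` are illustrative names, and the rule (letters only, with optional internal hyphens) is an assumption that is deliberately strict. Rejected entries are kept in a separate list for manual review rather than silently dropped, since some digit-prefixed epithets may be legitimate historical spellings.

```python
import re

# Hypothetical rule: a plausible dictionary word consists of letters only,
# optionally joined by single internal hyphens. This is an assumption and
# will also reject any valid names that contain digits.
PLAUSIBLE_WORD = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")


def split_plausible(terms):
    """Return (kept, rejected) lists, preserving the input order."""
    kept, rejected = [], []
    for term in terms:
        if PLAUSIBLE_WORD.match(term):
            kept.append(term)
        else:
            rejected.append(term)
    return kept, rejected


# A few of the entries quoted above, mixed with ordinary taxon words.
sample = [
    "01-FULL-49-22b",
    "0507KN21",
    "10-dentatus",
    "Escherichia",
    "coli",
    "Saccharomyces",
]
kept, rejected = split_plausible(sample)
```

Running a check like this on every build would make it visible when a change in a source database format starts leaking identifiers into the word list.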
My view is that these inclusions likely derive from a combination of relatively informal taxonomic directory formats and inadequate testing or incorrect parsing in the script. As the dictionary resource itself doesn't provide the claimed information (i.e. it includes a number of non-taxonomic terms) I do not think it - or the script that generates it - is yet ready for sharing/publication. I do think that the general idea is a good one, and that a fully-automated tool that downloads current data from the appropriate resources and compiles terms into a corresponding database/dictionary would be a publishable resource worth sharing. However, my view is that in its current state neither the script nor the dictionary meets the claims made in the manuscript or provides a reliable, reusable resource. I do think that this would be achievable with a limited amount of extra programming. I also think that the inclusion of a versioning scheme for the dictionary (even date-based versioning) would be an improvement, as it would allow users to know whether their copy of the dictionary was "current," and whether they should upgrade their local copy.
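Date-based versioning of the kind suggested could be as small as the sketch below. The tag format `YYYY.MM.DD` and the helper names `version_tag` and `is_outdated` are assumptions made for illustration; the repository may choose a different scheme.

```python
from datetime import date


def version_tag(release_date: date) -> str:
    """Format a release date as a zero-padded tag such as 2023.01.05."""
    return release_date.strftime("%Y.%m.%d")


def parse_tag(tag: str) -> tuple:
    """Turn a YYYY.MM.DD tag into a tuple of integers for comparison."""
    return tuple(int(part) for part in tag.split("."))


def is_outdated(local_tag: str, latest_tag: str) -> bool:
    """Report whether the local copy predates the latest release."""
    return parse_tag(local_tag) < parse_tag(latest_tag)


# Hypothetical dates, used only to show the comparison.
local = version_tag(date(2023, 1, 5))
latest = version_tag(date(2023, 11, 20))
needs_upgrade = is_outdated(local, latest)
```

Recording the tag in the release name or in a small companion file next to the dictionary would let a user compare their copy against the latest release without downloading the full word list.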

    Please rate the manuscript for methodological rigour

    Poor

    Please rate the quality of the presentation and structure of the manuscript

    Poor

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  4. Comments to Author

    Taxonomy Dictionary

    This short manuscript describes a lexical application which can be loaded into packages such as Word to help ensure text contains correctly spelt taxonomic names of microbes. I cannot comment on the computational infrastructure used to create the tool. However, the tool itself may be of use to users in the microbiology community and so I am happy to recommend publication. My only concern is that the tool as presented here seems to be a 'single shot' collation and filtering of names from the key databases. However, these databases expand by thousands of names per annum. It would be interesting to know if there are plans to make this an iterative resource, i.e., will periodic updates from the databases be incorporated (in addition to the manual updates hinted at in line 78)?

    Minor comments:
    Lines 22 and 67: 1.412.046 might be clearer as "1.41 million" or "1,412,046" (as in line 55).
    Line 34: "public available, links" should be "publicly available; links".
    Line 49: "aspect are" should be "aspect is".
    Line 50: "process have" should be "process has".
    Line 56: "major" would read better than "biggest".
    Lines 59-61: "and fungi - that being; International… MycoBank [6] have been added." would read better as "and fungi have been added i.e., International… MycoBank [6]."
    Line 74: "autosuggestions are not always on spot." is a little vague. Perhaps "autosuggestions may be subject to error,".
    Line 78: should read "and I will try to".
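If periodic updates are adopted, each rebuild could be compared with the previous release so that additions and removals are reported. The following is a minimal sketch under that assumption; `diff_wordlists` and the sample word lists are hypothetical and not taken from the repository.

```python
def diff_wordlists(old_words, new_words):
    """Return (added, removed) words between two releases, each sorted."""
    old_set, new_set = set(old_words), set(new_words)
    added = sorted(new_set - old_set)
    removed = sorted(old_set - new_set)
    return added, removed


# Illustrative word lists standing in for two dated releases.
previous_release = ["Bacillus", "Escherichia", "coli"]
current_release = ["Bacillus", "Escherichia", "coli", "subtilis"]

added, removed = diff_wordlists(previous_release, current_release)
print(f"{len(added)} added, {len(removed)} removed")
```

A short summary like this in each release note would show users how much the source databases have grown since their copy was installed.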

    Please rate the manuscript for methodological rigour

    Good

    Please rate the quality of the presentation and structure of the manuscript

    Satisfactory

    To what extent are the conclusions supported by the data?

    Strongly support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes