The Taxonomy Dictionary: a resource for correct spelling of taxa

Kristian Bagge

Abstract

This article describes ‘The Taxonomy Dictionary’, a resource that extends the spelling engine of a text editor such as Word so that it recognises every taxon described and listed in the largest taxonomy databases. It contains around 1.4 million unique words. Once it is installed, the spelling engine marks an incorrectly spelled taxon and suggests possible correct spellings. Installation instructions for Firefox, LibreOffice and Microsoft Word can be found in the GitHub repository. The software is licensed under the GPLv3.

Version published to 10.1099/acmi.0.000521.v3 on Access Microbiology
May 1, 2023
Access Microbiology
Feb 10, 2023

This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. Thank you for your submission; we are pleased to accept the revised manuscript. Thank you for the effort taken in addressing the reviewers' comments, especially the additional work carried out on the Python scripts and the general automation of the process. Congratulations, and we encourage submissions to ACMI in the future.

Version published to 10.1099/acmi.0.000521.v2 on Access Microbiology
Feb 2, 2023
Access Microbiology
Dec 19, 2022

This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. The reviewers have highlighted major concerns with the work presented. Please ensure that you address their comments. Please provide more detail in the Methods section and ensure that software is consistently cited and its version and parameters included. The reviewers believe the results shown in the manuscript do not support the conclusions presented.

Dear Kristian Bagge, Thank you for your submission. The reviewers have raised concerns on a few fronts while highlighting that they support the work in general. Please consider the reviewers' comments, especially those surrounding the methods for building the resource and for addressing the database issues. Best wishes, John.

Access Microbiology
Dec 12, 2022

Comments to Author

This manuscript describes a helper tool for scientists, in the form of a dictionary of recognised taxonomic terms that can be used to populate spellcheck software, and a script that was used to compile that dictionary. As taxonomic terms are often arcane or cryptic, and not usually present in word processing software or other spellcheckers, such a dictionary could be of widespread convenience across a range of fields. Microbiology has a relative advantage in this over some other fields in having a set of agreed high-quality taxonomic resources that can be mined for such terms. The author has acquired multiple such directories of taxonomy, and claims to have compiled a wordlist, with one taxonomic term per line - the "digital dictionary" referred to in the manuscript.

The main contribution of the manuscript appears to be to advertise the existence of a text file that can be used as a dictionary file and incorporated into a user's own current spellchecker (presuming the dictionary format is accepted). The "tool" is therefore not a spellchecker in its own right, and I would suggest it might be better described as a "resource" rather than a "tool", as it is an input to a tool, and not an active piece of software.

I found the compilation of the dictionary to be incompletely described in the manuscript (e.g. by reference to the script in the project repository). To be fair, the GitHub repository at https://github.com/kbagge/Taxonomy_dictionary/tree/v1.0 does contain a script that appears to have been used to generate the dictionary, and the Zenodo record for this is linked from the paper, but I would still expect to see an outline of the process used to convert the input directories into the final dictionary - this would be expected of a standard methodological description for a bioinformatics paper.

The repository contains a shell script that provides instructions to the user explaining how the original taxonomy database files were obtained, but does not itself download them. This is a minimal level of reproducibility, but does not automate or make more user-friendly the process of acquiring the input data. For a resource like this I would expect a (relatively) easy to use automated tool to compile the list from the named sources. The impression I gained from the manuscript was that the process of (re)generating the dictionary would be automated but, as the GitHub repository notes: "The repository contains a script that was used to generate the dictionary. You can reproduce it yourself on your machine or get inspired and make your own dictionary for another topic. Please be aware that the script contains some manual steps that must be done before the rest can run. This was unavoidable since some of the databases needs to be downloaded manually others have to be exported from excel format."

My view, as a bioinformatician, is that the manual steps are avoidable - downloads and Excel parsing can be automated and libraries exist in most common programming languages to make, for instance, automated interaction with Excel files possible. I would be sympathetic to overlooking the need for manual downloads if the word list was useful as it stood. However, the word list appears to contain non-taxonomic terms and so has not been compiled cleanly (see https://raw.githubusercontent.com/kbagge/Taxonomy_dictionary/v1.0/taxonomy.dic - commit 97a0350), e.g. these terms appear: 01-FULL-49-22b, 01-FULL-54-110, 02-12-FULL-59-9, 02-FULL-45-10c, 02-FULL-45-11b, 02-FULL-45-17b, 0507KN21, 100268sal2, 10-dentatus, 10-fasciata, 10-fasciatum, 10-guttata, 10-guttatus, and I do not think they are all valid, recognised taxonomic terms.
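As a rough illustration of the kind of check that would surface entries like those listed above, the sketch below splits a word list into tokens that look like plain taxon words and tokens that deserve manual review. The rule used (letters only, with optional internal hyphens, no digits) is an assumption made for this example. It is not the author's method, and it may flag legitimate names that happen to contain digits. The function name `split_wordlist` is hypothetical.

```python
import re

# Assumption for this sketch: a plausible taxon word consists of letters only,
# optionally joined by single internal hyphens. Digits and other symbols are
# treated as a sign that the token needs manual review.
_TAXON_WORD = re.compile(r"^[^\W\d_]+(?:-[^\W\d_]+)*$")


def split_wordlist(lines):
    """Return (kept, flagged) lists of stripped, non-empty tokens."""
    kept, flagged = [], []
    for raw in lines:
        token = raw.strip()
        if not token:
            continue  # skip blank lines in the dictionary file
        if _TAXON_WORD.match(token):
            kept.append(token)
        else:
            flagged.append(token)
    return kept, flagged


# Terms quoted in the review, plus two well-known names for contrast.
sample = ["Escherichia", "coli", "01-FULL-49-22b", "0507KN21", "10-dentatus"]
kept, flagged = split_wordlist(sample)
print("kept:", kept)
print("flagged for review:", flagged)
```

Flagging rather than deleting keeps the decision with a person, since some digit-bearing tokens may turn out to be accepted names in one of the source databases.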

My view is that these inclusions likely derive from a combination of relatively informal taxonomic directory formats, and inadequate testing/incorrect parsing in the script. As the dictionary resource itself doesn't provide the claimed information (i.e. it includes a number of non-taxonomic terms) I do not think it - or the script that generates it - is yet ready for sharing/publication.

I do think that the general idea is a good one, and that a fully-automated tool that downloads current data from the appropriate resources and compiles terms into a corresponding database/dictionary would be a publishable resource worth sharing. However, my view is that in its current state neither the script nor the dictionary meets the claims made in the manuscript or provides a reliable, reusable resource. I do think that this would be achievable with a limited amount of extra programming. I also think that the inclusion of a versioning scheme for the dictionary (even date-based versioning) would be an improvement, as it would allow users to know whether their copy of the dictionary was "current," and whether they should upgrade their local copy.
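The date-based versioning suggested above could look something like the following minimal sketch. It is one possible scheme under stated assumptions, not something the repository is known to implement: the stamp layout, the `YYYY.MM.DD` version string, and the function names `build_version_stamp` and `is_outdated` are all hypothetical.

```python
import hashlib
from datetime import date


def build_version_stamp(words, build_date):
    """Describe one build of the word list (hypothetical stamp layout)."""
    # Normalise so that ordering and duplicates do not change the checksum.
    normalised = sorted({w.strip() for w in words if w.strip()})
    payload = "\n".join(normalised).encode("utf-8")
    return {
        # Zero-padded dates compare correctly as plain strings.
        "version": build_date.strftime("%Y.%m.%d"),
        "entries": len(normalised),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def is_outdated(local_stamp, published_stamp):
    """True when the published build is newer than the local copy."""
    return local_stamp["version"] < published_stamp["version"]


local = build_version_stamp(["Escherichia", "coli"], date(2022, 11, 23))
published = build_version_stamp(["Escherichia", "coli", "Bacillus"], date(2023, 5, 1))
print(local["version"], "->", published["version"], "outdated:", is_outdated(local, published))
```

The checksum is included so that two copies carrying the same date can still be told apart if their contents differ.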

Please rate the manuscript for methodological rigour

Poor

Please rate the quality of the presentation and structure of the manuscript

Poor

To what extent are the conclusions supported by the data?

Partially support

Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

No

Is there a potential financial or other conflict of interest between yourself and the author(s)?

No

If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

Yes

Access Microbiology
Dec 5, 2022

Comments to Author

Taxonomy Dictionary

This short manuscript describes a lexical application which can be loaded into packages such as Word to help ensure text contains correctly spelt taxonomic names of microbes. I cannot comment on the computational infrastructure used to create the tool. However, the tool itself may be of use to users in the microbiology community and so I am happy to recommend publication. My only concern is that the tool as presented here seems to be a 'single shot' collation and filtering of names from the key databases. However, these databases expand by 1000s of names per annum. It would be interesting to know if there are plans to make this an iterative resource, i.e., will periodic updates from the databases be incorporated (in addition to the manual updates hinted at in line 78)?

Minor comments:

Lines 22 and 67: 1.412.046 might be clearer as "1.41 million" or "1,412,046" (as in line 55).
Line 34: "public available, links" should be "publicly available; links".
Line 49: "aspect are" should be "aspect is".
Line 50: "process have" should be "process has".
Line 56: "major" would read better than "biggest".
Lines 59-61: "and fungi - that being; International… MycoBank [6] have been added." would read better as "and fungi have been added i.e., International… MycoBank [6]."
Line 74: "autosuggestions are not always on spot." is a little vague. Perhaps "autosuggestions may be subject to error,".
Line 78: should read "and I will try to".

Please rate the manuscript for methodological rigour

Good

Please rate the quality of the presentation and structure of the manuscript

Satisfactory

To what extent are the conclusions supported by the data?

Strongly support

Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

No

Is there a potential financial or other conflict of interest between yourself and the author(s)?

No

If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

Yes

Version published to 10.1099/acmi.0.000521.v1 on Access Microbiology
Nov 23, 2022
