The LOTUS initiative for open knowledge management in natural products research

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Rutz et al. outline LOTUS, a new open-source database that links natural product structures with the organisms they are present in. It contains over 700,000 referenced structure-organism pairs and search tools that make mining the database intuitive and efficient. The LOTUS Initiative comprises an important data harmonization/integration effort over previous databases. The results are distributed to the public through Wikidata, which additionally supports future curation. This new resource is likely to be of great interest to natural product researchers as well as across fields of biology including ecology, evolution, and biochemistry.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

This article has been Reviewed by the following groups

Abstract

Contemporary bioinformatic and chemoinformatic capabilities hold promise to reshape knowledge management, analysis and interpretation of data in natural products research. Currently, reliance on a disparate set of non-standardized, insular, and specialized databases presents a series of challenges for data access, both within the discipline and for integration and interoperability between related fields. The fundamental elements of exchange are referenced structure-organism pairs that establish relationships between distinct molecular structures and the living organisms from which they were identified. Consolidating and sharing such information via an open platform has strong transformative potential for natural products research and beyond. This is the ultimate goal of the newly established LOTUS initiative, which has now completed the first steps toward the harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs. LOTUS data is hosted on Wikidata and regularly mirrored on https://lotus.naturalproducts.net . Data sharing within the Wikidata framework broadens data access and interoperability, opening new possibilities for community curation and evolving publication models. Furthermore, embedding LOTUS data into the vast Wikidata knowledge graph will facilitate new biological and chemical insights. The LOTUS initiative represents an important advancement in the design and deployment of a comprehensive and collaborative natural products knowledge base.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    This manuscript addresses a major issue facing consumers of structure-organism pair data: the landscape of databases is very difficult to navigate due to the way data is made available (many resources do not have structured data dumps) and the way data is standardized (many resources' structured data dumps do not standardize their nomenclature or use stable entity identifiers). The solution presented is a carefully constructed pipeline (see Figure 1) for importing data, harmonizing/cleaning it, automating decisions about exclusions, and reducing redundancy. The results are disseminated through Wikidata to enable downstream consumption via SPARQL and other standard access methods as well as through a bespoke website constructed to address the needs of the natural products community. The supplemental section of the manuscript provides a library of excellent example queries for potential users. The authors suggest that users may be motivated to make improvements through manual curations on Wikidata, through semi-automated and automated interaction with Wikidata mediated by bots, or by addition of importer modules to the LOTUS codebase itself.

    Despite the potential impact of the paper and excellent summary of the current landscape of related tools, it suffers from a few omissions and tangents:

    1. It does not cite specific examples of downstream usages of structure-organism pairs, such as an illustration on how this information in both higher quantity and quality is useful for drug discovery, agriculture, artificial intelligence, etc. These would provide a much more satisfying bookend to both the introduction and conclusion.

    Thank you for this remark. We deliberately decided not to insist too heavily on application examples of the LOTUS outputs. Indeed, we are somewhat biased by our main field of investigation, natural products chemistry, and expect that the dissemination of specialized metabolite occurrences will benefit a wide range of scientific disciplines (ecology, drug discovery, chemical ecology, ethnopharmacology, etc.).

    However, Figure 5 was established to illustrate how the information available through LOTUS is quantitatively (size) and qualitatively (color classes) superior to what is available through single natural products resources.

    As added in the introduction, one of the downstream usages of those pairs is for example to perform taxonomically informed scoring as described in https://doi.org/10.3389/fpls.2019.01329. Obtaining an open database of natural products’ occurrences to fuel such taxonomically informed metabolite annotation tools was the initial impulse for us to build LOTUS. These metabolite annotation strategies, tailored for specialized metabolites, have been shown to offer appreciable performance improvements for current state-of-the-art computational metabolite annotation tools. Since metabolite annotation is still regularly cited as “the major bottleneck” in metabolomics in the scientific literature over the last 15 years (https://europepmc.org/article/med/15663322, https://doi.org/10.1021/acs.analchem.1c00238), any tangible improvement in this field is welcome. With LOTUS we offer a reliable and reusable structures-organisms data source that can be exploited by the community to tackle such issues of importance.

    Other possible usages are suggested in the conclusion, but benchmarking or even exemplifying such uses is clearly out of the scope of this paper, each one of them being an article per se.

    The additional queries are written in our first answer (see “essential revisions”) and demonstrate the impact of LOTUS on accelerating the initial bibliographic survey of chemical structures occurrences over the tree of life.

    This query (https://w.wiki/4VGC) can be compared to a literature review work, such as https://doi.org/10.1016/j.micres.2021.126708. In seconds, it allows retrieving a table listing compounds reported in given taxa and limits the search by years.
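    As a rough illustration of this kind of lookup, the sketch below builds a SPARQL query for compounds reported in a taxon and its descendants, limited to a range of publication years. It is not the query behind https://w.wiki/4VGC. It assumes that structure-organism pairs are stored as "found in taxon" (P703) statements whose references use "stated in" (P248), and the function names `build_query` and `run_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata SPARQL endpoint


def build_query(taxon_name: str, year_from: int, year_to: int, limit: int = 100) -> str:
    """Build a SPARQL query for compounds reported in a taxon or its descendants."""
    # Escape backslashes and quotes so the name is a valid SPARQL string literal.
    safe_name = taxon_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT DISTINCT ?compound ?inchikey ?taxon ?reference ?date WHERE {{
  ?root wdt:P225 "{safe_name}" .           # taxon name
  ?taxon wdt:P171* ?root .                 # the taxon itself or any descendant
  ?compound wdt:P235 ?inchikey ;           # InChIKey
            p:P703 ?statement .            # "found in taxon" statement (assumed model)
  ?statement ps:P703 ?taxon ;
             prov:wasDerivedFrom/pr:P248 ?reference .  # "stated in" reference
  ?reference wdt:P577 ?date .              # publication date
  FILTER(YEAR(?date) >= {int(year_from)} && YEAR(?date) <= {int(year_to)})
}}
LIMIT {int(limit)}
""".strip()


def run_query(query: str) -> list:
    """Send the query to the Wikidata endpoint and return the result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "lotus-example-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


if __name__ == "__main__":
    # "Gentiana" is an arbitrary example taxon; the years are arbitrary too.
    print(build_query("Gentiana", 2014, 2019, limit=10))
```

    Calling `run_query(build_query(...))` requires network access. On large taxa the transitive parent-taxon path may be slow, so a small `limit` is a sensible starting point.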

    2. The mentions of recently popular buzzwords FAIR and TRUST should be better qualified and be positioned as a motivation for the work, rather than a box to be checked in the modern publishing climate.

    It is true that the modern publishing system certainly suffers from some drawbacks (also critically mentioned within the paper). However, after consultation of all authors, we believe that because LOTUS checks both boxes of FAIR and TRUST, we would rather stick to these two terms. In our view, rules 1 (Don’t reinvent the wheel) and 5 (put yourself in your user’s shoes) of https://doi.org/10.1371/journal.pcbi.1005128 apply here. Both terms are indeed commonly (mis-)used but we felt that redefining other complicated terms would not help the reader/user.

    3. The current database landscape really is bad; and the authors should feel emboldened to emphasize this in order to accentuate the value of the work, with more specific examples on some of the unmaintained databases.

    We fully agree with this statement, and improving this landscape is the central motivation of the LOTUS initiative. It was a deliberate choice not to emphasize how bad the current landscape is, but rather to focus on better habits for the future. We do not want to devalue other resources and elevate our initiative at the cost of others. We also believe that an attentive look at the complexity of the LOTUS gathering, harmonization, and curation speaks for itself and describes the huge efforts required to access properly formatted natural products occurrence data.

    If the reviewer and editors insist, although not in our scope, we are happy to list a series of specific (but anonymized) examples of badly formatted entries, of wrong structures-organisms associations, or poorly accessible resources.

    4. While the introduction and supplemental tables provide a thorough review of the existing databases, it eschews an important more general discussion about data stewardship and maintenance. Many databases in this list have been abandoned immediately following publication, have been discontinued after a single or limited number of updates, or have been decommissioned/taken down. This happens for a variety of reasons, from the maintainer leaving the original institution, from funding ending, from original plans to just publish then move on, etc. The authors should reflect on this and give more context for why this domain is in this situation, and if it is different from others.

    We do agree with the reviewer and added a “status” column in the table (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). We chose 4 possible statuses:

    • Maintained (self-explanatory)
    • Unmaintained: the database did not see any update in the last year.
    • Retired: the authors stated they will not maintain the database anymore.
    • Defunct: the database is not accessible anymore
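    The four statuses can be read as a small decision rule. The sketch below is one possible encoding, not the procedure used to fill the table: the precedence (defunct, then retired, then unmaintained) and the 365-day cut-off for "the last year" are assumptions, and the names `Status` and `classify` are hypothetical.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    MAINTAINED = "maintained"
    UNMAINTAINED = "unmaintained"
    RETIRED = "retired"
    DEFUNCT = "defunct"


def classify(accessible: bool, retired_by_authors: bool,
             last_update: Optional[date], today: date) -> Status:
    """Assign one of the four statuses to a database (assumed precedence)."""
    if not accessible:
        # The resource can no longer be reached at all.
        return Status.DEFUNCT
    if retired_by_authors:
        # The authors stated they will not maintain it anymore.
        return Status.RETIRED
    if last_update is None or today - last_update > timedelta(days=365):
        # No known update within the last year.
        return Status.UNMAINTAINED
    return Status.MAINTAINED
```

    For example, `classify(True, False, date(2020, 1, 1), date(2021, 12, 1))` yields `Status.UNMAINTAINED` under these assumptions.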

    As for question 3 above, we decided not to focus too heavily on the negative points and instead summarize the current situation in the table above. The reasons why database publishing is in this situation are multiple, and we think they are well summarized in https://doi.org/10.1371/journal.pcbi.1005128 (Rule 10: Maintain, update, or retire), already cited in the manuscript introduction.

    5. Related to data stewardship: the LOTUS Initiative has ingested several databases that are no longer maintained as well as several databases with either no license or a more restrictive license than the CC0 under which LOTUS and Wikidata are distributed. These facts are misrepresented in Supplementary Table 1 (Data Sources List), which links to notes in one of the version controlled LOTUS repositories that actually describes the license. For example, https://gitlab.com/lotus7/lotus-processor/-/blob/8b60015210ea476350b36a6e734ad6b66f2948bc/docs/licenses/biofacquim.md states that the dataset has no license information. First, the links should be written with exactly what the licenses are, if available, and explicitly state if no license is available. There should be a meaningful and transparent reflection in the manuscript on whether this is legally and/or scientifically okay to do - especially given the light that many of these resources are obviously abandoned.

    This point is a very important one. We did our best to be as transparent as possible in our initial table. Following the reviewer’s suggestion, we updated it to better reflect the licensing status of each resource (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). We removed the generic “license” header, which could indeed be misleading, and replaced it with “licensing status”, filled with the attributed license type and a hyperlink to its content. This remains challenging since some resources changed their copyright in the meantime. We remain at the editor’s and reviewers’ disposal for any further improvement.

    Moreover, as stated in the manuscript, we took care to collect all licenses and contacted the authors of resources whose license was not perfectly explicit to us, thereby accomplishing our due diligence. Additionally, we contacted the legal offices of our University and explained our situation. We did everything that we had been advised to do.

    1. To the best of our knowledge, the dissemination of the LOTUS initiative data falls under the Right to quote for scientific articles, as we do not share the whole information, but only a very small part.

    2. We do not redistribute original content. What comes out of LOTUS has undergone several curation and validation steps, adding value to the original data. The 500 random test entries, provided in their original form for the sake of reproducibility and testing, are the only exception.

    Many scientific authors forget about the importance of proper licensing. While it might be deliberate to restrict the use, inappropriate license choice (or omission) is too often due to a lack of information on its implication.

    All authors of the utilized resources can freely benefit from our curation. We are sharing with the community the results of our work, while always citing the original reference.

    Concerning the possible evolution of licensing, it remains a real challenge. While we tried to “freeze” the license status when we accessed the data, some resources have updated their licensing since then. This can be tracked in the git history of the table (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). Discrepancies between our frozen licensing (at the time of gathering) and the actual license can therefore occur. Initiatives such as https://archive.org/web could help solve this issue, although they come with other legal challenges.

    6. The order of sections of the manuscript results in several duplicated, but not further substantiated explanations. Most importantly, the methods should be much more specific throughout and the results/discussion should more heavily cross-link to it, as a reader who examines the paper from top to bottom will be left with large holes of misunderstanding throughout.

    As our paper focuses a lot on the methods, the barrier between results & methods becomes thinner. We took into account the reviewers’ suggestions and added some additional cross-links for the reader to be able to quickly access related methods.

    7. The work presented was done in a variety of programming languages across a variety of repositories (and even version control systems), making it difficult to give a proper code review. It could be argued that the most popular language in computational science at the moment is Python, with languages like R, Bash, and in some domains, still, Java maintaining relevance. The usage of more esoteric languages (again, with respect to the domain) such as Kotlin hampers the ability for others to deeply understand the work presented. Further, as the authors suggest additional importers may be implemented in the future, this restricts what external authors may be able to contribute.

    Scientific software has indeed always been written in multiple languages. To this day, scientists have used all kinds of languages adapted both to their needs and their knowledge. Numpy uses Fortran libraries and many projects published in biology and chemistry recently are in Java, R, Python, C#, PHP, Groovy, Scala… We understand that some authors are more comfortable with one language or another. But R syntax is for example much more distant from Python's syntax than Kotlin can be. We needed a highly performant language for some parts of the pipeline and R, Bash, or Python were not sufficient. We decided to use Kotlin as it provides an easier syntax than Java while staying 100% compatible with it.

    The advantage of the way LOTUS is designed is that importers are language-agnostic. As long as the program can produce a file or write to the DB in the accepted format, it can be integrated into the pipeline. This was our goal from the beginning, to have a pipeline that can have its various parts replaced without breaking any of the processes.
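    To make the language-agnostic idea concrete, the sketch below shows what a minimal external importer could look like when its only contract is "write a file in the accepted format". The actual accepted format is defined by the LOTUS processor and is not reproduced here: the three column names, the tab-separated layout, and the function name `write_pairs` are all assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

# Hypothetical column names; the real accepted format is defined by the LOTUS processor.
COLUMNS = ("structure", "organism", "reference")


def write_pairs(rows: Iterable[Mapping[str, str]], out: TextIO) -> int:
    """Write complete structure-organism-reference rows as TSV; return the row count."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in rows:
        # Skip incomplete records rather than emit partial pairs.
        if all(row.get(col, "").strip() for col in COLUMNS):
            writer.writerow({col: row[col].strip() for col in COLUMNS})
            written += 1
    return written


if __name__ == "__main__":
    buffer = io.StringIO()
    # Placeholder record for illustration only.
    write_pairs([{"structure": "CCO",
                  "organism": "Saccharomyces cerevisiae",
                  "reference": "placeholder-reference"}], buffer)
    print(buffer.getvalue())
```

    Any program, in any language, that produces the same file could replace this importer without touching the rest of the pipeline, which is the point made above.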

    8. As a follow up to the woes of point 4., 5., and 7., the manuscript fails to reflect on the longevity of the LOTUS Initiative. Like many, will the project effectively end upon publication? If not, what institutions will be maintaining it for how long, how actively, and with what funding source? If these things are not clear, it only seems fair to inform the reader and potential user.

    LOTUS is an initiative that aims to improve knowledge management and sharing in natural products research. Our first project, which is the object of the current manuscript, is to provide a free and open resource of natural products occurrences for the scientific community. Its purpose is not to be a database by itself, but instead to provide through Wikidata and associated tools a way to access natural products knowledge. The objective was not to create yet another database (https://doi.org/10.1371/journal.pcbi.1005128), but instead to remove this need and give our community the tools and the power to act on its knowledge. This way, as everything is on Wikidata, the initiative is not “like many”. This also means that this project should not be considered and evaluated exactly like a classical DB. Once the initial curation, harmonization, and dissemination jobs have been done, they should ideally not be run again. The community should switch to Wikidata as a point of access, curation, and addition of data. If viewed with such arguments in mind, yes, LOTUS can live long!

    Wikimedia is a public not-for-profit organization, whose financial development appears to indicate solid health https://en.wikipedia.org/wiki/Wikimedia_Foundation#Finances.

    In terms of funding sources, we would like to refer to https://elifesciences.org/articles/52614#sa2 , which stated the following in response to a similar question: "Wikidata is sustained by funding streams that are different from the vast majority of biomedical resources (which are mostly funded by the NIH). Insulation from the 4-5 year funding cycles that are typical of NIH-funded biomedical resources does make Wikidata quite unique." The core of the Wikidata funding streams are donations to the Wikipedia ecosystem. These donations - with a contributor base of millions of donors from almost any country in the world, chipping in at an average order of magnitude of around 10 dollars - are likely to continue as long as that ecosystem is useful to the community of its users. See https://wikimediafoundation.org/about/financial-reports for details.

    9. Overall, there were many opportunities for introspection on the shortcomings of the work (e.g., the stringent validation pipeline could use improvement). Because this work is already quite impactful, I don't think the authors will be opening themselves to unfair criticism by including more thoughtful introspection, at minimum, in the conclusions section.

    We agree with the reviewer and therefore list again the major limitations of our processing pipeline:

    First, our processing pipeline is heavy. It includes many dependencies and takes a lot of time to understand. We are aware of this issue and tried to simplify it as much as possible while keeping what we considered necessary to ensure high data quality. Second, it can sometimes induce errors. Those errors, ranging from unnecessarily discarded correct entries to more problematic ones, can be attributed to various parameters, reflecting the variety of our input. We will therefore try to list them, keeping in mind that the list will not be exhaustive. For each detected issue, we tried to fix it as best we could, knowing that this will not lead to an ideal result but will hopefully increase data quality gradually.

    ● Compounds

    ○ Sanitization (the three steps below are performed automatically since we observed a higher ratio of incorrect salts, charged or dimerized compounds. However, this also means that true salts, charged or dimeric compounds were erroneously “sanitized”.)

    ■ Salt removals

    ■ Charged molecules

    ■ Dimers

    ○ Translation (both processes below are pretty error-prone)

    ■ Name to structure

    ■ Structure to name

    ● Biological organisms

    ○ Synonymy

    ■ Lotus (https://www.wikidata.org/wiki/Q3645698, https://www.wikidata.org/wiki/Q16528).

    This is also one of the reasons why we decided to call the resource Lotus, as it illustrates part of the problem.

    ■ Iris (https://www.wikidata.org/wiki/Q156901, https://www.wikidata.org/wiki/Q2260419)

    ■ Ficus variegata (https://www.wikidata.org/wiki/Q502030, https://www.wikidata.org/wiki/Q5446649)

    ○ External and internal dictionaries are not exhaustive, impacting translation

    ○ Some botanical names we use might no longer be the accepted ones, because of the tools we use and the pace at which taxonomy renames taxa.

    ● References

    ○ The tool we favored, Crossref, returns a hit whatever the input. This generates noise and incorrect translations, which is why our filtering rules focus on reference types.

    ● Filtering rules:

    ○ Limited validation set, requires manual validation

    ○ Validates some incorrect entries (False positives)

    ○ Does not validate some correct entries (False negatives)

    Again, our processing pipeline removes entries we do not yet know how to process properly.

    Our filters are restrictive, but our substantial upload of structure-organism pairs to Wikidata should hopefully incentivize the community to contribute by adding its own human-validated data.

    We updated the conclusion part of the manuscript accordingly. See https://github.com/lotusnprod/lotus-manuscript/commit/a866a01bad10dfd8b3af90e2f30bb3ae51dd7b9e.
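    The sanitization trade-off listed above (automatic salt removal also strips true salts) can be shown with a deliberately naive sketch. This is not the LOTUS implementation, which is not described here; a real pipeline would rely on a cheminformatics toolkit. The heuristic below simply splits a SMILES string on "." and keeps the longest fragment, and the function name `strip_counterions` is hypothetical.

```python
def strip_counterions(smiles: str) -> str:
    """Keep the longest dot-separated SMILES fragment (naive salt removal)."""
    # In SMILES, "." separates disconnected components such as counterions.
    fragments = [fragment for fragment in smiles.split(".") if fragment]
    if not fragments:
        return ""
    # String length is only a crude proxy for fragment size.
    return max(fragments, key=len)


if __name__ == "__main__":
    # Sodium acetate: the sodium counterion is dropped.
    print(strip_counterions("CC(=O)[O-].[Na+]"))
    # Choline chloride is a genuine salt, yet the chloride is dropped as well.
    print(strip_counterions("C[N+](C)(C)CCO.[Cl-]"))
```

    The second call shows the failure mode: a compound that is legitimately reported as a salt is "sanitized" in the same way as an artefactual one, so the rule improves the average entry at the cost of some correct ones.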

    Reviewer #2 (Public Review):

    Rutz et al. introduce a new open-source database that links natural products structures with the organisms they are present in (structure-organism pairs). LOTUS contains over 700,000 referenced structure-organism pairs, and their web portal (https://lotus.naturalproducts.net/) provides a powerful platform for mining literature for published data on structure-organism pairs. Lotus is built within the computer-readable Wikidata framework, which allows researchers to easily contribute, edit and reuse data within a clear and open CC0 license. In addition to depositing the database into Wikidata, the authors provide many domain-specific resources, including structure-based database searches and taxon-oriented searches.

    Strengths:

    The Lotus database presented in this study represents a cutting-edge resource that has a lot of potential to benefit the scientific community. Lotus contains more data than previous databases and combines multiple resources into a single resource.

    Moreover, they provide many useful tools for mining the data and visualizing it. The authors were thoughtful in thinking about the ways that researchers could/would use this resource and generating tools to make it easy to use. For example, their inclusion of structure-based searches and multiple taxonomy classification schemes is very useful.

    Overall the authors seem conscientious in designing a resource that is updatable and that can grow as more data become available.

    Weaknesses/Questions:

    1. Overall, I would like to know to what degree LOTUS represents a comprehensive database. LOTUS is clearly the best database to date, but has it reached a point where it is truly comprehensive, and can thus be used for a meta-analysis or as a data source for research questions? Can it truly replace doing a manual literature search/review?

    As highlighted by the reviewer, even if LOTUS might be the most comprehensive natural products occurrence resource at the moment, the true or full comprehensiveness of such a resource will always be limited by the data available in the literature. And the community is far from fully describing the metabolome of living beings. We nevertheless hope that the LOTUS infrastructure will offer a good place to start this ambitious and systematic description process.

    1. Yes, it can serve as a data source for research questions, as exemplified in the query table.

    2. No, it cannot and must not replace manual literature search. Manual literature search is the best option, but it comes at an enormous cost. If the outcome of such a search can be made available to the whole community (e.g., via Wikidata), its value would be even greater. However, LOTUS can expedite a decent part of a manual literature search and free up time to complement this search. See our comment to the editors: “To further showcase the possibilities opened by LOTUS, and also answer the remark on the comprehensiveness of our resource, we established an additional query (https://w.wiki/4VGC). This query is comparable to a literature review work, such as: https://doi.org/10.1016/j.micres.2021.126708. In seconds, it allows retrieving a table listing compounds reported in given taxa and limits the search by years.”

    We added these examples in the manuscript (see https://github.com/lotusnprod/lotus-manuscript/commit/a6ee135b83e56e8e2041d09d7ce2d5b913c1029d)

    2. Data Cleaning & Validation. The manuscript could be improved by adding more details about how and why data were excluded or included in the final upload. Why did only 30% of the initial 2.5 million get uploaded? Was it mostly due to redundant data or does the data mining approach result in lots of missed data?

    The reason for this “low” yield is that we strongly favored quality over quantity (as in the F-score equation with β equal to 0.5, so that more importance is given to precision than to recall). Of course there is redundancy, but most rejected entries were rejected because their confidence level was too low according to the rules we developed. These data are not fully discarded: we keep them for further curation (ideally including the community) before uploading to Wikidata. We adapted the text accordingly.
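    For readers unfamiliar with the weighting mentioned above, the sketch below computes the standard F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. It uses the textbook definition only; the function name `f_beta` and the example values are illustrative and are not taken from the manuscript.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Illustrative values: a precise but incomplete filter vs. the reverse.
    print(f_beta(0.9, 0.5), f_beta(0.5, 0.9))
```

    With beta = 0.5, a filter with high precision and modest recall scores higher than one with the two values swapped, which is the behaviour a quality-over-quantity rule set aims for.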

    3. Similarly, more information about the accuracy of the data mining is needed. The authors report that the test dataset (420 referenced structure-organisms pairs) resulted in 97% true positives, what about false negatives? Also, how do we know that 420 references are sufficiently large to build a model for 2.5M datapoints? Is the training data set sufficiently large to accurately capture the complexities of such a large dataset?

    False negatives are 3%, which is, in our opinion, a fair amount of “loss” given the quality of the data. We actually manually checked 500+ documented pairs, which is more or less the equivalent of a literature review. We were careful to sample the entries in the right proportions, but we cannot (and did not) state that they are enough. We cannot model it either, since the 2.5M+ points have very different distributions in terms of databases, quality, etc. The only “hint” is the similar behaviour among all subsets: the 420 + 100 entries were divided between 3 authors, who obtained similar results.

    4. Data Addition and Evolution: The authors have outlined several mechanisms for how the LOTUS database will evolve in the future. I would like to know if/how their scripts for data mining will be maintained and whether they will continue to acquire new data for the database. To what extent does the future of LOTUS depend on the larger natural products community being aware of the resource and voluntarily uploading to it? Are there mechanisms in place such as those associated with sequencing data and NCBI?

    The programs have not only been maintained but also updated with new possibilities, for example the addition of a “manual mode” allowing users to run the LOTUS processing pipeline on a set of their own entries and make them Wikidata-ready (https://github.com/lotusnprod/lotus-processor/commit/f49e4e2b3814766d5497f9380bfe141692f13f23). We will of course do our best to keep maintaining them, but no one in academia can state that he/she will maintain programs forever. However, the LOTUS initiative hopefully embraces a new way of considering database dynamics. If the repository and website of the LOTUS initiative shut down tomorrow, all the work done will still be available to anyone on Wikidata. Of course, future data addition strongly relies on community involvement. We have already started to advocate for the community to take part in it, ideally in the form of direct upload to Wikidata. At this time, there are no mechanisms in place to push publishing of the pairs on Wikidata (as exist for sequencing or mass spectrometry data), but we will be engaged in pushing in this direction. The initiative needs stronger involvement of the publishing sector (including reviewers) to help change those habits.

    5. Quality of chemical structure accuracy in the database. I would imagine that one of the largest sources of error in the LOTUS database would be due to variation in the quality of chemical structures available. Are all structure-organism pairs based on fully resolved NMR-based structures, or are they based on mass spectral data with no confirmatory information? At what point is a structural annotation accurate enough to be included in the database? More and more metabolomics studies are coming out and many of these contain compound annotations that could be included in the database, but what level (in silico, exact mass database search, or relative to a known standard) is required?

    This is a very interesting point and some databases have this “tag” (NMR, crystal, etc.). We basically rely on original published articles, included in specialized databases. If poorly reported structures have been accepted for publication, labelled as “identified” (and not “annotated”), and the authors publishing the specialized databases overlooked it, we might end up with such structures.

    Here, the Evidence Ontology (http://obofoundry.org/ontology/eco.html) might be a good direction to look at and further characterize the occurrences links in the LOTUS dataset.

    Reviewer #3 (Public Review):

    Due to missing or incomplete documentation of the LOTUS processes and software, a full review could not be completed.

    Some parts of LOTUS were indeed not sufficiently described and we improved both our documentation and accessibility to external users a lot. We thank the reviewer for insisting on this point as it will surely improve the adoption of our tool by the community.

  3. Reviewer #1 (Public Review):

    This manuscript addresses a major issue facing consumers of structure-organism pair data: the landscape of databases is very difficult to navigate due to the way data is made available (many resources do not have structured data dumps) and the way data is standardized (many resources' structured data dumps do not standardize their nomenclature or use stable entity identifiers). The solution presented is a carefully constructed pipeline (see Figure 1) for importing data, harmonizing/cleaning it, automating decisions about exclusions, and reducing redundancy. The results are disseminated through Wikidata to enable downstream consumption via SPARQL and other standard access methods as well as through a bespoke website constructed to address the needs of the natural products community. The supplemental section of the manuscript provides a library of excellent example queries for potential users. The authors suggest that users may be motivated to make improvements through manual curations on Wikidata, through semi-automated and automated interaction with Wikidata mediated by bots, or by addition of importer modules to the LOTUS codebase itself.

    Despite the potential impact of the paper and excellent summary of the current landscape of related tools, it suffers from a few omissions and tangents:

    1. It does not cite specific examples of downstream uses of structure-organism pairs, such as an illustration of how this information, in both higher quantity and quality, is useful for drug discovery, agriculture, artificial intelligence, etc. These would provide a much more satisfying bookend to both the introduction and conclusion.

    2. The mentions of recently popular buzzwords FAIR and TRUST should be better qualified and be positioned as a motivation for the work, rather than a box to be checked in the modern publishing climate.

    3. The current database landscape really is bad, and the authors should feel emboldened to emphasize this in order to accentuate the value of the work, with more specific examples of some of the unmaintained databases.

    4. While the introduction and supplemental tables provide a thorough review of the existing databases, the manuscript omits an important, more general discussion about data stewardship and maintenance. Many databases in this list were abandoned immediately following publication, were discontinued after a single or limited number of updates, or have been decommissioned/taken down. This happens for a variety of reasons: the maintainer leaves the original institution, funding ends, the original plan was simply to publish and then move on, etc. The authors should reflect on this and give more context for why this domain is in this situation, and whether it is different from others.

    5. Related to data stewardship: the LOTUS Initiative has ingested several databases that are no longer maintained, as well as several databases with either no license or a more restrictive license than the CC0 under which LOTUS and Wikidata are distributed. These facts are misrepresented in Supplementary Table 1 (Data Sources List), which links to notes in one of the version-controlled LOTUS repositories that actually describe the license. For example, https://gitlab.com/lotus7/lotus-processor/-/blob/8b60015210ea476350b36a6e734ad6b66f2948bc/docs/licenses/biofacquim.md states that the dataset has no license information. First, the links should be labeled with exactly what the licenses are, if available, and should explicitly state if no license is available. Second, there should be a meaningful and transparent reflection in the manuscript on whether this is legally and/or scientifically acceptable, especially given that many of these resources are obviously abandoned.

    6. The order of sections of the manuscript results in several duplicated, but not further substantiated, explanations. Most importantly, the methods should be much more specific throughout and the results/discussion should cross-link to them more heavily, as a reader who examines the paper from top to bottom will be left with large gaps in understanding throughout.

    7. The work presented was done in a variety of programming languages across a variety of repositories (and even version control systems), making it difficult to give a proper code review. It could be argued that the most popular language in computational science at the moment is Python, with languages like R, Bash, and, in some domains, still Java maintaining relevance. The use of more esoteric languages (again, with respect to the domain) such as Kotlin hampers the ability of others to deeply understand the work presented. Further, as the authors suggest that additional importers may be implemented in the future, this restricts what external authors may be able to contribute.

    8. As a follow-up to the woes of points 4, 5, and 7, the manuscript fails to reflect on the longevity of the LOTUS Initiative. Like many others, will the project *effectively* end upon publication? If not, which institutions will be maintaining it, for how long, how actively, and with what funding source? If these things are not clear, it only seems fair to inform the reader and potential user.

    9. Overall, there were many opportunities for introspection on the shortcomings of the work (e.g., the stringent validation pipeline could use improvement). Because this work is already quite impactful, I don't think the authors will be opening themselves to unfair criticism by including more thoughtful introspection, at minimum, in the conclusions section.

    10. Given the competitive nature of building databases and scientific publishing, it remains to be seen whether new database builders will contribute directly to the LOTUS Initiative, but the system the authors described seems to be prepared to support its maintainers to continue to import new databases as long as they are actively working on the project.

    Overall, this manuscript served as an excellent survey of the landscape of the structure-organism databases and their deep ties to natural product databases, and presented an obviously useful resource that will greatly simplify and improve the lives of other scientists who want to use this kind of data. It had a good focus, met the goals that it set in its abstract and introduction, and described the journey quite elegantly.

  4. Reviewer #2 (Public Review):

    Rutz et al. introduce a new open-source database that links natural product structures with the organisms they are present in (structure-organism pairs). LOTUS contains over 700,000 referenced structure-organism pairs, and its web portal (https://lotus.naturalproducts.net/) provides a powerful platform for mining literature for published data on structure-organism pairs. LOTUS is built within the computer-readable Wikidata framework, which allows researchers to easily contribute, edit and reuse data under a clear and open CC0 license. In addition to depositing the database into Wikidata, the authors provide many domain-specific resources, including structure-based database searches and taxon-oriented searches.

    Strengths:

    The LOTUS database presented in this study represents a cutting-edge resource that has great potential to benefit the scientific community. LOTUS contains more data than previous databases and combines multiple resources into a single resource.

    Moreover, the authors provide many useful tools for mining and visualizing the data. They were thoughtful in considering the ways that researchers could/would use this resource and in generating tools to make it easy to use. For example, their inclusion of structure-based searches and multiple taxonomy classification schemes is very useful.

    Overall the authors seem conscientious in designing a resource that is updatable and that can grow as more data become available.

    Weaknesses/Questions:

    1. Overall, I would like to know to what degree LOTUS represents a comprehensive database. LOTUS is clearly the best database to date, but has it reached a point where it is truly comprehensive, and can thus be used for a meta-analysis or as a data source for research questions? Can it truly replace doing a manual literature search/review?

    2. Data Cleaning & Validation. The manuscript could be improved by adding more details about how and why data were excluded from or included in the final upload. Why did only 30% of the initial 2.5 million get uploaded? Was it mostly due to redundant data or does the data mining approach result in lots of missed data?

    3. Similarly, more information about the accuracy of the data mining is needed. The authors report that the test dataset (420 referenced structure-organism pairs) resulted in 97% true positives; what about false negatives? Also, how do we know that 420 references are sufficient to build a model for 2.5M datapoints? Is the training data set sufficiently large to accurately capture the complexities of such a large dataset?

    4. Data Addition and Evolution: The authors have outlined several mechanisms for how the LOTUS database will evolve in the future. I would like to know if/how their scripts for data mining will be maintained, and whether they will continue to acquire new data for the database. To what extent does the future of LOTUS depend on the larger natural products community being aware of the resource and voluntarily uploading to it? Are there mechanisms in place such as those associated with sequencing data and NCBI?

    5. Quality of chemical structure accuracy in the database. I would imagine that one of the largest sources of error in the LOTUS database would be variation in the quality of the chemical structures available. Are all structure-organism pairs based on fully resolved NMR-based structures, or are they based on mass spectral data with no confirmatory information? At what point is a structural annotation accurate enough to be included in the database? More and more metabolomics studies are coming out, and many of these contain compound annotations that could be included in the database, but what level of evidence (in silico, exact mass database search, or comparison to a known standard) is required?
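
    The sample-size question raised above about the 420-pair test set can be made concrete with a rough calculation. The sketch below is not part of the LOTUS pipeline: it assumes the reported 97% corresponds to roughly 407 correct pairs out of 420, treats the pairs as an independent random sample, and uses a Wilson score interval (a technique chosen here for illustration) to show how much uncertainty such a sample leaves around the true-positive rate. It says nothing about false negatives, which would require a separate sample of excluded pairs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Return the Wilson score interval (low, high) for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: "97% true positives" on 420 pairs is taken as about 407 correct pairs.
n_pairs = 420
n_correct = round(0.97 * n_pairs)
low, high = wilson_interval(n_correct, n_pairs)
print(f"Approximate 95% interval for the true-positive rate: {low:.3f} to {high:.3f}")
```

    Under these assumptions the interval spans a few percentage points, so the estimate is reasonably tight only if the 420 pairs are representative of the full 2.5M entries; a sample stratified by source database would test that more directly.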

  5. Reviewer #3 (Public Review):

    Due to missing or incomplete documentation of the LOTUS processes and software, a full review could not be completed.

    The authors and editors have been provided with specific questions and comments in an effort to resolve apparent documentation issues.