Defining the limits of plant chemical space: challenges and estimations

Abstract

The plant kingdom, encompassing nearly 400,000 known species, produces an immense diversity of metabolites, including primary compounds essential for survival and secondary metabolites specialized for ecological interactions. These metabolites constitute a vast and complex phytochemical space with significant potential applications in medicine, agriculture, and biotechnology. However, much of this chemical diversity remains unexplored, as only a fraction of plant species have been studied comprehensively. In this work, we estimate the size of the plant chemical space by leveraging large-scale metabolomics and literature datasets. We begin by examining the known chemical space, which, while containing at most several hundred thousand unique compounds, remains sparsely covered. Using data from over 1,000 plant species, we apply various mass spectrometry-based approaches (a formula prediction model, a de novo prediction model, a combination of library search and de novo prediction, and MS2 clustering) to estimate the number of unique structures. Our methods suggest that the number of unique compounds in the metabolomics dataset alone may already surpass existing estimates of plant chemical diversity. Finally, we project these findings across the entire plant kingdom, conservatively estimating that the total plant chemical space likely spans millions of compounds, if not more, with the vast majority still unexplored.

Article activity feed

  1. Reviewer name: Kohulan Rajan

    Reviewer Comments: Review: Defining the limits of plant chemical space: challenges and estimations

    This work presents an important contribution to understanding the chemical diversity of plants through a systematic analysis combining metabolomics data and literature mining. The authors address a question in the field and employ multiple complementary approaches to estimate the size of the plant chemical space. Here are a few suggestions and questions for the authors to clarify:

    1. When introducing an abbreviation, consider capitalizing the full term, e.g., "Natural Products (NP)".
    2. There is no list of abbreviations in the document, so please define each abbreviation at first use. Some readers may be unfamiliar with terms such as COCONUT and LOTUS.
    3. Is there any prior work using similar combined metabolomics/literature approaches to estimate plant chemical space? If so, these should be cited. If not, please state this explicitly to highlight the novelty of your method.
    4. Please add a citation for SMILES.
    5. While the paper describes the use of "literature datasets," it appears that only existing databases (COCONUT and LOTUS) are being utilized. It would be helpful if the authors could clarify whether any direct literature mining was conducted. If not, consider revising the terminology to more accurately reflect the use of curated databases rather than primary literature sources.
    6. Great to see the data and code openly shared on both Zenodo and GitHub. I also find the GitHub repository very useful with regard to all the provided notebooks. To maximize reusability, please consider adding a detailed "How to Use" section to the README that guides others in replicating or building upon this work.
    7. The different clustering thresholds (0.7 vs 0.8) lead to notably different estimates. Could you discuss which threshold might be more appropriate for this specific application to plant metabolomics data?
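The sensitivity to the similarity threshold raised in point 7 can be illustrated with a small sketch. It is not the authors' pipeline: the toy "spectra" are hypothetical binned intensity vectors, and both the cosine score and the single-linkage grouping (spectra joined whenever a pair meets the threshold) are assumptions chosen only to show how the cluster count can shift between 0.7 and 0.8.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_clusters(spectra, threshold):
    """Count single-linkage clusters: link every pair with similarity >= threshold."""
    parent = list(range(len(spectra)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(spectra)), 2):
        if cosine(spectra[i], spectra[j]) >= threshold:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(spectra))})


# Hypothetical binned MS2 spectra (illustrative values only).
toy_spectra = [
    [1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0],  # cosine with the first spectrum is about 0.78
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],  # cosine with the third spectrum is about 0.995
]

clusters_at_07 = count_clusters(toy_spectra, 0.7)  # 2 clusters
clusters_at_08 = count_clusters(toy_spectra, 0.8)  # 3 clusters
```

In this toy case the first two spectra merge at 0.7 but separate at 0.8, so the stricter threshold reports one more "unique structure" from the same data.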
  2. Reviewer name: Carlos Rodríguez-López

    Reviewer Comments: In the reviewed manuscript, Chloe Engler Hart et al. use different approaches to estimate the size of plant chemical space through the analysis of publicly available mass spectrometry-based metabolomics datasets. The authors tackle this issue using data from ca. 2,000 LC-MS runs, together with different formula predictors and structure annotation algorithms, and extrapolate to the estimated number of plant species. While the approach is useful for estimating structural variation, and the collected data and the source code published here can certainly be of use to the plant metabolomics community, I consider that the manuscript requires modifications before it can be recommended for publication. In particular, the language of the article should more accurately reflect the nature of this estimate. For example, mentions of the approach being "the most accurate estimate possible" (p.8, section 3.2) are not supported, and throughout the article, mentions of the calculation as a "conservative estimate" are not consistent with the approaches used, beyond formula prediction. For instance, the authors state that the MS2 curve being lower than the formula-prediction curve suggests that the curves may be conservative, without clarifying why this would be the case rather than, e.g., a product of dispersion in the estimates. The authors also argue that, because most of the limitations they identify (Table 2, p. 13) lead to underestimation (again, with limited or no explanation), their estimate is conservative. Since no effect size can be calculated for these limitations, this statement does not hold. For example, if the approach misses half of the molecules due to extraction and another half due to tissue coverage (leaving ¼ in total) but overestimates the plateau of plant chemical diversity by 100-fold, then the effect of the overestimation would dominate by far, even though more factors push toward underestimating the chemical space.
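The reviewer's hypothetical can be worked through numerically. The sketch below uses only the illustrative factors from the example above (½, ½, and 100-fold), which are not measured values, and assumes that the biases combine multiplicatively, as the "¼ in total" in the review implies.

```python
from math import prod

# Illustrative bias factors from the reviewer's example (not measured values).
# A factor below 1 means the estimate is too low; above 1 means it is too high.
bias_factors = {
    "extraction losses": 0.5,
    "tissue coverage": 0.5,
    "plateau overestimate": 100.0,
}

# Assumption: independent biases multiply.
net_bias = prod(bias_factors.values())  # 0.5 * 0.5 * 100 = 25.0

n_under = sum(f < 1 for f in bias_factors.values())  # 2 downward biases
n_over = sum(f > 1 for f in bias_factors.values())   # 1 upward bias
```

Two of the three factors bias the estimate downward, yet the net result is a 25-fold overestimate, so counting the direction of the limitations says nothing about whether the final figure is conservative.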
    I recommend that the authors replace mentions of this estimate being a conservative approach with a clear statement that this is a fragmentation-based estimate, or a similar term that better reflects the nature of the figure. Similarly, the assumptions of the models should be stated explicitly, along with their limitations. The authors, for example, rely on CID-induced fragmentation, and they mention that the estimate "[relies] on the predominant adduct ([M+H]+)" (p.15) and thus "this likely underestimates the true chemical diversity, as other adduct forms" (p.15). It should be stated that this is an assumption: the authors do not have evidence that the adducts are [M+H]+, which is nigh impossible to establish with the available data; they are assuming that all features are [M+H]+ adducts. This carries the implicit assumption that fragmentation mechanisms are the same for all MS2 spectra and thus that structural diversity can be estimated through MS2 clusters. It is unclear how this would yield an underestimation, as the authors claim; rather, it would yield an overestimation, because [M+H]+ and, e.g., [M+Na]+ adducts of the same molecule produce different fragmentation patterns, given that the former favors charge-migration-dependent mechanisms more than the latter. Thus, since the authors consider all features to be [M+H]+, two adducts of the same molecule might be counted as different molecules because their fragmentation patterns differ, even though no structural difference exists. In the same vein, since the similarity thresholds of the MS2Mol algorithm are essential for the estimation of diversity, the authors should clearly state in the text, not by reference, how they are calculated, along with potential limitations. Finally, I believe the work would greatly benefit from including data on the phylogenetics of the samples, adding diversity estimates to the sample and extrapolation data.
    If, for example, most of the 400,000 plant species are phylogenetically distant from the sampled species, then the reader, when presented with that evidence, can reasonably assume that this might be an underestimation of chemical diversity. If, on the other hand, the original sample is more diverse than the full set of plant species, this might not be the case. In any case, all of the relevant assumptions should be clearly stated. Minor note: One of the main arguments for extrapolating the diversity estimate to the rest of the plants comes from Figure 3D, where the number of MS1 adducts increases with the number of samples; it would greatly help to explain the differences seen between species if the authors clarified the tissues sampled per species. For example, the species whose number of features only doubles may contain only aerial and vegetative tissue, whereas the species whose number of features increases 6-fold might include root or reproductive tissue. This might also help the authors justify the extrapolation of the estimate.
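The adduct concern raised in this review can also be made concrete at the precursor-mass level. The sketch below is an illustration under stated assumptions, not the authors' workflow: the neutral mass is a hypothetical value, and the only step modeled is treating every observed ion as [M+H]+ when converting m/z back to a neutral mass. The differing fragmentation patterns the reviewer describes are not modeled.

```python
# Monoisotopic masses of the charge carriers in Da (electron mass already removed).
PROTON = 1.007276         # H+
SODIUM_CATION = 22.989218  # Na+


def adduct_mz(neutral_mass, carrier_mass):
    """m/z of a singly charged adduct of a neutral molecule."""
    return neutral_mass + carrier_mass


def neutral_mass_if_protonated(observed_mz):
    """Back-calculate the neutral mass assuming the ion is [M+H]+."""
    return observed_mz - PROTON


# Hypothetical neutral monoisotopic mass, for illustration only.
neutral = 302.0427

observed = [
    adduct_mz(neutral, PROTON),         # [M+H]+
    adduct_mz(neutral, SODIUM_CATION),  # [M+Na]+
]

# Treat both ions as [M+H]+, as the blanket assumption would.
inferred = [neutral_mass_if_protonated(mz) for mz in observed]

# The same molecule now appears under two different neutral masses.
apparent_compounds = len({round(m, 3) for m in inferred})
mass_gap = inferred[1] - inferred[0]  # about 21.98 Da (Na+ minus H+)
```

Here a single molecule yields two apparent neutral masses about 21.98 Da apart, which is the direction of error the reviewer argues for: unrecognized adducts inflate, rather than deflate, the count of distinct compounds.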