METAGENOME-ASSEMBLED GENOMES FROM A POPULATION-BASED COHORT UNCOVER NOVEL GUT SPECIES AND STRAIN DIVERSITY, REVEALING PREVALENT DISEASE ASSOCIATIONS
Abstract
Metagenomic profiling has advanced understanding of microbe-host interactions. However, widely used read-based approaches are limited by incomplete reference databases and the inability to resolve strain-level variation. Here, we present a scalable, genome-resolved framework that integrates population-specific metagenome-assembled genomes (MAGs) to discover novel species, strain diversity, and disease associations. From 1,878 deeply sequenced samples in the Estonian microbiome cohort (EstMB-deep), we reconstructed 84,762 MAGs representing 2,257 species, including 353 (15.6%) previously uncharacterized species reaching up to 30% relative abundance in some individuals. We integrated these MAGs with the Unified Human Gastrointestinal Genome (UHGG) collection to create an expanded reference (GUTrep), enabling profiling of 2,509 EstMB individuals and testing associations with 33 prevalent diseases. Of 25 diseases with significant associations, 8 involved newly identified species, underscoring the value of population-specific MAGs. To quantify within-species diversity, we developed the Strain Richness Index (SRI), a novel MAG-based metric that informed strain-level analyses. Based on SRI, we prioritized Odoribacter splanchnicus, a prevalent species with the lowest strain heterogeneity, yielding sufficient power for strain-level analysis. We identified two dominant strains, N1 and N2, with distinct gene repertoires and divergent disease associations. Notably, strain N1 was negatively associated with gastritis and duodenitis and hypertensive heart disease, associations undetected at the species level. Our study expands the human gut reference landscape, demonstrates the importance of population-specific MAGs for uncovering novel microbial diversity, and reveals strain-level disease associations obscured at higher taxonomic levels, highlighting the need for genome-resolved approaches in microbiome research.
Article activity feed
Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
Reviewer #1
Drawbacks: -While the population-specific approach is a strength, it also limits the direct applicability of findings to other populations.
We thank the Reviewer for highlighting this important question. While we acknowledge this limitation, we would like to emphasize the benefits of a population-specific approach, especially given that human gut microbiome diversity remains underexplored in many populations worldwide. By studying the microbiome of the Estonian population, we contribute to the broader global collection of gut microbial species, helping to address this gap.
Moreover, new microbial species and strains identified in the Estonian population may be relevant for populations with similar environmental and lifestyle factors, such as the Finnish, Baltic, and Nordic populations. These findings can enhance understanding of regionally relevant microbiome characteristics and may serve as a useful reference for studies in these related populations. As more population-based microbiome research is published, it will build a valuable resource for cross-population comparative studies, shedding light on global microbiome diversity and its implications for health.
Lastly, as part of the Estonian Biobank, our primary objective is to advance personalized medicine for the Estonian population. This requires a highly accurate reference for our specific population. We believe our approach not only benefits Estonian healthcare but also provides insights and methodologies that other population biobanks may find valuable as they embark on similar paths toward personalized medicine.
-The study primarily focuses on taxonomic composition at the genus or species level, but a more in-depth functional analysis of the novel species could provide additional insights.
We thank the Reviewer for this valuable suggestion. Functional analysis plays a crucial role in understanding the mechanisms that link the microbiome to human health, which makes it essential. This becomes even more critical when studying newly discovered species. However, before embarking on functional analysis, we believe it is important to emphasize that, while high-quality metagenome-assembled genomes (MAGs) provide valuable insights, they do not match the completeness and accuracy of genomes reconstructed from pure bacterial cultures. Acknowledging this distinction was one of the reasons we decided not to include functional analysis in the original article. With these considerations in mind, we examined the strain structure of the four known species of the Butyricimonas genus. Although our primary interest lies in disease-associated species, the new disease-associated species currently lack a sufficient number of high-quality MAGs. To gain deeper insight, we prioritized a genus that includes new species, so that we could compare the new species with well-defined strains of known species. Among the 758 genera present in our MAG collection, we selected the Butyricimonas genus for the following reasons: (1) it is a well-described genus of gut bacteria, represented by 300 high-quality MAGs in our dataset; (2) it contains four known species along with two newly identified species clusters; and (3) the newly discovered species are prevalent in the human gut microbiome, being detected in more than 50% of samples through mapping.
The following section was integrated in the new paragraph “Genome level analysis of species of interest” on page 6 in the revised version of the manuscript:
“Species-level association studies can help identify candidates for genome-level analysis by exploring strain structure and functional differences. However, such analyses require a large number of high-quality MAGs from the same species, which is only feasible in large cohorts with deep sequencing data. While we currently need more samples to obtain sufficient MAGs for the new disease-associated species, we performed an analysis of the species of the Butyricimonas genus as an example. We show that the assembled MAGs of Butyricimonas species such as B. faecihominis, B. virosa, B. paravirosa and B. faecalis form distinct strains (Figure 4a, Figure 4b, Supplementary results, Supplementary Table S5). After selecting strain representatives, we conducted a pan-genome analysis of species- and strain-representative MAGs, including the two new species. The analysis revealed unique gene clusters consistently present in the new species but absent in all other analyzed species and strains (Figure 4c, Supplementary results, Supplementary Table S6).
Figure 4. *Strain-level structure of the Butyricimonas genus and comparative functional analysis of the new species and strains of known species.* a. Strain structure of the known Butyricimonas species assembled in the Estonian population: B. paravirosa, B. faecalis, B. virosa, and B. faecihominis (based on pairwise ANI comparison). b. Structure of the Butyricimonas genus. Comparisons include all known species of the Butyricimonas genus (species assembled in the Estonian population and publicly available species) and all 4 newly assembled MAGs belonging to the new species. The publicly available Butyricimonas species (B. synergistica, "Candidatus B. faecavium", "Candidatus B. hominis", "Candidatus B. phoceensis", and "Candidatus B. vaginalis") are each represented by a single genome of the type strain (the strain defining the species according to the ICSP). Species assembled from our data are represented by both the type strain and all strain-representative MAGs. ANI values below 95% (indicating that the MAGs belong to different species) are not colored; ANI values of 95–100% are colored in 1% steps. c. Pan-genome analysis of the Butyricimonas genus. The analysis included the same genomes and MAGs as the analysis of the genus structure and revealed a core gene set as well as species-specific gene sets. The two new species clusters (highlighted in green) also exhibit unique species-specific gene sets.”
We have also added Supplementary Results to our paper, providing a more detailed description of the strain structure analysis of Butyricimonas species and the functional analysis of both known and new species. We chose not to include this in the main text to avoid shifting the focus of the paper.
Supplementary results
*Butyricimonas genus species strain-level and functional analysis*
Beyond taxonomic characterisation, it is crucial to understand the functional differences of newly detected species, as this insight is key to understanding the mechanisms that link the microbiome to human health. Reconstructing MAGs from a large cohort provides multiple genomes of the same species, particularly for prevalent species. During our study, we assembled MAGs from 758 different genera, including 358 genera with more than 10 extracted MAGs. Conducting a detailed strain-level and functional analysis of all these genera would require substantial effort. Therefore, we conducted an in-depth strain-level and functional analysis using the genus Butyricimonas as an example. This genus was chosen for the following reasons: (1) it is a well-characterized genus of gut bacteria, represented by 300 high-quality MAGs in our dataset; (2) it includes four known species and two newly identified species clusters; and (3) the newly discovered species have been shown to be prevalent in the human gut microbiome.
Known Butyricimonas species exhibit a clear strain-level structure based on pairwise ANI comparisons (ANI > 99.0%), as calculated using ANIclustermap [19] (Figure 4a). Of the 300 high-quality MAGs selected for strain and functional analysis within the Butyricimonas genus, Butyricimonas paravirosa is represented by 23 MAGs and forms 5 distinct strain clusters. While one large cluster (cluster_id: B30) includes 7 highly similar genomes with ANI values close to 100%, other clusters (B31, B32, B34) exhibit more genomic diversity, with genomes showing ANI values between 99.0% and 99.6%. The final cluster (B33) contains a single MAG, suggesting unique genomic variation. Butyricimonas faecihominis is represented by 65 MAGs and forms 8 distinct strain clusters, exhibiting high genome similarity within each cluster. Butyricimonas virosa is represented by 67 MAGs and forms 14 distinct strain clusters. These strain clusters can be divided into two groups with low similarity between them (ANI values between the groups range from 95.0% to 96.0%, approaching the species boundary). Within each group, the strain clusters also exhibit genomic diversity, indicating substantial variation even among closely related strains. Finally, Butyricimonas faecalis has the highest number of MAGs (141) and shows a clear structure of 5 strain clusters with high similarity within each cluster (Figure SR1).
Figure SR1. *The strain structure of known Butyricimonas species assembled in the Estonian population: B. paravirosa, B. faecalis, B. virosa, and B. faecihominis (ANI index comparison histogram).*
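The way an ANI threshold turns pairwise comparisons into strain or species clusters can be illustrated with a minimal Python sketch. It is not the procedure used by ANIclustermap or dRep: it assumes simple single-linkage grouping, an inclusive threshold, and a hypothetical dictionary of pairwise ANI values. The MAG names and ANI values are invented for illustration.

```python
from itertools import combinations


def cluster_by_ani(genomes, ani, threshold=99.0):
    """Single-linkage grouping: genomes whose pairwise ANI is at or above
    `threshold` end up in the same cluster (directly or via a chain)."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        # ANI may be stored under either ordering of the pair.
        value = ani.get((a, b), ani.get((b, a)))
        if value is not None and value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical MAG names and ANI percentages (not taken from the study).
genomes = ["mag_a", "mag_b", "mag_c", "mag_d"]
ani = {
    ("mag_a", "mag_b"): 99.8,
    ("mag_a", "mag_c"): 96.1,
    ("mag_b", "mag_c"): 96.3,
    ("mag_a", "mag_d"): 94.2,
    ("mag_b", "mag_d"): 94.0,
    ("mag_c", "mag_d"): 93.9,
}

strain_clusters = cluster_by_ani(genomes, ani, threshold=99.0)
species_clusters = cluster_by_ani(genomes, ani, threshold=95.0)
print(strain_clusters)
print(species_clusters)
```

With the 99% threshold, only `mag_a` and `mag_b` share a strain cluster, whereas the 95% threshold also places `mag_c` in the same species cluster and leaves `mag_d` as a separate species.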
In addition to the four known species, we assembled two new species within the Butyricimonas genus. The first new species cluster (id: Bn1) is represented by a single MAG (H0366_Butyricimonas_undS), which serves as the representative genome for this species. The second new species cluster (id: Bn2) comprises three MAGs, with H1068_Butyricimonas_undS designated as the representative genome, selected using dRep. To determine the placement of these new species within the genus, we conducted pairwise genome comparisons based on Average Nucleotide Identity (ANI) between the MAGs of the new species and other species within the Butyricimonas genus. For the known species identified in our population, we selected a representative genome for each strain. The comparisons covered all new species MAGs, the strain-level representative MAGs of the four known species, and type strain genomes (the strain that defines the species according to the ICSP) of other Butyricimonas species that were not present in our cohort, such as Butyricimonas synergistica, "Candidatus Butyricimonas faecavium", "Candidatus Butyricimonas hominis", "Candidatus Butyricimonas phoceensis", and "Candidatus Butyricimonas vaginalis" (Figure 4b). The MAGs from the second new species cluster (Bn2) form a distinct and cohesive group, showing a closer relationship to Butyricimonas paravirosa and Butyricimonas faecihominis. In contrast, the first new species (Bn1), represented by a single MAG, is positioned closer to Butyricimonas virosa. Interestingly, while the ANI between the type strain of Butyricimonas virosa and the Bn1 MAG is less than 95%, certain strains of B. virosa (e.g., strains 3, 6, 7, 9, 10, and 12) show ANI values slightly above 95%, which technically classifies them as the same species.
To explore functional differences between the new species clusters and known species, we performed a pangenomic analysis using the Anvi’o (analysis and visualization platform for ‘omics data) workflow for microbial pangenomics [20]. Because the first new species cluster (id: Bn1) is represented by a single MAG, it is difficult to draw definitive conclusions, even though this MAG contains unique genes not found in any other analyzed genome. The second new species cluster (id: Bn2), consisting of three MAGs, provides clearer insights. All three MAGs within this cluster share 183 unique genes that are consistently present across the species cluster but absent in all other analyzed species and strains (Figure 4c). The majority of these genes (142 genes, 73.96%) have unknown functions. The genes with defined functions are distributed across various COG categories (Suppl. Table S5, Suppl. Figure SR2), with the top three categories being “Cell wall/membrane/envelope biogenesis”, “General function prediction only”, and “Posttranslational modification, protein turnover, and chaperones”.
Figure SR2. *COG categories for 183 unique genes that are consistently present across the new species MAGs from Butyricimonas genus (cluster id: Bn2) but absent in all other analyzed species and strains.*
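The notion of genes "consistently present across the species cluster but absent in all other analyzed species and strains" amounts to a set operation on a gene-cluster presence/absence table. The sketch below is a simplified illustration in plain Python, not the Anvi’o implementation; the function name, genome names, and gene-cluster identifiers are hypothetical.

```python
def group_specific_gene_clusters(presence, target_genomes):
    """Return gene clusters found in every target genome and in no other genome.

    `presence` maps a genome name to the set of gene-cluster IDs it carries.
    """
    targets = set(target_genomes)
    if not targets:
        raise ValueError("at least one target genome is required")
    missing = targets - set(presence)
    if missing:
        raise KeyError(f"genomes not in presence table: {sorted(missing)}")

    # Gene clusters shared by all genomes of the group of interest.
    shared = set.intersection(*(presence[g] for g in targets))
    # Gene clusters seen in any genome outside the group.
    others = set().union(
        *(genes for g, genes in presence.items() if g not in targets)
    )
    return shared - others


# Hypothetical presence/absence data (not taken from the study).
presence = {
    "new_1": {"GC_core", "GC_x", "GC_y"},
    "new_2": {"GC_core", "GC_x", "GC_y", "GC_z"},
    "new_3": {"GC_core", "GC_x", "GC_y"},
    "known_1": {"GC_core", "GC_y"},
    "known_2": {"GC_core", "GC_z"},
}

unique = group_specific_gene_clusters(presence, ["new_1", "new_2", "new_3"])
print(unique)
```

Here `GC_x` is the only gene cluster carried by all three hypothetical new-species genomes and by none of the known-species genomes. With a single genome in the group, every gene it alone carries would qualify, which is why a cluster represented by one MAG supports weaker conclusions.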
Undoubtedly, further research is needed to understand the role of newly identified species in the human microbiome and to determine whether strain-level differences influence bacterial interactions with the gut and their overall impact. However, our current analysis has already significantly expanded our knowledge of the diversity within this genus. It has added two new species to the ten previously described and revealed the strain structure of known species within the Estonian population.
-Is it possible for this large dataset to distill information and have plots for strain diversity of abundant and prevalent species, including low abundance species per donor or between donors? Can authors add such a plot or discuss this?
We thank the Reviewer for this insightful question. Strain-level analysis holds significant potential and is one of the key reasons to use the genome assembly approach rather than relying on microbiome community profiling with existing human gut species databases. To demonstrate how this can be applied in large datasets like ours, we focused on the same *Butyricimonas* genus selected for functional analysis. We believe that combining strain-level and functional analyses provides a more comprehensive understanding than either analysis alone.
The following section has been incorporated into a new paragraph, “Genome-Level Analysis of Species of Interest,” on page 6 of the revised manuscript, and in-depth analysis has been included in the Supplementary Results. As this section has already been cited in a previous response (due to its logical connection with the functional analysis of the new species), we will not cite it again here. Please refer to the previous answer for further details.
-While associations between microbes and diseases were found, the study design cannot establish causal relationships. Are the authors planning to test some of the associations experimentally and see whether these observations work in vitro or in vivo?
We agree that establishing causal relationships is crucial. This was beyond the scope of the current study, which is intended as a foundational step for future investigations. However, the samples are stored in the Estonian Biobank in a way that allows culturomic studies and follow-up experiments, as done by Krigul et al. [1].
[1] Krigul KL, Feeney RH, Wongkuna S, Aasmets O, Holmberg SM, Andreson R, Puértolas-Balint F, Pantiukh K, Sootak L, Org T, Tenson T, Org E, Schroeder BO. A history of repeated antibiotic usage leads to microbiota-dependent mucus defects. Gut Microbes. 2024 Jan-Dec;16(1):2377570. doi: 10.1080/19490976.2024.2377570.
Minor comments:
- The authors could provide more context on how their findings compare to similar studies in other populations. What are the differences and similarities, and how does this work at the next level and set new directions?
We thank the Reviewer for this suggestion. We provided a summary of other population cohorts in the Introduction (Lines 79–90). Since MAG recovery from large cohorts is a relatively new approach, there are limited opportunities for direct comparison. However, we did note a decreasing number of newly recovered species in our study compared to previous studies (Lines 274–290).
- Figures' quality and readability can be improved easily; all of them are low resolution, and the axes are hardly visible, particularly Figure 2, which could benefit from additional labeling or explanations in the legend to improve clarity.
We apologize for the quality issues with the figures. We completely revised Figure 2 and provided a new higher-resolution version, ensuring that the axes and details are clearly visible.
Summary of changes: (1) we introduced a new Figure 2a to showcase the phylogenetic diversity of the recovered species and highlight the position of the newly assembled species identified for the first time in this study; (2) we updated Figure 2b: the initial figure presented a single line, whereas the revised figure plots five lines, each obtained by altering the order of the samples. Since the order of the samples is not significant, this modification gives a clearer representation of the overall trend of new species accumulation; (3) we added a new Figure 2c to address the question about the range of diversity of detected species; and (4) we moved the original Figures 2a and 2d to the Supplementary Figures to enhance clarity and relevance (Figure S4 and Figure S6, respectively).
“Figure 2. Overview of species from the EstMB MAG collection. a. Phylogenetic tree of the Estonian species representative MAGs. The inner circle displays a phylogenetic tree of species cluster representative MAGs, with branches colored according to their assigned phylum in the Genome Taxonomy Database (GTDB) (see color text). The surrounding ring highlights MAGs that represent novel species assembled in the current study, using the same colors as in the inner circle to indicate the phylum to which each new species belongs (see color text). b. The relationship between the number of samples analyzed and the cumulative number of new species identified. c. Distribution of the number of species detected by mapping per sample, “species hits” (yellow violin plot), and the number of recovered MAGs per sample (blue violin plot) among the Estonian representative MAGs. d. Number of recovered species (blue dots) and species detected by mapping the reads against the EstMB MAG collection (yellow dots) for each sample. Samples are sorted from the highest to the lowest number of recovered MAGs. e. The prevalence and number of recovered MAGs per species. The top 10 species with the highest number of recovered MAGs are shown. Blue bars represent the number of samples where MAGs of the species were recovered, while gray bars show the species prevalence in EstMB. f. The prevalence and number of recovered MAGs per new species. The top 10 new species with the highest number of recovered MAGs are shown. Green bars represent the number of samples where MAGs of the new species were recovered, while gray bars show the new species prevalence.”
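The accumulation curves in Figure 2b, redrawn for several sample orderings, can be sketched in a few lines of Python. This is only an illustration of the idea and not the authors' plotting code: the function names, the random seed, and the toy sample-to-species mapping are assumptions made for the example.

```python
import random


def accumulation_curve(sample_species, order):
    """Cumulative number of distinct new species after each sample in `order`."""
    seen = set()
    curve = []
    for sample in order:
        seen |= sample_species[sample]
        curve.append(len(seen))
    return curve


def shuffled_curves(sample_species, n_orders=5, seed=0):
    """Build one accumulation curve per random ordering of the samples."""
    rng = random.Random(seed)  # fixed seed so the orderings are reproducible
    samples = sorted(sample_species)
    curves = []
    for _ in range(n_orders):
        order = samples[:]
        rng.shuffle(order)
        curves.append(accumulation_curve(sample_species, order))
    return curves


# Hypothetical data: which new species were recovered from each sample.
sample_species = {
    "s1": {"sp_new_1"},
    "s2": {"sp_new_1", "sp_new_2"},
    "s3": set(),
    "s4": {"sp_new_3"},
}

curves = shuffled_curves(sample_species, n_orders=5)
for curve in curves:
    print(curve)
```

Every curve ends at the same total because the set of samples is identical; only the path differs, which is why overlaying several orderings shows the trend more clearly than a single line.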
-A brief discussion on the potential clinical implications of the new species-disease associations would enhance the relevance. Why is discovering new species interesting and relevant for the microbiome field? Can the authors add this somewhere, for example in the Discussion?
We thank the Reviewer for this suggestion. As such, the following section was integrated in the Discussion on page 8 in the revised version of the manuscript:
“Reconstruction of new species and new strains is critical for many aspects of personalized medicine. We can identify three primary applications of the microbiome in personalized medicine: disease risk assessment and prevention, disease diagnosis, and disease treatment. The latter includes approaches such as microbial supplementation, suppression, or metabolite modulation [Karina Ratiner, 2024]. Both disease prevention and diagnosis rely on identifying bacterial biomarkers associated with prevalent or incident disease cases. In our study, an average of 4% of reads belonged to the newly identified species, with a maximum of 34.76%, demonstrating that excluding these species would lead to a significant loss of community diversity. This omission could potentially exclude biomarkers critical for disease prediction and diagnosis. Notably, one-third of the associations between bacterial species and diseases in our analysis involved the newly identified species, further emphasizing their potential importance as biomarkers. For disease treatment, it is crucial to understand the complete microbial diversity to distinguish between beneficial and harmful species. Equally important is knowing the genomic structure of species and strains to develop effective strategies for microbiome modulation. Without genome assembly, we are limited to assumptions based on previously described genomes of related bacteria. However, given the substantial genomic diversity within species, such assumptions may be highly inaccurate, underscoring the importance of genome assembly in advancing microbiome-based interventions.”
- In lines 265-266, the authors discuss detected species per sample, on average, 389 species. Can the authors indicate which plot is linked to it and whether it is possible to show the distribution and median number of species per sample to get an overall idea about the range of diversity this type of analysis can capture now? Maybe this will improve in the future; it is worth mentioning here.
We thank the Reviewer for highlighting the need for clarification. The original Figure 2c displayed the number of species detected through mapping (species hits) and the number of assembled MAGs for each individual sample. To provide a broader characterization of the distribution, we calculated the minimum, mean, median, and maximum values across all samples. As such, the new Figure 2c was added and the following section was integrated into the paragraph “Estimation of species prevalence using population-specific reference” on page 5 of the revised version of the manuscript:
“Distribution of the number of species detected by mapping per sample exhibits a wide range of values, with a maximum of 842 and a minimum of 7, while the mean and median are 399 and 405, respectively. The distribution of numbers of recovered MAGs per sample shows a narrower range, with a maximum of 155 and a minimum of 1, alongside a mean of 45 and a median of 41 (Figure 2c).”
Figure 2c. *Distribution of the number of species detected by mapping per sample, “species hits” (yellow violin plot), and the number of recovered MAGs per sample (blue violin plot).*
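The four summary values reported for each distribution (minimum, mean, median, maximum) can be computed per-sample with the Python standard library. The sketch below uses invented toy counts rather than the study data, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, median


def summarize(counts):
    """Return the minimum, mean, median, and maximum of per-sample counts."""
    if not counts:
        raise ValueError("counts must not be empty")
    return {
        "min": min(counts),
        "mean": mean(counts),
        "median": median(counts),
        "max": max(counts),
    }


# Toy per-sample counts of recovered MAGs (illustrative only).
mags_per_sample = [2, 10, 14, 30]
summary = summarize(mags_per_sample)
print(summary)
```

A mean that sits above the median, as in this toy example, indicates a right-skewed distribution in which a few samples contribute many more MAGs than the typical sample.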
Other comments:
-The key conclusions are generally convincing. The authors have successfully assembled a large number of MAGs from the Estonian population, identified potentially novel species, and established associations between microbial abundance and diseases.
We appreciate the Reviewer's positive feedback on our findings. We are pleased that the significance of our MAG assembly, novel species identification, and disease associations is well-received.
-The data presented appear to support the claims well. However, the authors should emphasize and clarify that the disease associations are correlational, not causal, and further validation is required.
We agree that this is an important point to emphasize. We revised the manuscript to clarify that the disease associations are correlational and emphasize the need for further validation by adding the following section in Discussion on page 8 in the revised version of the manuscript:
“While association does not imply causation, analyzing the association between bacterial species and diseases is a crucial first step in identifying potential biomarkers. This can be followed by meta-analyses across different cohorts and laboratory experiments to validate and confirm the observed effects.”
-Even though I am not an expert in metagenomics analysis, the current experimental design and analysis are sound to support the main claims.
We thank the Reviewer for recognizing the robustness of our experimental design and analysis.
-The methods section can be improved by providing more details about how samples were collected and stored and how long after storage gDNA was extracted and processed for sequencing, allowing for reproducibility. The authors provide information on the bioinformatics pipelines, including software versions and parameters, but this can again be improved by adding details about the steps between sample processing and raw data processing.
We thank the Reviewer for this suggestion and we agree that this is important information. All these details were thoroughly described in our previous paper, which focuses on the description of our cohort (Aasmets, O., Krigul, K.L., Lüll, K., Metspalu, A., and Org, E. (2022). Gut metagenome associations with extensive digital health data in a volunteer-based Estonian microbiome cohort. Nat. Commun. 13, 869. https://doi.org/10.1038/s41467-022-28464-9).
However, to improve accessibility of this information, the following paragraph was integrated in the Methods on page 17 in the revised version of the manuscript:
“Microbiome sample collection and DNA extraction
The participants collected a fresh stool sample immediately after defecation with a sterile Pasteur pipette and placed it inside a polypropylene conical 15 mL tube. The participants were instructed to time their sample collection as close as possible to the visiting time in the study centre. The samples were stored at −80 °C until DNA extraction. The median time between sampling and arrival at the freezer in the core facility was 3 h 25 min (mean 4 h 34 min), and the transport time was not significantly associated with either alpha diversity (Spearman correlation, p-value 0.949 for observed richness and 0.464 for Shannon index) or beta diversity (p-value 0.061, R-squared 0.0005). Microbial DNA extraction was performed after all samples were collected using a QIAamp DNA Stool Mini Kit (Qiagen, Germany). For the extraction, approximately 200 mg of stool was used as starting material, according to the manufacturer’s instructions. DNA was quantified from all samples using a Qubit 2.0 Fluorometer with a dsDNA Assay Kit (Thermo Fisher Scientific).”
-The study includes a large cohort (1,878 samples), which provides statistical power. The statistical analyses, including linear regression models adjusted for BMI, gender, and age, seem appropriate for the type of data presented. I suggest adding a separate paragraph about how the data is processed and statistically analyzed.
Authors should include:
- Appropriateness of the statistical tests used for the data types and experimental designs
- Adequate description and justification of the statistical models and test and assumptions
- Proper handling of replicates, controls, and data normalization
- Reporting of effect sizes, sample size, confidence intervals, and statistical power
- Data processing and analysis workflows.
We thank the Reviewer for this recommendation. To highlight the statistical analysis carried out, we have added a separate paragraph on statistical analysis to the Methods section (lines 617-628). We note that data processing and normalization were already described. Because this study is exploratory, power calculations are not applicable, but its results can inform power calculations for future hypothesis-testing studies. However, we agree that reporting the sample sizes for each phenotype and the beta estimates would support our results. We have now added them to Table 1.
Reviewer #1 (Significance (Required)):
-This study represents an advance in the context of population-specific studies. Creating a comprehensive Estonian population-specific MAG reference and identifying new species contribute to our understanding of microbiome diversity.
-The work builds upon previous large-scale microbiome projects, such as those that established the Unified Human Gastrointestinal Genome (UHGG) collection but focuses on a specific population.
-The associations between microbial species (including novel ones) and common diseases provide potential avenues for future research into microbiome-based diagnostics or therapeutics.
-The findings would interest microbiome researchers, bioinformaticians, and clinicians interested in the role of the gut microbiome in health and disease.
We thank the Reviewer for the thoughtful feedback and recognition of our study's contributions to microbiome research. By creating an Estonian population-specific MAG reference and identifying new species, we advance population-specific studies and enhance global microbiome diversity. Building on projects like UHGG, we integrate local data into the global context and highlight potential applications in microbiome-based diagnostics and therapeutics. To address your suggestions, we expanded the results section with an example from the Butyricimonas genus. We hope our publicly available data will support future research and further advance understanding of the gut microbiome in health and disease.
Reviewer #2 (Evidence, reproducibility and clarity (Required)):
The manuscript by Pantiukh et al. presents the collection of MAGs assembled from the Estonian Biobank, with a specific focus on the novel species clusters the authors defined and found associations with some of the diseases as collected among the samples available in their biobank. The manuscript is well organized. However, it lacks a bit in terms of novelty and also some statements that can mislead the readers to overinterpret some parts.
Majors
- The last paragraph of the introduction (lines 91-98) anticipates some results but lacks some methodological details. Please consider whether to move it to the results section or add very brief specifications, like (1) "sequence with deep coverage" is vague, how deep is deep? (2) "84,762 MAGs representing 2,257 species" are the 84k MAGs already quality-controlled? (3) "353 MAGs (15,6%) of the EstMB MAGs collection to represent potentially novel species." 353 are MAGs or species? As species clusters are defined later at 95% ANI, are all these 353 defining their own species clusters?
We thank the Reviewer for insightful questions and suggestions. To address these points, we have added the following clarifications to the text:
We specified the sequencing depth by giving the average number of reads per sample, 56 million (Lines 92). We clarified that among the 84,762 assembled MAGs, 42,049 (49.60 %) were high-quality (HQ) MAGs (Lines 93-94). We revised the statement about the 353 MAGs, explicitly noting that they represent potentially novel species. Additionally, we clarified that all 2,257 representative MAGs, including these 353 new-species MAGs, represent separate species clusters based on the 95% ANI threshold mentioned later in the text (Lines 94-98).
In the paper, we included only the figure showing the quality group distribution for species cluster representative MAGs to avoid potential confusion between two similar figures: one for all assembled MAGs (n=84,762) and another for cluster representative MAGs (n=2,257). However, in response to this query, we have added a new __Supplementary Figure S1__ that illustrates the quality group distribution for all assembled MAGs to provide a more comprehensive view.
Figure S1. Quality estimation for the assembled MAGs (n=84,762). High-quality MAGs (HQ) – 42,049; Medium-quality MAGs (MQ) – 26,806; Low-quality MAGs (LQ) – 15,907.
- lines 109 and 265, "11.73 +/- 3.9 Gb data per sample and 56.13 +/- 19.37 million reads per sample", numbers don't match... 11.73 Gbp is about 78M reads at 150nt read length, plus later the average depth is not 56.13 but 53.04, please double check these numbers
We apologize for any misunderstanding. The numbers mentioned in the paper refer to the number of reads and the file size of each compressed *.fasta.gz file. This file size does not directly represent the total base pairs (Gb) for the current metagenome. Instead, it reflects the disk space occupied by the compressed sequencing data, including additional information such as sequence headers. We selected this parameter to provide an easy point of comparison with file sizes from other metagenome sequencing datasets, as *.fasta.gz is a commonly used format for storing sequence data. To clarify further, here is an example of the relationship between these parameters for one sample:
| Sample XX | Value | Meaning | Program |
|---|---|---|---|
| Compressed file size | 4.2 GB | Represents disk space occupied by the compressed sequencing data. This applies to forward reads only; for a rough estimation of the disk space for both forward and reverse reads, it should be multiplied by 2 or calculated separately for both files. | `du -sh V00HXZ.fq1.gz` |
| The total number of reads | 41,062,933 reads (avg. read len = 147.7 bp) | Represents number of forward reads. This applies to forward reads only; for a rough estimation of both forward and reverse reads, it should be multiplied by 2 or calculated separately for both files. | `seqkit stats V00HXZ.fq1.gz -a -T` |
| Total base pairs (Gb) | 6,066,493,002 bp (6.07 Gb) | Represents total base pairs (Gb) for the current sample. This applies to forward reads only; for a rough estimation of both forward and reverse reads, it should be multiplied by 2 or calculated separately for both files. | `seqkit stats V00HXZ.fq1.gz -a -T` |
We now realize this may have caused confusion. To address this, we have calculated the total base pairs (Gb) for both forward and reverse reads and replaced the __Compressed file size__ value with __Total base pairs__ in the following text in the paragraph “Cohort overview and study design” on page 3 of the revised manuscript:
“The EstMB-deep samples were resequenced at deep coverage, generating an average of 16.49 ± 6.2 Gb of total base pairs per sample, or 56.13 ± 19.37 million paired reads per sample, with an average forward read length of 146.85 bp and an average reverse read length of 147.01 bp.”
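As a rough consistency check on the figures quoted above, the relationship between read counts, read length, and total bases can be sketched in a few lines of Python. The function names are illustrative assumptions and not part of the authors' pipeline; the inputs are the rounded averages reported in this response, so the results agree only approximately.

```python
def total_bases(read_pairs: float, fwd_len: float, rev_len: float) -> float:
    """Total sequenced bases for paired-end data.

    Each read pair contributes one forward and one reverse read.
    """
    return read_pairs * (fwd_len + rev_len)


def reads_from_bases(bases: float, read_len: float) -> float:
    """Number of reads implied by a base count at a fixed read length."""
    return bases / read_len


# Reviewer's estimate: 11.73 Gbp at 150 nt per read.
reviewer_reads = reads_from_bases(11.73e9, 150)           # about 78.2 million reads

# Revised manuscript: 56.13 million read pairs, mean lengths 146.85 / 147.01 bp.
cohort_gb = total_bases(56.13e6, 146.85, 147.01) / 1e9    # about 16.49 Gb

# Example sample, forward reads only (single file, so no reverse read length).
sample_fwd_bases = 41_062_933 * 147.7                     # close to the reported 6.07 Gb

print(f"{reviewer_reads / 1e6:.1f} M reads, {cohort_gb:.2f} Gb, "
      f"{sample_fwd_bases / 1e9:.2f} Gb")
```

The small gap between the computed forward-read total and the reported 6,066,493,002 bp is expected, because the average read length in the table is itself rounded.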
- line 118, "completeness > 90% and contamination < 5%", please specify either the Methods section or briefly which tool was used to estimate quality.
We thank the Reviewer for this comment. We used CheckM v2 to evaluate MAG completeness and contamination. We have incorporated the requested information into the manuscript. (Lines 128).
- line 120, "84,762 MAGs were clustered at the species level with an average nucleotide identity (ANI) threshold of 95%.", as for my previous comment, either specify the Methods or quickly mention the tool used for the ANI analysis.
We use dRep with default parameters for clustering. We have incorporated the requested information into the manuscript. (Lines 130).
- lines 135-138, "The bacterial species most represented in our MAGs collection were Odoribacter splanchnicus (MAG recovered from 70.93% samples), Barnesiella intestinihominis (62.83%), Parabacteroides distasonis (60,38%), Alistipes putredinis (54,53%) and Agathobacter rectalis (51.92%) (Figure S2, Table S2).", it will be interesting to compare (some of) these species with other populations, to see if these species are globally prevalent in the human gut microbiome or specific to the Estonian population.
We thank the Reviewer for this question. As highlighted in Figures 4e and 2d, the number of MAGs recovered for a given species often differs significantly from its prevalence in the population. Due to the complexities of MAG assembly, species prevalence is generally much higher, and these values do not correlate linearly, as shown in Supplementary Figure S5. Keeping in mind that the species with the most assembled MAGs are not necessarily the species with the highest prevalence, we compared our top assembled species with the UHGG collection, the most comprehensive and up-to-date collection of gut bacteria, and integrated the following text in the paragraph “Population-specific Metagenome-Assembled Genomes (MAGs) reference” on page 4 of the revised manuscript:
“... All these species are also well-represented in other cohorts. For example, Parabacteroides distasonis, Alistipes putredinis, and Agathobacter rectalis rank among the top 6 species in the UHGG by the number of genomes. Additionally, Barnesiella intestinihominis and Odoribacter splanchnicus rank among the top 40 species out of a total of 4,644 species in the UHGG database.”
- lines 143-144, "MAGs, 353 MAGs (15,64%) represent a new species according to the GTDB criteria.", these 353 MAGs might define fewer species clusters, I think the 'species' word in this sentence is misleading and can lead to an overinterpretation of the diversity, it will be more correct to report how many species clusters these MAGs defined.
We apologize for not providing sufficient clarification. In our case each cluster represented a new distinct species. We added clarification in lines 152-153.
- lines 163-168, the paragraph could be an overinterpretation, as it is unlikely that there is 'infinite' diversity, so it could be that by doubling the samples, there is already a plateau in terms of novel species clusters identified. I think this paragraph should be reconsidered.
We thank the Reviewer for this question. We have updated Figure 2b. Instead of presenting a single version of the cumulative sum of new species discoveries, we reordered the samples five times to provide a more accurate approximation of new species accumulation as the number of samples increases. Additionally, we integrated the following section in the paragraph “Novel species and comparison of the population-specific reference with global reference UHGG” on page 4 in the revised version of the manuscript:
“Our analysis so far shows a clear linear trend without indication of a plateau (although we can not exclude that plateau had been reached exactly at current sample size, which may not yet be evident).”
__Figure 4b.__ The relationship between the number of samples analyzed and the cumulative number of new species identified.
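The reordering procedure described in this response can be illustrated with a minimal Python sketch. It assumes a mapping from sample ID to the set of novel species recovered from that sample; the toy data, function names, and seed are hypothetical and are not taken from the authors' code.

```python
import random


def accumulation_curve(samples, order):
    """Cumulative number of distinct species after each sample in `order`."""
    seen = set()
    curve = []
    for sample_id in order:
        seen |= samples[sample_id]
        curve.append(len(seen))
    return curve


def permuted_curves(samples, n_orderings=5, seed=0):
    """Accumulation curves for several random sample orderings."""
    rng = random.Random(seed)      # fixed seed so the orderings are reproducible
    ids = sorted(samples)
    curves = []
    for _ in range(n_orderings):
        order = ids[:]
        rng.shuffle(order)
        curves.append(accumulation_curve(samples, order))
    return curves


# Hypothetical toy input: sample ID -> novel species recovered from that sample.
toy = {"s1": {"sp_a"}, "s2": {"sp_a", "sp_b"}, "s3": set(), "s4": {"sp_c"}}
curves = permuted_curves(toy)
for curve in curves:
    print(curve)
```

Every ordering ends at the same total, so differences between curves reflect only the order in which samples are added; a plateau would appear as curves that flatten before the last sample.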
- lines 182-184, "Even species which have been recovered from a large number of samples can be found in significantly more samples after mapping (Figure 2e, Table S2).", this is not novel as assembly requires higher coverage than calling a species present via mapping, please, rephrase this part.
We thank the Reviewer for this thoughtful suggestion. We included this point in the article not because of its novelty but to emphasize that even a small number of recovered MAGs can still hold significant value. Even when few genomes are assembled for a species, its prevalence as detected through mapping can be substantial, which makes the species usable in, for example, association studies. We added this perspective based on our own initial disappointment with the small number of MAGs recovered for many new species clusters. Our intention is to prevent similar discouragement among other researchers who begin recovering MAGs from their large population cohorts.
- lines 185-188, "which are usually extracted from a small number of samples, show a prevalence exceeding 80% for some species. For example, Bacteroides faecalis has a prevalence of 97.23%, although only 1 MAG was assembled, and Bacteroides intestinigallinarum has a prevalence of 95.85% although only 2 MAGs were assembled.", this should be much better contextualized and discussed in terms of relative abundance and not only on the ability to reconstruct (which is highly impacted by coverage, which is a proxy for abundance) with its prevalence, it is known in the field that there are very highly prevalent species at very low abundance values, which are not that often reconstructed via metagenomic assembly.
We agree that understanding the causes of assembly complications is important in the field, with abundance playing a key role. Moreover, other factors such as the presence of closely related species with similar genomes or multiple strains of the same species within a sample can significantly impact assembly, even for species with high abundance. However, since this paper focuses on the potential applications of MAG assembly in large population cohorts rather than the technical aspects of assembly, our main goal was to emphasize that MAGs assembled from the samples should not be used to estimate species prevalence.
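The distinction drawn here between mapping-based prevalence and MAG recovery can be made concrete with a small Python sketch. The abundance values, the detection threshold, and the function names are hypothetical and chosen only for illustration; they do not reproduce the authors' profiling settings.

```python
def prevalence(abundance_by_sample, min_abundance=0.0):
    """Fraction of samples in which the species is detected by read mapping."""
    detected = sum(1 for a in abundance_by_sample.values() if a > min_abundance)
    return detected / len(abundance_by_sample)


def mag_recovery_rate(mag_samples, all_samples):
    """Fraction of samples from which a MAG of the species was assembled."""
    return len(set(mag_samples) & set(all_samples)) / len(all_samples)


# Hypothetical relative abundances of one species in four samples.
toy_abundance = {"s1": 0.002, "s2": 0.0005, "s3": 0.04, "s4": 0.0}
# Only the sample with the highest abundance yielded a MAG in this toy case.
toy_mag_samples = ["s3"]

species_prevalence = prevalence(toy_abundance)
species_recovery = mag_recovery_rate(toy_mag_samples, toy_abundance)
print(species_prevalence, species_recovery)
```

In this toy case the species is present in three of four samples but assembled from only one, which is the pattern the response describes for low-abundance, highly prevalent species.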
- Data availability, it appears that the provided accession number does not exist, please double-check this.
We apologize for this issue. The data are now available under accession number PRJEB76860.
Minors
- line 106, "includes 1,308 women (69.64 %) and 570 men (30.35 %)", these sums up to 99.99%, the ratio for women is 1308/1878=0.69648, so can be rounded up to 69.65%.
We thank the Reviewer for this correction. We have corrected the value from 69.64% to 69.65% (Lines 114).
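The rounding in question is easy to verify. The snippet below is a minimal sketch using only the counts quoted in the reviewer's comment; the variable names are illustrative.

```python
women, men = 1308, 570
total = women + men                       # 1,878 participants in EstMB-deep

pct_women = round(100 * women / total, 2)
pct_men = round(100 * men / total, 2)

print(pct_women, pct_men, round(pct_women + pct_men, 2))
```

With both shares rounded to two decimals from the exact ratios, they sum to 100.00 rather than 99.99.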
- line 293, "ones[Philip Hugenholtz, 2008].", citation to fix.
Thank you for the correction. We corrected the links. (Lines 414).
- Fig. 1g, why completeness is up to 25%, from the text it seemed the MAGs were screened for completeness <5%. Like in panel f for contamination that is never below 50%.
We apologize for not providing sufficient clarification. Indeed, as noted in Lines 124-126, "We successfully reconstructed 84,762 metagenome-assembled genomes (MAGs), an average of 45 MAGs per sample. Among these, 42,048 according to CheckM, MAGs (49.6%) have completeness > 90% and contamination [...] 90% and contamination [...] 50% and contamination [...]" (Lines 131-132).
- Fig. 2f says "Blue bars represent", but I believe it should be green instead of blue.
Thank you for the correction. We corrected the color. (Lines 520).
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Referee #2
Evidence, reproducibility and clarity
The manuscript by Pantiukh et al. presents the collection of MAGs assembled from the Estonian Biobank, with a specific focus on the novel species clusters the authors defined and found associations with some of the diseases as collected among the samples available in their biobank. The manuscript is well organized. However, it lacks a bit in terms of novelty and also some statements that can mislead the readers to overinterpret some parts.
Majors
- The last paragraph of the introduction (lines 91-98) anticipates some results but lacks some methodological details. Please consider whether to move it to the results section or add very brief specifications, like (1) "sequence with deep coverage" is vague, how deep is deep? (2) "84,762 MAGs representing 2,257 species" are the 84k MAGs already quality-controlled? (3) "353 MAGs (15,6%) of the EstMB MAGs collection to represent potentially novel species." 353 are MAGs or species? As species clusters are defined later at 95% ANI, are all these 353 defining their own species clusters?
- lines 109 and 265, "11.73 +/- 3.9 Gb data per sample and 56.13 +/- 19.37 million reads per sample", numbers don't match... 11.73 Gbp is about 78M reads at 150nt read length, plus later the average depth is not 56.13 but 53.04, please double check these numbers
- line 118, "completeness > 90% and contamination < 5%", please specify either the Methods section or briefly which tool was used to estimate quality.
- line 120, "84,762 MAGs were clustered at the species level with an average nucleotide identity (ANI) threshold of 95%.", as for my previous comment, either specify the Methods or quickly mention the tool used for the ANI analysis.
- lines 135-138, "The bacterial species most represented in our MAGs collection were Odoribacter splanchnicus (MAG recovered from 70.93% samples), Barnesiella intestinihominis (62.83%), Parabacteroides distasonis (60,38%), Alistipes putredinis (54,53%) and Agathobacter rectalis (51.92%) (Figure S2, Table S2).", it will be interesting to compare (some of) these species with other populations, to see if these species are globally prevalent in the human gut microbiome or specific to the Estonian population.
- lines 143-144, "MAGs, 353 MAGs (15,64%) represent a new species according to the GTDB criteria.", these 353 MAGs might define fewer species clusters, I think the 'species' word in this sentence is misleading and can lead to an overinterpretation of the diversity, it will be more correct to report how many species clusters these MAGs defined.
- lines 163-168, the paragraph could be an overinterpretation, as it is unlikely that there is 'infinite' diversity, so it could be that by doubling the samples, there is already a plateau in terms of novel species clusters identified. I think this paragraph should be reconsidered.
- lines 182-184, "Even species which have been recovered from a large number of samples can be found in significantly more samples after mapping (Figure 2e, Table S2).", this is not novel as assembly requires higher coverage than calling a species present via mapping, please, rephrase this part.
- lines 185-188, "which are usually extracted from a small number of samples, show a prevalence exceeding 80% for some species. For example, Bacteroides faecalis has a prevalence of 97.23%, although only 1 MAG was assembled, and Bacteroides intestinigallinarum has a prevalence of 95.85% although only 2 MAGs were assembled.", this should be much better contextualized and discussed in terms of relative abundance and not only on the ability to reconstruct (which is highly impacted by coverage, which is a proxy for abundance) with its prevalence, it is known in the field that there are very highly prevalent species at very low abundance values, which are not that often reconstructed via metagenomic assembly.
- Data availability, it appears that the provided accession number does not exist, please double-check this.
Minors
- line 106, "includes 1,308 women (69.64 %) and 570 men (30.35 %)", these sums up to 99.99%, the ratio for women is 1308/1878=0.69648, so can be rounded up to 69.65%.
- line 293, "ones[Philip Hugenholtz, 2008].", citation to fix.
- Fig. 1g, why completeness is up to 25%, from the text it seemed the MAGs were screened for completeness <5%. Like in panel f for contamination that is never below 50%.
- Fig. 2f says "Blue bars represent", but I believe it should be green instead of blue.
Significance
General assessment: the manuscript presents a large-scale effort to reconstruct microbial genomes from metagenomes present in the Estonian biobank. This can be very useful in future analyses, as the knowledge of novel and population-specific species can improve future studies linking diseases with the microbiome data. However, the analyses presented are quite limited and also do not provide the perspective of how these newly reconstructed genomes will be integrated into public databases that can improve future microbiome profiling (both at the taxonomic and functional levels).
Advance: the manuscript presents an incremental advancement in the field, and with the limited data made available (the provided accession number was not found in the mentioned database, and only representative MAGs were deposited), it is difficult to assess how much this data can be a resource for the research community.
Audience: the audience targeted at the moment is specialized people in microbiome analysis and in particular those that are focusing on the development of analysis tools and resources (like UHGG and GTDB).
Microbiome expert.
Referee #1
Evidence, reproducibility and clarity
In this paper "Metagenome-assembled genomes of Estonian Microbiome cohort reveal novel species and their links with prevalent diseases", the authors present a comprehensive analysis of metagenome-assembled genomes (MAGs) from the Estonian Microbiome cohort, offering several key insights and contributions to microbiome research. The authors assembled 84,762 MAGs from stool samples of 1,878 individuals in the Estonian Microbiome Cohort, representing 2,257 species. Notably, they identified 353 potentially novel species (15.6%) and 607 species (26.9%) not present in the global Unified Human Gastrointestinal Genome (UHGG) reference database.
Work is timely and important for several reasons:
- It aligns with the growing trend of population-specific microbiome studies, which are crucial for understanding regional variations in gut microbiota.
- Finding new bacterial species and population-specific microbes contributes to the expanding catalog of human-associated microorganisms.
- The associations between microbial species and common diseases provide potential avenues for future research to test those ideas into microbiome-based diagnostics or therapeutics.
Strengths:
- The study provides a valuable population-specific reference for the Estonian gut microbiome, which can enhance the accuracy of future microbiome studies in this population.
- Identifying potentially new bacterial species contributes to our understanding of microbial diversity.
- This work uncovered associations between bacterial abundance and 15 common diseases, including links with potentially new species.
- The study combined deep metagenomic sequencing with extensive phenotypic data, allowing for a more rounded analysis. The paper's focus on an Estonian population and the creation of a population-specific reference set it apart from global microbiome studies. This approach allows for detecting microbial species that need to be included in more general studies.
Drawbacks:
- While the population-specific approach is a strength, it also limits the direct applicability of findings to other populations.
- The study primarily focuses on taxonomic composition at the genus or species level, but a more in-depth functional analysis of the novel species could provide additional insights.
- Is it possible for this large dataset to distill information and have plots for strain diversity of abundant and prevalent species, including low abundance species per donor or between donors? Can authors add such a plot or discuss this? While associations between microbes and diseases were found, the study design cannot establish causal relationships. Are the authors planning to test some of the associations experimentally and see whether these observations work in in vitro or in vivo?
Minor comments:
- The authors could provide more context on how their findings compare to similar studies in other populations. What are the differences and similarities, and how does this work at the next level and set new directions?
- Figures' quality and readability can be improved easily; all of them are low resolution, and the axes are hardly visible, particularly Figure 2, which could benefit from additional labeling or explanations in the legend to improve clarity.
- A brief discussion on the potential clinical implications of the new species-disease associations would enhance the relevance. Why is discovering new species interesting and relevant for the microbiome field? Can the authors add this somewhere, for example in the discussion? In lines 265-266, the authors discuss detected species per sample, on average, 389 species. Can the authors indicate which plot is linked to it and whether it is possible to show the distribution and median number of species per sample to get an overall idea about the range of diversity this type of analysis can capture now? Maybe this will improve in the future; it is worth mentioning here.
Other comments:
- The key conclusions are generally convincing. The authors have successfully assembled a large number of MAGs from the Estonian population, identified potentially novel species, and established associations between microbial abundance and diseases.
- The data presented appear to support the claims well. However, the authors should emphasize and clarify that the disease associations are correlational, not causal, and further validation is required.
- Even though I am not an expert in metagenomics analysis, the current experimental design and analysis are sound to support the main claims.
- The methods section can be improved by providing more details about how samples were collected and stored and how long after storage gDNA was extracted and processed for sequencing, allowing for reproducibility. The authors provide information on the bioinformatics pipelines, including software versions and parameters, but this can again be improved by adding details about the steps between sample processing and raw data processing.
- The study includes a large cohort (1,878 samples), which provides statistical power. The statistical analyses, including linear regression models adjusted for BMI, gender, and age, seem appropriate for the type of data presented. I suggest adding a separate paragraph about how the data is processed and statistically analyzed. Authors should include:
- Appropriateness of the statistical tests used for the data types and experimental designs
- Adequate description and justification of the statistical models and test and assumptions
- Proper handling of replicates, controls, and data normalization
- Reporting of effect sizes, sample size, confidence intervals, and statistical power
- Data processing and analysis workflows
Significance
- This study represents an advance in the context of population-specific studies. Creating a comprehensive Estonian population-specific MAG reference and identifying new species contribute to our understanding of microbiome diversity.
- The work builds upon previous large-scale microbiome projects, such as those that established the Unified Human Gastrointestinal Genome (UHGG) collection but focuses on a specific population.
- The associations between microbial species (including novel ones) and common diseases provide potential avenues for future research into microbiome-based diagnostics or therapeutics.
- The findings would interest microbiome researchers, bioinformaticians, and clinicians interested in the role of the gut microbiome in health and disease.
My Expertise:
Gut microbiome, gut microbiota resilience, ecology, and evolution in microbial communities, antimicrobial resistance, high-throughput drug-bacteria interactions
