SMARTER-database: a tool to integrate SNP array datasets for sheep and goat breeds

Curation statements for this article:
  • Curated by GigaByte

    Editor's Assessment:

    This paper presents the SMARTER database, a collection of tools and scripts to gather, standardize, and share with the scientific community a comprehensive dataset of genomic data and metadata on worldwide small ruminant populations. It has come out of the EU multi-actor (12-country) H2020 project SMARTER: SMAll RuminanTs breeding for Efficiency and Resilience, and brings together genotypes for about 12,000 sheep and 6,000 goats, alongside phenotypic and geographic information. The paper provides insight into how the database was put together and presents the code for the SMARTER frontend, backend, and API, alongside instructions for users. Peer review tested the platform and provided suggestions on improving the metadata, demonstrating that the project provides valuable information on sheep and goat populations around the world that can be an essential tool for ruminant researchers, enabling them to generate new insights, store new genotypes, and drive progress in the field.

    This evaluation refers to version 1 of the preprint

Abstract

Underutilized sheep and goat breeds can adapt to challenging environments due to their genetics. Integrating publicly available genomic datasets with new data will facilitate genetic diversity analyses; however, this process is complicated by data discrepancies, such as outdated assembly versions or different data formats. Here, we present the SMARTER-database, a collection of tools and scripts to standardize genomic data and metadata, mainly from SNP chip arrays on global small ruminant populations, with a focus on reproducibility. SMARTER-database harmonizes genotypes for about 12,000 sheep and 6,000 goats to a uniform coding and assembly version. Users can access the genotype data via File Transfer Protocol and interact with the metadata through a web interface or using their custom scripts, enabling efficient filtering and selection of samples. These tools will empower researchers to focus on the crucial aspects of adaptation and contribute to livestock sustainability, leveraging the rich dataset provided by the SMARTER-database.

Availability and implementation: The code is available as open-source software under the MIT license at https://github.com/cnr-ibba/SMARTER-database.

Article activity feed

  2. Abstract: Underutilized sheep and goat breeds have the ability to adapt to challenging environments due to their genetic composition. Integrating publicly available genomic datasets with new data will facilitate genetic diversity analyses; however, this process is complicated by important data discrepancies, such as outdated assembly versions or different data formats. Here we present the SMARTER-database, a collection of tools and scripts to standardize genomic data and metadata, mainly from SNP chip arrays on global small ruminant populations, with a focus on reproducibility. SMARTER-database harmonizes genotypes for about 12,000 sheep and 6,000 goats to a uniform coding and assembly version. Users can access the genotype data via FTP and interact with the metadata through a web interface or programmatically using their custom scripts, enabling efficient filtering and selection of samples. These tools will empower researchers to focus on the crucial aspects of adaptation and contribute to livestock sustainability, leveraging the rich dataset provided by the SMARTER-database.

    This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.139). These reviews are as follows.

    Reviewer 1. Ran Li

    The authors present the online SMARTER-database, which collects a large amount of genotype data for sheep and goats. The resource is of great importance for the community.

    My major concerns:

    1. The link below is not accessible: webserver.ibba.cnr.it
    2. For sheep, the database supports the Oar3 and Oar4 reference genome assemblies, but Oar3 is rarely used, whereas the current ovine reference genome assembly (ARS-UI_Ramb_v3.0) is not supported.
    3. For the presentation of metadata (https://webserver.ibba.cnr.it/smarter/breeds?species=Sheep), I suggest additional columns indicating the region and country should be provided.
    4. For the datasets (https://webserver.ibba.cnr.it/smarter/datasets), references are needed to know where the data are from.

    Re-review:

    My comments have been properly addressed. The manuscript is acceptable for publication.

    Reviewer 2. Hans Lenstra and Johannes A. Lenstra

    Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? Yes. This is implicitly clear and does not need to be elaborated upon.

    As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code? No. This does not seem to be necessary.

    Is the code executable? Unable to test.

    Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Unable to test.

    Is the documentation provided clear and user friendly? Yes. I did not test this.

    Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level? No. I did not see such a list, but I would not be able to assess this.

    Have any claims of performance been sufficiently tested and compared to other commonly-used packages? Not applicable.

    Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No. I did not find any of this but it does not seem to be essential.

    Additional Comments: This manuscript describes a highly useful database of sheep and goat genome-wide SNP genotypes from several sources, supplemented with phenotypes and geographic locations. I recommend this manuscript for publication in GigaScience after a revision. Some information is missing, and the presentation should become less cryptic to readers who are less familiar with bioinformatic terminology.

    Missing info.

    1. The title and abstract do not mention that SMARTER focuses on SNPs that are genotyped on bead arrays or related technologies. The focus on the genome-wide (GW) SNP genotypes, which only partially represents the total genomic diversity, should already be clear from the Title and the Abstract.
    2. Nowadays there are more publications on WGS data, T2T sequences and pangenomes than on GW SNP genotypes, so people may wonder if the GW SNP genotypes still are useful. It may be emphasized that bead-arrays allow an affordable analysis of many animals and that genotypes derived from WGS data contain many false homozygote scores if not sequenced at a very high coverage.
    3. Figures 2 and 3 give an idea of the geographic coverage, but what is the unit of the numbers that are visualized in the heat map (0 to 2300 for sheep, 0 to 1100 for goats)?
    4. It is not clear which published data have been used or not. We recommend presenting a supplemental table describing the current contents: country, breed, number of animals, number of SNPs (at least 50K or HD), reference.
    5. Is there an organized effort to update the database, which ideally should contain all published GW SNP databases?
    6. In my experience, for most GW SNP datasets only the filtered data after quality control (typically 45 to 49K, less than 42K if sheep 50K and HD genotypes are combined) are available. How is this handled?
    7. It may be mentioned that after omission of A/T and G/C SNPs the TOP strand consists only of A/C and A/G SNPs.
    8. The problematic SNPs are mentioned twice within the last paragraph of the section Data Composition.
    9. Does SMARTER allow phased datasets to be stored and variant haplotypes to be shown? These can now be generated by long-read sequencing and are needed for several downstream analysis options.
    10. Table 1: OAR3 = Oar_v3.1 and OAR4 = Oar_v4.0? Please use the official codes.
    11. Are there options to convert the data to newer assemblies? For instance, the sheep ARS-UI_Ramb_v3.0 is superior to Oar_v4.0. I have used an NCBI tool for conversion of Oar_v1.0 (most popular for 50K datasets) and Oar_v3.1 (used often for sheep HD datasets) to Oar_v4.0, but this tool has probably been discontinued and was not available for goat assemblies.
    12. I have repeatedly found that most published or unpublished databases contain several errors, such as duplicates and outliers caused by mislabeling or crossbreeding. Because these are better removed prior to downstream analysis, data curation would be desirable, for instance by an inspection of an NJ tree of individuals. This also shows the degree of breed-level differentiation, for instance the relationships of different populations of a transboundary breed. These caveats should at least be mentioned.
    13. Another caveat: is there a systematic check on the validity of the merging of datasets by testing if breeds sampled independently by different institutes cluster closely together?

    Presentation.

    14. Abbreviations should not be used in the abstract. What is “REST API”? These abbreviations of course are in the list, but what is “Representational State Transfer”? And “JSON Web Token”?
    15. Figure 1 needs more guidance via the legend. Do the boxes show alternative formats? What are “str” and “dict”?
    16. Figure 5 is useful and seems to retrieve data for the goat Alpine and Bionda dell'Adamello breeds. It would also be useful to show other “API-URL” values (is this user input?) while describing in plain language what is being accomplished.
    17. Figure 6: does bold indicate the user input? What exactly is an “array [string]” (give an example)? A few other examples may be most instructive and familiarize the reader with the logic of SMARTER.
    18. In the section “The SMARTER-database project”: what is a mongoengine?
    19. In the same section: “Finally the VariantSpecie abstract class is inherited by . . .”: this sentence is difficult to understand.
    20. In the section Reproducibility: please give a short description of what the Conda and Docker programs are used for.
    21. Same section: “Raw data undergoes initial exploration”, “structure and potential issues”: can you be more specific? The last part of this section is also difficult to follow.
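
The remark in point 7 about TOP-strand coding can be illustrated with a minimal sketch. It assumes the usual Illumina convention for unambiguous SNPs (A/C and A/G are TOP as written; T/C and T/G are BOT and are complemented to obtain their TOP form) and leaves out the flanking-sequence walk needed for A/T and C/G SNPs. The names `to_top`, `COMPLEMENT` and `AMBIGUOUS` are hypothetical and are not taken from the SMARTER-database code.

```python
from itertools import combinations

# Watson-Crick complements, used to flip a BOT-designated SNP to TOP.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Allele pairs whose strand cannot be decided from the alleles alone.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def to_top(allele1: str, allele2: str) -> tuple:
    """Return the alleles as read on the TOP strand, sorted alphabetically."""
    pair = frozenset((allele1, allele2))
    if len(pair) != 2 or not pair.issubset(COMPLEMENT):
        raise ValueError(f"not a biallelic SNP: {allele1}/{allele2}")
    if pair in AMBIGUOUS:
        # A/T and C/G require flanking-sequence walking (not shown here).
        raise ValueError(f"ambiguous SNP: {allele1}/{allele2}")
    if "T" in pair:
        # T/C and T/G are BOT as written; complement them to get TOP.
        pair = frozenset(COMPLEMENT[a] for a in pair)
    return tuple(sorted(pair))


# Enumerate every unordered allele pair and keep the unambiguous ones.
top_pairs = set()
for a, b in combinations("ACGT", 2):
    try:
        top_pairs.add(to_top(a, b))
    except ValueError:
        pass  # A/T and C/G are skipped

print(sorted(top_pairs))  # [('A', 'C'), ('A', 'G')]
```

Of the six possible allele pairs, the four unambiguous ones collapse to A/C or A/G on the TOP strand, which is the simplification the reviewer suggests stating explicitly.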