PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata

Abstract

As biological data increases, we need additional infrastructure to share it and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important, and in some ways has a wider scope than sharing data itself.

Results

Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data, or to share new data.

Availability

https://pephub.databio.org

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer name: Weiwen Wang (R1)

    The author has addressed most of my concerns, although some issues remain unresolved due to hardware and technical limitations.

  2. Reviewer name: Weiwen Wang (original submission)

    This manuscript by LeRoy et al. introduces PEPhub, a database aimed at enhancing the sharing and interoperability of biological metadata using the PEP framework. One of the key highlights of this manuscript is the visualization of the PEP framework, which improves the adoption of the PEP framework, facilitating the reuse of metadata. Additionally, PEPhub integrates data from GEO, making it convenient for users to access and utilize. Furthermore, PEPhub offers metadata validation, allowing users to quickly compare their PEP with other PEPhub schemas. Another notable feature is the natural language search, which further enhances the user experience. Overall, PEPhub provides a comprehensive solution that promotes efficient metadata sharing, while leveraging the impact of the PEP framework in organizing large-scale biological research projects.

    While this manuscript was interesting to read, I have several concerns regarding its "semantic" search system and user interaction with PEPhub.

    1. The authors mentioned their use of a tool called "pepembed" to embed PEP descriptions into vectors. However, I was unable to locate the tool on GitHub, and there is limited information in the Method section regarding this. Could the authors provide additional details regarding the process of embedding vectors?
    2. The authors present semantic search as an advantage of PEPhub. However, they did not evaluate the effectiveness of their natural language search engine, such as assessing accuracy, recall rate, or F1 score. It would be beneficial for the authors to perform an evaluation of their natural language search engine and provide metrics to demonstrate its performance. This would enhance the credibility and reliability of their claims regarding the advantages of natural language search in PEPhub.
    3. It would be more beneficial to include the metadata in the search system rather than solely relying on the project description. For instance, when I searched for SRX17165287 (https://pephub.databio.org/geo/gse211736?tag=default), no results were returned.
    4. When creating a new PEP, it appears that I can submit two samples with identical values. According to the PEP framework guidelines, "Typically, samples should have unique values in the sample table index column". Therefore, the authors should enhance their metadata validation system to enforce this uniqueness constraint. Additionally, if I enter two identical values in the sample field and then attempt to add a SUBSAMPLE, an error occurs. However, when I modify one of the samples, I am able to save it successfully.
    5. The error messages should provide more specific guidance. Currently, when attempting to save metadata with an incorrect format, all error messages are displayed as: "Unknown error occurred: Unknown".
    6. PEPhub should consider providing user guidelines or examples on how to fill in subsample metadata and any relevant rules associated with it.
    7. In the Validation module, what are the rules for validation? Does it only check for the required column names in the schema, or does it also validate the content of the metadata, such as whether the metadata is in the correct format (e.g., int or string)? Additionally, it would be beneficial to provide an option to download the relevant schema and clearly specify the required column names in the schema. This would enable users to better organize their PEP to comply with the schema format and ensure that their metadata is accurately validated.
    8. This version of PEPhub primarily focuses on metadata. Have the authors considered any plans to expand this database to include data/pipeline management within the PEP framework? It would be valuable for the authors to discuss their future plans for PEPhub in this manuscript.

    Some minor concerns:

    1. When searching for content within a specific namespace, it would be beneficial for the pagination bar at the bottom of the webpage to display the number of pages. Currently there are only Previous/Next buttons.
    2. As a web service, it would be better to list the supported browsers, such as Google Chrome (version xxx and above) and Firefox (version xxx and above). I was unable to open the PEPhub website using an old version of Chrome.
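To make main points 4 and 7 of this review concrete, the kind of check being requested can be sketched as follows. This is only an illustration in plain Python under stated assumptions: the schema, the column names (`sample_name`, `read_count`), and the function `validate_samples` are hypothetical and do not reflect how PEPhub or its validation tooling is actually implemented.

```python
from collections import Counter

# Hypothetical schema: column name -> expected Python type.
# These names are illustrative and are not taken from a real PEPhub schema.
EXAMPLE_SCHEMA = {"sample_name": str, "read_count": int}


def validate_samples(samples, schema=EXAMPLE_SCHEMA, index_column="sample_name"):
    """Return a list of human-readable error messages (empty if valid)."""
    errors = []

    # Per-row checks: required columns are present and values have the expected type.
    for row_number, sample in enumerate(samples, start=1):
        for column, expected_type in schema.items():
            if column not in sample:
                errors.append(f"row {row_number}: missing required column '{column}'")
            elif not isinstance(sample[column], expected_type):
                errors.append(
                    f"row {row_number}: column '{column}' should be "
                    f"{expected_type.__name__}, got {type(sample[column]).__name__}"
                )

    # Table-level check: values in the index column must be unique.
    counts = Counter(s[index_column] for s in samples if index_column in s)
    for value, count in counts.items():
        if count > 1:
            errors.append(
                f"duplicate value '{value}' in index column '{index_column}' ({count} rows)"
            )
    return errors


samples = [
    {"sample_name": "frog_1", "read_count": 1200},
    {"sample_name": "frog_1", "read_count": "many"},
]
for message in validate_samples(samples):
    print(message)
```

Returning a list of specific messages, rather than raising on the first problem, also relates to point 5: each message names the row, the column, and what was expected.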

  3. Reviewer name: Jeremy Leipzig (original submission)

    Metadata describes the who, what, where, when, and why of an experiment. Sample metadata is arguably the most important of these, but not the only type. LeRoy et al. describe a user-centric sample metadata management system with extensibility, support for multiple interface modalities, and fuzzy semantic search.

    This system and portal, PEPhub, bridges the gaps between LIMS, which are tightly bound to the wet lab, metadata fetchers like GEOfetch (from the same group) or pysradb, and public portals like MetaSRA and the others listed in . Then and both of which don't allow you to roll your own portal internally, and whose search criteria are not fuzzy or semantic.

    People have been storing metadata in bespoke databases for decades, but not in an interoperable, mature fashion. The PEPhub portal builds on some existing PEP standards by the same group, introducing a RESTful API and GUI.

    I find this paper a novel and compelling submission but would like the following minor revisions:

    1. Typically in SRA a sample refers to a DNA sample drawn from a tissue sample (i.e., BioSample), and then runs describe sequencing attempts on those DNA samples, and files are produced from each of the runs. It is unclear to me how someone working in an internal lab using PEPhub would know how to extract the file locations of sequence files associated with a sample if these are many-to-one. In the GEO example provided I can click on the SRX link to see the runs and files, but how would this work for an internally generated entry? I need the authors to explain this either as a response or in the text.
    2. I think the paper has to briefly describe how the authors envision PEPhub interfacing with or replacing a LIMS for labs that are producing their own data, and describe how it can help accelerate the SRA submission process for these data-generating labs.
    3. Change "Bernasconi2021" to META-BASE in the text.
    4. Some of the search confidence measures show an absurd level of significant digits (e.g., 56.99999999999999%). Please round that, as it is only used for sorting.
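
The rounding requested in the last point is a display-only change. A minimal sketch in Python, assuming the score is a float between 0 and 1 that is multiplied by 100 for display (the function name `format_confidence` is hypothetical and not part of PEPhub):

```python
def format_confidence(score: float, decimals: int = 1) -> str:
    """Format a similarity score in [0, 1] as a percentage string.

    Only the displayed text is rounded; the unrounded score can still be
    used for sorting results.
    """
    return f"{score * 100:.{decimals}f}%"


# A value such as 0.57 can show many digits once multiplied by 100,
# because of binary floating-point representation.
print(format_confidence(0.57))
print(format_confidence(0.5, decimals=0))
```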