Quantitative monitoring of nucleotide sequence data from genetic resources in context of their citation in the scientific literature

Abstract

Background

Linking nucleotide sequence data (NSD) to scientific publication citations can enhance understanding of NSD provenance, scientific use, and reuse in the community. By connecting publications with NSD records, NSD geographical provenance information, and author geographical information, it becomes possible to assess the contribution of NSD to infer trends in scientific knowledge gain at the global level.

Findings

We extracted and linked records from the European Nucleotide Archive (ENA) to citations in open-access publications aggregated at Europe PubMed Central (ePMC). A total of 8,464,292 ENA accessions with geographical provenance information were associated with publications. We conducted a data quality review to uncover potential issues in publication citation information extraction and author affiliation tagging, and we developed and implemented best-practice recommendations for citation extraction. We constructed flat data tables and a data warehouse with an interactive web application to enable ad hoc exploration of NSD use and summary statistics.

Conclusions

The extraction and linking of NSD with associated publication citations enables transparency. The quality review contributes to enhanced text mining methods for identifier extraction and use. Furthermore, the global provision and use of NSD enable scientists worldwide to join literature and sequence databases in a multidimensional fashion. As a concrete use case, we visualized statistics of country clusters concerning NSD access in the context of discussions around digital sequence information under the United Nations Convention on Biological Diversity.

Article activity feed

  1. **Reviewer 3. Takeru Nakazato**

    I have reviewed this manuscript carefully, but I am a little confused in places because I usually work with NCBI PubMed/GenBank data. If my points are off the mark, please point them out.

    1. In NCBI PubMed, the nucleotide sequence entries referenced in an article are listed in the PubMed record as external DB links (although not perfectly), and by extracting these, the relationship between PubMed and Nucleotide entries can be obtained. The NCBI website also provides these links from Nucleotide in the Related information section (e.g. https://pubmed.ncbi.nlm.nih.gov/19193256/). I found that the ePMC website also has links in the Data section for nucleotide sequence entries referenced in a paper (e.g., https://europepmc.org/article/MED/19193256). Do you use any of these external links in the ePMC data in this work? I think it is very difficult to extract nucleotide IDs by text mining, especially since nucleotide sequence IDs do not follow a fixed format, so these links would be a great help for the text mining (see the sketches after this list).
    2. In NCBI PubMed, MeSH keywords are assigned to each article to index the literature, and these include country keywords (e.g. https://pubmed.ncbi.nlm.nih.gov/19193256/). Is it possible to use keywords like MeSH in ePMC? Do you have any opinions about using such country keywords? (The sketches after this list also show how MeSH headings can be retrieved.)
    3. I found some great statistics and visualizations of this data on the site the authors provide for it. I would be happy to see these included in the manuscript as results of this work, but please follow the journal's policies and precedents.
    4. Do the authors expect users to reuse the data created for this project, or is it recommended that users create their own data using the provided programs? If the former, what is your plan for the frequency of data updates?
    5. In Figure 1, I felt that it would be easier for the reader to understand if you emphasized (by changing the line or fill of each box) whether the data in each step are Nucleotide data, literature data, or ID pairs extracted from those data.
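    Regarding point 1, the following is a minimal sketch of how the cross-references that ePMC already exposes in the Data section could be retrieved programmatically instead of text mining the accessions. The endpoint path, the `database` parameter, and the JSON layout assumed here are not taken from the manuscript and should be checked against the current Europe PMC RESTful API documentation.

    ```python
    # Sketch (not from the manuscript): fetch database cross-references that
    # Europe PMC has already linked to an article, e.g. ENA/EMBL accessions,
    # rather than re-mining them from the full text.
    import requests

    EPMC = "https://www.ebi.ac.uk/europepmc/webservices/rest"

    def fetch_nucleotide_links(pmid, database="EMBL"):
        """Return accessions cross-referenced to a PubMed article in Europe PMC."""
        url = f"{EPMC}/MED/{pmid}/databaseLinks"
        resp = requests.get(url, params={"database": database, "format": "json"},
                            timeout=30)
        resp.raise_for_status()
        payload = resp.json()

        # Defensive traversal: the exact nesting of the response is an assumption.
        refs = payload.get("dbCrossReferenceList", {}).get("dbCrossReference", [])
        accessions = []
        for ref in refs:
            for info in ref.get("dbCrossReferenceInfo", []):
                acc = info.get("info1")  # the accession is assumed to be in 'info1'
                if acc:
                    accessions.append(acc)
        return accessions

    if __name__ == "__main__":
        # Example article mentioned in the review above.
        print(fetch_nucleotide_links("19193256"))
    ```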
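    Regarding point 2, this is a small sketch of pulling the MeSH headings (including geographical descriptors such as country names) assigned to a PubMed article via NCBI E-utilities; whether and how these map onto ePMC keywords is the open question raised above.

    ```python
    # Sketch: retrieve the MeSH descriptors assigned to a PubMed article with
    # NCBI E-utilities (efetch). Country descriptors such as "Japan" appear in
    # the same MeshHeadingList as topical descriptors.
    import requests
    import xml.etree.ElementTree as ET

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    def mesh_descriptors(pmid):
        """Return the MeSH descriptor names assigned to one PubMed record."""
        resp = requests.get(EUTILS,
                            params={"db": "pubmed", "id": pmid, "retmode": "xml"},
                            timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.text)
        return [d.text for d in root.findall(".//MeshHeading/DescriptorName")]

    if __name__ == "__main__":
        # Example article mentioned in the review above.
        print(mesh_descriptors("19193256"))
    ```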
  2. **Reviewer 2. Michael Fire**

    The idea of curating this dataset is important and can contribute to the scientific community. Additionally, for the most part, the paper is well written. However, the manuscript has some major issues that need to be solved before it is ready for publication.

    The Good:

    • The dataset presented in the paper can be very useful to the scientific community.
    • The authors invested considerable effort in making the paper reproducible; both the project's code and dataset are open.
    • The project has a friendly and helpful web interface.

    Things that need to be improved:

    Major Issues:
    • Although this paper is not a standard research paper, the article lacks context relative to other works. I believe the context of the manuscript would be clearer with a Related Work section that provides an overview of other papers that have generated similar datasets.
    • Most of the analysis is based on the PubMed dataset, which is relatively small. There are other open datasets that I think it is important to use to get a fuller picture, such as Microsoft Academic, AMiner, Semantic Scholar, bioRxiv, and arXiv. I understand that performing a full-text search on these datasets can be challenging; however, the paper's results need to be validated using some of them.
    • The manuscript's quality needs to be improved (text, figure resolution, etc.).

    Minor Issues:
    • In my opinion, the overall structure of the paper can be improved.
    • There is no need to explain the FAIR data principles.
    • Using the Microsoft Academic dataset can assist in mapping each author to a unique ID.
    • Mapping an institute or location to a country can be done more accurately by using geocoding packages such as geopy (see the sketch after this list).
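    As an illustration of the geopy suggestion, here is a minimal sketch of resolving a free-text affiliation string to a country code with Nominatim. The affiliation strings are made up for illustration, and production use would need caching and adherence to the Nominatim usage policy; this is not the method used in the paper.

    ```python
    # Sketch: resolve free-text affiliation strings to a country with geopy.
    from typing import Optional

    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent="nsd-citation-monitoring-demo")
    # Respect the Nominatim usage policy: at most ~1 request per second.
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    def affiliation_to_country(affiliation: str) -> Optional[str]:
        """Return an ISO 3166-1 alpha-2 country code for an affiliation string."""
        location = geocode(affiliation, addressdetails=True, language="en")
        if location is None:
            return None
        return location.raw.get("address", {}).get("country_code")

    if __name__ == "__main__":
        # Hypothetical affiliation strings, not taken from the dataset.
        for text in ["EMBL-EBI, Wellcome Genome Campus, Hinxton",
                     "University of Tokyo, Japan"]:
            print(text, "->", affiliation_to_country(text))
    ```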

    Re-review:

    After reading the submitted "policy paper", the goal and contributions of this study and dataset became clear. I believe this dataset and data visualization interface can be beneficial for the academic community. I think the paper will be ready for publication after fixing the following minor issues:

    • It is very challenging to understand Figure 1. I recommend adding additional figures that explain in more detail how each part of the system works.
    • Even though the quality of the figures has been improved, they are still of low quality and hard to read, especially Figure 4.
    • The paper needs to be carefully proofread for punctuation mistakes.
  3. **Reviewer 1. Gianmaria Silvello**

    Figure 1 is not readable. The sampling process lowered the quality of the image and made the text unreadable; please use vector images (e.g., PDF or EPS). In any case, I could understand the process from the descriptive text.

    Figure 2 is readable, but the quality is relatively low. Nevertheless, I do not think this figure is essential; it is a simple logical schema of a relational database. Uploading the SQL dump or the SQL schema to an external repository and referencing it in the paper would be enough.

    The sentence "we imported an ORACLE SQL data warehouse that employs state-of-the-art database technologies" is not very clear. What do you mean by "imported a data warehouse"? Could you provide more details about the DBMS you used? To my understanding, you designed a relational model and then implemented it in SQL using an Oracle DBMS (MySQL? or the native Oracle DBMS?) to store and query the data. Check the description on page 9 and add some details to avoid confusion. This is not a key passage, though; I am sure that you handled the data somehow, and the paper's focus is not on this.

    "Reference integrity between the tables was checked" is a "weird" statement. Referential integrity is a constraint that guarantees the consistency of data: the integrity is checked when you store the data in the database, and if it is not validated, the data cannot be stored. So I do not understand this sentence, which is not explained further. Indeed, the paragraph continues by talking about the SQL queries used to count the paper identifiers (this is not directly linked to referential integrity, or at least you should explain what you mean; a sketch of such a post-hoc check follows this review).

    A recent analysis of issues related to ORCID iDs and duplication of IDs can be found here: http://ceur-ws.org/Vol-2816/paper10.pdf

    Table 1 is not that useful; it can be stated in the text that you ran the experiment and verified the discrepancies between open-access publications and paywalled papers. It is a well-known problem, and it is not analyzed in depth here. I think you can remove it without affecting the quality of the paper.

    Figure 4, like all the other images, is not readable. I directly accessed the web app, which works fine.

    The paper is well written, and the data collection is fine. Nevertheless, the article is a bit anticlimactic because it does not provide many insights. You discuss what we can do with the data, but there is little analysis of the data themselves. We could use some more in-depth analysis and a few insights about the achievable outcomes using the collected data. Also, more about the best practices that should be defined in the field would be a nice addition.
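    To make the distinction between a database constraint and a post-hoc count concrete, here is a hypothetical sketch of what such a check on the exported flat tables could look like; the file and column names are invented for illustration and do not come from the paper.

    ```python
    # Sketch: a post-hoc referential-integrity check over hypothetical flat
    # tables, plus the separate identifier count the paper reports.
    import pandas as pd

    # Hypothetical exports: one row per publication, one row per
    # accession-publication link.
    publications = pd.read_csv("publications.csv")          # columns: pmcid, ...
    links = pd.read_csv("accession_publication_links.csv")  # columns: accession, pmcid

    # Every link must point at an existing publication row (the equivalent of
    # a FOREIGN KEY constraint enforced at load time).
    orphans = links[~links["pmcid"].isin(publications["pmcid"])]
    print(f"{len(orphans)} link rows reference a missing publication")

    # Counting distinct publication identifiers is a separate question.
    print(links["pmcid"].nunique(), "distinct publication identifiers")
    ```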

    Re-review:

    The authors comprehensively answered this reviewer's comments. The quality of the paper has improved, and the modifications are in line with what was expected. I have no further observations.