Matching data references and institutional output to map the reuse of biomedical research data
Abstract
Data sharing is increasingly expected by funders, journals, institutions, and other actors in the research system. This expectation is based primarily on the assumption that some of the shared data will be reused. Reuse can take many forms and serve many purposes, but currently the only way to detect reuse of data at scale is to screen the research literature. We used the Data Citation Corpus to detect instances in which datasets shared by researchers from our biomedical research institution had been referenced, which we interpreted as indicative of reuse. We observed that 4.9% of datasets shared between 2020 and 2023 had been referenced by February 2025. Overall, 175 Charité datasets from this period were reused, and together they had been referenced 1497 times. The large majority of reused datasets were from ‘omics’ fields, which generate particularly structured and standardized data. Data from humans and COVID-19-related data had a higher probability of reuse. Dataset properties indicative of technical reusability, such as DOIs and CC licenses, had no positive influence on reuse probability. We conducted a complementary analysis of datasets shared alongside data articles. Here, we observed that a major fraction of references to data articles were in fact referring to shared datasets. Based on a sample, we extrapolated that indirect data citations account for approximately 846 additional reuse cases. Unlike data references, indirect data citations often referred to datasets in disciplinary repositories from fields outside of ‘omics’, as well as in general-purpose repositories. Our study shows that data are reused on a large scale, but at the same time reuse of data in many fields is limited, especially if data are deposited in general-purpose repositories. Data are reused most if they are shared with disciplinary metadata and/or described by data articles.
Our methods do not allow us to map data reuse in its entirety, and further large-scale studies on the determinants, temporal patterns and purposes of data reuse are needed.
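
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the function names `normalize_id` and `count_references`, and all identifiers below are hypothetical toy values, and the real Data Citation Corpus schema may differ. The sketch only shows the general idea of normalizing dataset identifiers, matching them against reference records, and deriving the share of datasets referenced at least once.

```python
from collections import Counter


def normalize_id(identifier: str) -> str:
    """Lowercase an identifier and strip common DOI prefixes so variants compare equal."""
    ident = identifier.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if ident.startswith(prefix):
            ident = ident[len(prefix):]
            break
    return ident


def count_references(institutional_ids, corpus_records):
    """Count how often each institutional dataset appears in the reference records.

    `corpus_records` is assumed (hypothetically) to be an iterable of dicts
    with a "dataset" key holding the referenced dataset identifier.
    """
    wanted = {normalize_id(i) for i in institutional_ids}
    counts = Counter()
    for record in corpus_records:
        dataset = normalize_id(record["dataset"])
        if dataset in wanted:
            counts[dataset] += 1
    return counts


# Toy input: four institutional datasets (made-up DOIs and accession numbers).
institutional_ids = [
    "https://doi.org/10.1234/ABC",
    "GSE000001",
    "10.1234/def",
    "PXD000002",
]

# Toy reference records: one entry per (dataset, citing publication) pair.
corpus_records = [
    {"dataset": "10.1234/abc", "publication": "10.5555/pub1"},
    {"dataset": "doi:10.1234/ABC", "publication": "10.5555/pub2"},
    {"dataset": "gse000001", "publication": "10.5555/pub3"},
    {"dataset": "10.9999/other", "publication": "10.5555/pub4"},
]

counts = count_references(institutional_ids, corpus_records)
referenced_share = len(counts) / len(institutional_ids)  # datasets referenced at least once
total_references = sum(counts.values())                  # all matched references

print(f"{referenced_share:.1%} of datasets referenced, {total_references} references in total")
```

In this toy input, two of the four datasets are matched, with three references in total. Applying the same ratio to the figures reported in the abstract (1497 references to 175 reused datasets) gives an average of roughly 8.6 references per reused dataset.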