Matching data references and institutional output to map the reuse of biomedical research data

Avihay Cohen
Blanka Ivanovic
Anastasiia Iarkaeva
Vladislav Nachev
Evgeny Bobrov

Abstract

Data sharing is increasingly expected by funders, journals, institutions, and other actors in the research system. This expectation is based primarily on the assumption that some of the shared data will be reused. Reuse can take many forms and serve many purposes, but currently the only way to detect reuse of data at scale is to screen the research literature. We used the Data Citation Corpus to detect instances in which datasets shared by researchers from our biomedical research institution had been referenced, which we interpreted as indicative of reuse. We observed that 4.9% of datasets shared between 2020 and 2023 had been referenced by February 2025. Overall, 175 Charité datasets from this period were reused; together, they had been referenced 1497 times. The large majority of reused datasets were from ‘omics’ fields, which generate particularly structured and standardized data. Data from humans and COVID-19-related data had a higher probability of reuse. Dataset properties indicative of technical reusability, such as DOIs and CC licenses, had no positive influence on reuse probability. We conducted a complementary analysis of datasets shared alongside data articles. Here, we observed that a major fraction of references to data articles were in fact referring to the shared datasets. Based on a sample, we extrapolated that indirect data citations account for approximately 846 additional reuse cases. Unlike data references, indirect data citations often referred to datasets in disciplinary repositories from fields outside of ‘omics’, as well as in general-purpose repositories. Our study shows that data are reused on a large scale, but at the same time reuse of data in many fields is limited, especially if data are deposited in general-purpose repositories. Data are reused most if they are shared with disciplinary metadata and/or described by data articles. Our methods do not allow us to map data reuse in its entirety, and further large-scale studies on the determinants, temporal patterns, and purposes of data reuse are needed.

Version published to 10.31222/osf.io/z9kjf_v1 on OSF Preprints
Mar 31, 2026

Related articles

Accelerating metadata annotation in collaborative research centers: A hybrid AI workflow for biomedical entities

This article has 9 authors:
1. Manuel Watter
2. Felix Engel
3. Aref Kalantari
4. Claudia Giuliani
5. Karin Schuller
6. Claus-Werner Franzke
7. Markus Sperandio
8. Harald Binder
9. Klaus Kaier
This article has no evaluations. Latest version: Mar 30, 2026

Manuscript submission systems and metadata completeness in Crossref: patterns and associations

This article has 2 authors:
1. Hans de Jonge
2. Bianca Kramer
This article has no evaluations. Latest version: Feb 23, 2026

Construction of Personal Health Knowledge Graphs for Clinical Data Harmonization in Breast Cancer

This article has 16 authors:
1. Wenjie Liang
2. Rutger van Mierlo
3. Anne-Lore Bynens
4. Remzi Celebi
5. Ensar Erol
6. Ömer Durukan Kılıç
7. Isabelle de Zegher
8. Katerina Serafimova
9. Todor Primov
10. Svetla Boytcheva
11. Michaela Kargl
12. Mall Maasik
13. Cecile J.A. Wolfs
14. Aiara Lobo Gomes
15. Andre Dekker
16. Petros Kalendralis
This article has no evaluations. Latest version: Feb 9, 2026
