Bibliometric Insights into Medical Data Science Applications in Genomics: Evidence from Kaggle and Dimensions Datasets
Abstract
Background
While bibliometrics is widely used to map scientific fields, few studies have integrated non-traditional sources like competitive data science platforms to analyze the gap between theory and practice. This study addresses this methodological gap by developing and applying a dual-source bibliometric framework to the interdisciplinary field of medical genomics data science, aiming to map its complete knowledge lifecycle.
Methods
We analyzed two corpora spanning 2015 to late 2025: academic literature from Dimensions (n=825 publications) and practical challenges from Kaggle (n=15 competitions). Co-authorship, co-citation, and keyword co-occurrence network analyses were used to map the field's social and intellectual structures. Science mapping techniques, including LDA-based thematic analysis with a strategic diagram (Callon’s map) and Kleinberg’s burst detection algorithm, were used to model the field’s evolution and identify emerging research fronts.
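As an illustration of the co-authorship network analysis described above, the following minimal sketch builds an undirected co-authorship graph from per-paper author lists and computes its average clustering coefficient. It is not the study's pipeline: the function names, the toy author labels, and the convention of scoring nodes with fewer than two neighbors as zero are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_graph(papers):
    """Build an undirected graph: authors are nodes, shared papers create edges."""
    graph = defaultdict(set)
    for authors in papers:
        # Every pair of distinct authors on the same paper becomes an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: fewer than two neighbors scores zero
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Hypothetical toy corpus (NOT the study's data): one author list per paper.
papers = [["A", "B", "C"], ["C", "D"], ["D", "E", "F"]]
graph = coauthorship_graph(papers)
avg = average_clustering(graph)
print(f"average clustering coefficient: {avg:.3f}")
```

Because every multi-author paper contributes a fully connected group of authors, co-authorship graphs built this way tend to show high clustering by construction. Single-author papers add no nodes in this sketch.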
Results
Publication growth follows a logistic (S-shaped) model (AIC=60.35, R²=0.998), indicating field maturation. The co-authorship network exhibits a high average clustering coefficient (0.946), confirming a “small-world” structure. Thematic analysis identified 10 distinct topics, with “Core Machine Learning Models” acting as the primary motor theme. A key integrative finding is a measurable diffusion lag for novel architectures like Transformers, where their popularization on Kaggle precedes widespread academic adoption (p<0.05 in lead-lag analysis). Furthermore, open data sharing was found to have a statistically significant positive effect on citation impact (p=0.047).
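To make the logistic growth result concrete, the sketch below fits a three-parameter logistic curve to synthetic yearly counts and reports AIC and R². Everything here is an assumption for illustration: the data are generated, not taken from the study; the coarse grid search stands in for whatever optimizer the authors used; and AIC is computed with the common least-squares form n·ln(RSS/n) + 2k, which may differ from the authors' formula by an additive constant.

```python
import math
from itertools import product


def logistic(t, K, r, t0):
    """Logistic curve with carrying capacity K, growth rate r, midpoint t0."""
    return K / (1.0 + math.exp(-r * (t - t0)))


def rss(params, ts, ys):
    """Residual sum of squares of the logistic model against the data."""
    return sum((y - logistic(t, *params)) ** 2 for t, y in zip(ts, ys))


def fit_logistic_grid(ts, ys, Ks, rs, t0s):
    """Least-squares fit by exhaustive grid search (illustrative, not efficient)."""
    return min(product(Ks, rs, t0s), key=lambda p: rss(p, ts, ys))


def aic_least_squares(rss_value, n, k):
    """AIC under Gaussian errors, up to an additive constant."""
    return n * math.log(rss_value / n) + 2 * k


def r_squared(rss_value, ys):
    """Coefficient of determination: 1 - RSS / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - rss_value / tss


# Synthetic cumulative counts (NOT the study's data): a known logistic
# curve plus small fixed perturbations so the residuals are non-zero.
years = list(range(2015, 2026))
noise = [3, -2, 4, -3, 2, -4, 3, -2, 1, -3, 2]
counts = [logistic(t, 800, 0.9, 2021) + e for t, e in zip(years, noise)]

best = fit_logistic_grid(
    years,
    counts,
    Ks=range(600, 1001, 50),
    rs=[i / 10 for i in range(5, 15)],
    t0s=range(2018, 2025),
)
best_rss = rss(best, years, counts)
aic = aic_least_squares(best_rss, n=len(years), k=3)
r2 = r_squared(best_rss, counts)
print(f"K, r, t0 = {best}; AIC = {aic:.2f}; R^2 = {r2:.4f}")
```

An AIC value is only meaningful relative to the AIC of competing models (for example, linear or exponential growth) fitted to the same data, with the lowest value preferred.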
Conclusions
The integration of practical competition data provides a more nuanced view of a field’s trajectory, revealing innovation pathways and translational gaps not visible from academic data alone. This dual-source framework serves as a valuable model for future bibliometric studies of rapidly evolving, application-driven scientific domains.