Bibliometric Insights into Medical Data Science Applications in Genomics: Evidence from Kaggle and Dimensions Datasets

Faraz Shamim
Raziya Akhtar Hussain

Abstract

Background

While bibliometrics is widely used to map scientific fields, few studies have integrated non-traditional sources like competitive data science platforms to analyze the gap between theory and practice. This study addresses this methodological gap by developing and applying a dual-source bibliometric framework to the interdisciplinary field of medical genomics data science, aiming to map its complete knowledge lifecycle.

Methods

We analyzed two corpora covering 2015 to late 2025: academic literature from Dimensions (n=825 publications) and practical challenges from Kaggle (n=15 competitions). Co-authorship, co-citation, and keyword co-occurrence network analyses were used to map the field's social and intellectual structures. Science mapping techniques, including LDA-based thematic analysis with a strategic diagram (Callon’s map) and Kleinberg’s burst detection algorithm, were used to model the field’s evolution and identify emerging research fronts.
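As a rough illustration of the keyword co-occurrence step, the sketch below counts how often two keywords appear in the same record; the resulting pair counts can serve as edge weights in a co-occurrence network. The function name `cooccurrence_edges` and the toy records are assumptions made for illustration only and are not taken from the study's pipeline or data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_lists):
    """Count how often each unordered keyword pair appears in the same record."""
    edges = Counter()
    for keywords in keyword_lists:
        # Normalise case and de-duplicate so a keyword counts once per record.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy records, not drawn from the Dimensions or Kaggle corpora.
records = [
    ["Genomics", "Machine Learning", "Transformers"],
    ["genomics", "machine learning"],
    ["Transformers", "Genomics"],
]

edges = cooccurrence_edges(records)
for pair, weight in edges.most_common():
    print(pair, weight)
```

Sorting the keywords inside each record keeps every pair in a canonical order, so ("genomics", "transformers") and ("transformers", "genomics") are counted as the same edge.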

Results

Publication growth follows a logistic (S-shaped) model (AIC = 60.35, R² = 0.998), indicating field maturation. The co-authorship network exhibits a high average clustering coefficient (0.946), confirming a “small-world” structure. Thematic analysis identified 10 distinct topics, with “Core Machine Learning Models” acting as the primary motor theme. A key integrative finding is a measurable diffusion lag for novel architectures like Transformers, where their popularization on Kaggle precedes widespread academic adoption (p<0.05 in lead-lag analysis). Furthermore, open data sharing was found to have a statistically significant positive effect on citation impact (p=0.047).
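For readers unfamiliar with the clustering statistic reported above, the following minimal sketch computes an average local clustering coefficient for an undirected graph stored as an adjacency dictionary. It assumes the common convention that nodes with fewer than two neighbours contribute zero; the abstract does not state which convention or software the authors used, and the toy graph is hypothetical.

```python
from itertools import combinations


def average_clustering(adjacency):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


# Hypothetical co-authorship graph: a triangle A-B-C plus D, who only worked with C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

avg = average_clustering(graph)
print(round(avg, 4))
```

Values near 1 are typical of co-authorship networks, because every multi-author paper forms a fully connected clique among its authors. A small-world diagnosis usually also compares the average shortest path length against a random graph of the same size.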

Conclusions

The integration of practical competition data provides a more nuanced view of a field’s trajectory, revealing innovation pathways and translational gaps not visible from academic data alone. This dual-source framework serves as a valuable model for future bibliometric studies of rapidly evolving, application-driven scientific domains.

Version published to 10.1101/2025.10.24.684317 on bioRxiv
Oct 24, 2025