Quantifying data reuse in proteomics using PRIDE downloads statistics and a semi-supervised LLM-based framework
Abstract
Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021-2025), spanning 35,528 datasets accessed from 235 locations. The infrastructure includes nf-downloadstats, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning framework that classifies traffic into bots, institutional download hubs, and independent user downloads. DeepLogBot combines heuristic seed selection with multi-LLM annotation (Claude and Qwen3) to produce gold-standard training labels, achieving 92.2% bot classification accuracy on a held-out test set. After separating bot traffic, analysis reveals downloads from 214 countries/regions, 249 institutional download hubs, and a concentrated reuse distribution, with the top five countries (United States, United Kingdom, Germany, China, and Canada) accounting for over 54% of independent user downloads. These findings provide actionable insights for repository infrastructure planning and highlight the importance of distinguishing automated from individual access in scientific data resources.
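The concentration figure quoted above (the top five countries' share of independent user downloads) can be illustrated with a minimal sketch. This is not the nf-downloadstats or DeepLogBot code. The record fields `country` and `category`, the label values `"user"` and `"bot"`, and the function name `top_country_share` are all assumptions made for illustration, and the records are toy data, not PRIDE statistics.

```python
from collections import Counter


def top_country_share(records, n=5, category="user"):
    """Return the n most frequent countries within one traffic category
    and the fraction of that category's downloads they account for.

    `records` is an iterable of dicts with hypothetical keys
    "country" and "category" (assumed schema, not the PRIDE log format).
    """
    # Count downloads per country, keeping only the requested category
    # so that bot traffic does not inflate the totals.
    counts = Counter(r["country"] for r in records if r["category"] == category)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total
    return [name for name, _ in top], share


# Toy records invented for illustration only.
records = (
    [{"country": "US", "category": "user"}] * 4
    + [{"country": "GB", "category": "user"}] * 3
    + [{"country": "DE", "category": "user"}] * 2
    + [{"country": "FR", "category": "user"}] * 1
    + [{"country": "US", "category": "bot"}] * 50
)

countries, share = top_country_share(records, n=2)
print(countries, round(share, 2))
```

Filtering by category before counting is the point of the sketch: in the toy data the 50 bot records would otherwise dominate the totals, which is the distortion the abstract says bot separation is meant to avoid.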