Quantifying data reuse in proteomics using PRIDE downloads statistics and a semi-supervised LLM-based framework

Suresh Hewapathirana
Jingwen Bai
Chakradhar Bandla
Selvakumar Kamatchinathan
Deepti J Kundu
Nithu Sara John
Boma Brown-Harry
Nandana Madhusoodanan
Joan Marc Riera Duocastella
Juan Antonio Vizcaíno
Yasset Perez-Riverol


Abstract

Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here, we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021-2025), spanning 35,528 datasets accessed from 235 locations. The infrastructure includes nf-downloadstats, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning framework that classifies traffic into bots, institutional download hubs, and independent user downloads. DeepLogBot combines heuristic seed selection with multi-LLM annotation (Claude and Qwen3) to produce gold-standard training labels, achieving 92.2% bot classification accuracy on a held-out test set. After bot traffic is separated out, the analysis reveals downloads from 214 countries/regions, 249 institutional download hubs, and a concentrated reuse distribution, with the top five countries (United States, United Kingdom, Germany, China, and Canada) accounting for over 54% of independent user downloads. These findings provide actionable insights for repository infrastructure planning and highlight the importance of distinguishing automated from individual access in scientific data resources.
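
The concentration figure in the abstract (the share of independent user downloads held by the top five countries) is a simple top-k ratio computed after traffic has been labelled. The following is a minimal sketch of that arithmetic only, not the authors' pipeline: the function name `top_k_share`, the category labels `"user"`, `"bot"`, and `"hub"`, and the toy records are all assumptions made for illustration, and the numbers are invented.

```python
from collections import Counter


def top_k_share(records, k=5, category="user"):
    """Return (top-k countries, their combined share) for one traffic category.

    `records` is an iterable of (country, category) pairs. The category
    labels are hypothetical stand-ins for the three traffic classes named
    in the abstract.
    """
    # Count downloads per country, keeping only the requested category.
    counts = Counter(country for country, cat in records if cat == category)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(k)
    share = sum(n for _, n in top) / total
    return [country for country, _ in top], share


# Toy, made-up records: bot and hub traffic dominates the raw counts
# but is excluded from the independent-user share.
records = (
    [("US", "user")] * 4
    + [("GB", "user")] * 3
    + [("DE", "user")] * 2
    + [("FR", "user")] * 1
    + [("US", "bot")] * 50
    + [("CN", "hub")] * 20
)

countries, share = top_k_share(records, k=2)
print(countries, round(share, 2))  # ['US', 'GB'] 0.7
```

In this toy data the bot and hub records outnumber user records seven to one, so a share computed on raw counts would describe automated traffic rather than individual reuse, which is why the filtering step comes first.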

Version published to 10.64898/2026.04.16.718670 on bioRxiv
Apr 23, 2026

Related articles

Improving package annotation in metabolomics and proteomics via robust, ontology-driven LLM integration

This article has 8 authors:
1. Sebastian Lobentanzer
2. Helge Hecht
3. Vincent J Carey
4. Maria A Doyle
5. Alban Gaignard
6. Hervé MENAGER
7. Júlia Mir
8. Claire Rioualen
This article has no evaluations. Latest version: Apr 14, 2026

Revealing the Hidden Landscape of Public Metabolomics Data Reuse in MetaboLights

This article has 3 authors:
1. Ibrahim Karaman
2. Thomas Payne
3. Juan Antonio Vizcaíno
This article has no evaluations. Latest version: May 5, 2026

All Models are Wrong, Some are Annotated: Automating Metadata in Biomedical Repositories

This article has 3 authors:
1. Inessa Cohen
2. Hongyi Yu
3. Robert A. McDougal
This article has no evaluations. Latest version: Apr 27, 2026
