Hugging Face as a Data Space for Agricultural Datasets: A PRISMA-Based Systematic Analysis

Alexander Rachmann
Hendrik Poschmann
Lucas Weißbeck

Abstract

(1) Background: Hugging Face is one of the largest platforms for machine-learning datasets, hosting collections of all kinds beyond its core focus on natural language processing. Whether and how these datasets can be leveraged for agricultural informatics is an open question. (2) Methods: A systematic data-space analysis structured by the PRISMA 2020 methodology was conducted. Using the search terms “farming” and “agriculture”, 128 datasets were identified on the platform, of which 126 could be fully analysed. (3) Results: Crops are the most common topic (42 % of datasets). English dominates (71 %); 13 languages are represented in total. The distribution of dataset sizes is strongly right-skewed (mean 156,346 entries; median 1,000). Parquet is the most common format (43 %); 92 % of datasets appear to contain human–LLM dialogues. (4) Conclusions: The available agricultural datasets on Hugging Face are heterogeneous in both topic and quality. Future work should develop prototypes to test whether the available datasets are usable as a data basis for crop-related applications and to identify potential gaps in the data space.
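The identification step described in the Methods can be sketched in Python. This is a minimal illustration, not the authors' procedure: the abstract does not say whether the search was run through the website or an API, and the merging of the two result lists with duplicates removed is an assumption. The names `collect_dataset_ids` and `hub_search` are hypothetical; `HfApi.list_datasets(search=...)` is the dataset listing call of the `huggingface_hub` library.

```python
from typing import Callable, Iterable, List, Sequence

# Search terms quoted in the abstract.
SEARCH_TERMS = ("farming", "agriculture")


def collect_dataset_ids(
    search: Callable[[str], Iterable[str]],
    terms: Sequence[str] = SEARCH_TERMS,
) -> List[str]:
    """Run one search per term and merge the hits.

    Assumption: a dataset matching both terms is counted once.
    First-seen order is preserved so the result is reproducible.
    """
    seen = set()
    ordered = []
    for term in terms:
        for dataset_id in search(term):
            if dataset_id not in seen:
                seen.add(dataset_id)
                ordered.append(dataset_id)
    return ordered


def hub_search(term: str) -> Iterable[str]:
    """Hypothetical adapter that queries the Hugging Face Hub.

    Imported lazily so the merging logic above can be used and
    tested without network access or the huggingface_hub package.
    """
    from huggingface_hub import HfApi

    return (info.id for info in HfApi().list_datasets(search=term))
```

Calling `collect_dataset_ids(hub_search)` would query the live Hub, so the number of hits will drift over time and need not match the 128 datasets reported above.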

Version published to 10.20944/preprints202603.2043.v1
Mar 27, 2026

Related articles

ISGD : A Dataset for Demographically-Aware Facial Analysis and Privacy-First Skincare Recommendation

This article has 5 authors:
1. Shreyansh Mishra
2. Himal Rana
3. Ankit Yadav
4. Chirag Bhut
5. Tanmoy Hazra
This article has no evaluations. Latest version Mar 13, 2026

A Multi-Modal Dataset for Automated Phenological Stage Mapping in Actinidia chinensis

This article has 9 authors:
1. Isabel Pinheiro
2. Pedro Moura
3. Leandro Rodrigues
4. Germano Moreira
5. Rui Manuel Coutinho
6. Francisco Terra
7. António Valente
8. Mário Cunha
9. Filipe Neves dos Santos
This article has no evaluations. Latest version Mar 13, 2026

NSCH-Flourishing-ML: A Curated Dataset and Reproducible Pipeline for Machine Learning Analysis of Child Flourishing

This article has 4 authors:
1. Miguel Arcos-Argudo
2. Rodolfo Bojorque
3. Fernando Pesántez
4. Kely Nieto-Andrade
This article has no evaluations. Latest version Mar 24, 2026
