Hugging Face as a Data Space for Agricultural Datasets: A PRISMA-Based Systematic Analysis
Abstract
(1) Background: Hugging Face is one of the largest platforms for machine-learning datasets, hosting collections of all kinds beyond its core focus on natural language processing. Whether and how these datasets can be leveraged for agricultural informatics remains an open question. (2) Methods: A systematic data-space analysis structured according to the PRISMA 2020 methodology was conducted. Using the search terms “farming” and “agriculture”, 128 datasets were identified on the platform, of which 126 could be fully analysed. (3) Results: Crops are the most frequent topic (42 %). English dominates (71 %), and 13 languages are represented in total. The distribution of dataset sizes is strongly right-skewed (mean 156,346 entries; median 1,000). Parquet is the most common format (43 %); 92 % of datasets appear to contain human–LLM dialogues. (4) Conclusions: The agricultural datasets available on Hugging Face are thematically and qualitatively heterogeneous. Future work should develop prototypes to test whether these datasets are usable as a data basis for crop-related applications and to identify potential gaps in the data space.
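The reported gap between mean (156,346 entries) and median (1,000), a ratio of roughly 156, is what a strongly right-skewed size distribution looks like in summary statistics. The following minimal sketch illustrates the effect. The sizes in `sizes` are hypothetical values chosen for illustration and are not taken from the analysed datasets; `skew_summary` is likewise an illustrative helper, not part of the study's method.

```python
from statistics import mean, median

# Hypothetical dataset sizes (number of entries), NOT data from the study:
# many small datasets plus a few very large ones.
sizes = [200, 500, 800, 1_000, 1_000, 1_200, 3_000, 50_000, 2_000_000]


def skew_summary(values):
    """Return the mean, the median, and the mean-to-median ratio."""
    m = mean(values)
    med = median(values)
    return {"mean": m, "median": med, "mean_to_median": m / med}


summary = skew_summary(sizes)
print(summary)
```

A single very large dataset is enough to pull the mean orders of magnitude above the median, which is why the median is the more informative measure of a typical dataset size here.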