Hugging Face as a Data Space for Agricultural Datasets: A PRISMA-Based Systematic Analysis




Abstract

(1) Background: Hugging Face is one of the largest platforms for machine-learning datasets, hosting collections of all kinds beyond its core focus on natural language processing. Whether and how these datasets can be leveraged for agricultural informatics is an open question. (2) Methods: A systematic data-space analysis structured by the PRISMA 2020 methodology was conducted. Using the search terms “farming” and “agriculture”, 128 datasets were identified on the platform, of which 126 could be fully analysed. (3) Results: Crops are the most frequently covered topic (42 %). English dominates (71 %); 13 languages are represented in total. The distribution of dataset sizes is strongly right-skewed (mean 156,346 entries; median 1,000). Parquet is the most common format (43 %); 92 % of datasets appear to contain human–LLM dialogues. (4) Conclusions: The available agricultural datasets on Hugging Face are heterogeneous in both topic and quality. Future work should develop prototypes to test whether the available datasets can serve as a data basis for crop-related applications and to identify potential gaps in the data space.
