Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF


Abstract

At BioHackathon 2024 in Fukushima, we explored ways to improve metadata management for machine learning datasets by developing a tool that converts the Croissant ML format (Akhtar et al., 2024) from MLCommons, which is implemented on Hugging Face (The AI Community Building the Future, n.d.) and expressed in JSON-LD, into RDF (Manola et al., 2004) serialized as Turtle. This allowed us to load the data into triple stores such as QLever and Apache Jena Fuseki, improving interoperability and querying capabilities. RDF’s graph-based structure and SPARQL querying enable advanced integration and analysis across heterogeneous datasets, and the conversion addresses challenges in the Croissant schema such as blank nodes, undefined ontologies, and non-resolvable URIs. Although Croissant aligns well with ML workflows, its limited connection to controlled vocabularies and lack of resolvable URLs highlight interoperability challenges compared with other metadata standards. Feedback emphasized the need for more detailed metadata specifications, meaningful annotations, and extensibility, while recognizing Croissant’s potential for advancing metadata management in machine learning and bioinformatics. This approach underscores the value of standardizing and extending metadata frameworks and tools to facilitate dataset discovery, reproducibility, and integration across diverse domains.
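
To make the conversion step concrete, the following is a minimal, standard-library-only sketch of how a Croissant-style JSON-LD record might be flattened into N-Triples lines (N-Triples is a subset of Turtle, so the output can be loaded wherever Turtle is accepted). It is not the tool described in the abstract, and it is not a conformant JSON-LD processor: it handles only a small subset of the syntax, and a real pipeline would more likely rely on a full JSON-LD library. The abstract names blank nodes as a challenge without saying how they were handled; the sketch assumes one common option, skolemization, in which nested nodes without an `@id` receive a deterministic IRI. The prefix map, `DEFAULT_VOCAB`, `SKOLEM_BASE`, the example record, and all function names are assumptions made for illustration.

```python
import hashlib
import json

# Assumed namespace bindings for this sketch; a real document declares its own @context.
PREFIXES = {
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}
DEFAULT_VOCAB = "https://schema.org/"
SKOLEM_BASE = "https://example.org/.well-known/genid/"  # hypothetical base for minted IRIs
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


def expand(term):
    """Expand a compact term such as 'cr:FileObject' or 'name' to a full IRI."""
    if "://" in term:
        return term
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return DEFAULT_VOCAB + term


def literal(value):
    """Serialize a JSON scalar as an N-Triples literal."""
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value}"^^<{XSD}double>'
    # json.dumps yields a double-quoted string with backslash escapes.
    return json.dumps(str(value), ensure_ascii=False)


def node_iri(node, base, path):
    """Use the node's @id, or mint a stable IRI instead of a blank node."""
    node_id = node.get("@id")
    if node_id is None:
        digest = hashlib.sha256(f"{base}#{path}".encode("utf-8")).hexdigest()[:16]
        return SKOLEM_BASE + digest
    # Relative identifiers are resolved against the caller-supplied base.
    return node_id if "://" in node_id else base + node_id


def triples(node, base, path="root"):
    """Yield N-Triples lines for a nested JSON-LD-like dict."""
    subject = node_iri(node, base, path)
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        predicate = RDF_TYPE if key == "@type" else expand(key)
        for i, item in enumerate(value if isinstance(value, list) else [value]):
            if key == "@type":
                obj = f"<{expand(item)}>"
            elif isinstance(item, dict):
                child_path = f"{path}/{key}/{i}"
                obj = f"<{node_iri(item, base, child_path)}>"
                yield from triples(item, base, child_path)
            else:
                obj = literal(item)
            yield f"<{subject}> <{predicate}> {obj} ."


# Made-up record loosely shaped like Croissant metadata.
doc = {
    "@type": "sc:Dataset",
    "@id": "https://example.org/datasets/demo",
    "name": "demo",
    "distribution": [{"@type": "cr:FileObject", "name": "data.csv"}],
}
lines = list(triples(doc, base="https://example.org/datasets/demo/"))
print("\n".join(lines))
```

Because the minted IRIs are derived from the base and the node's position in the document, repeated conversions of the same record produce the same identifiers, which keeps reloads into a triple store from multiplying anonymous nodes. The sketch does not escape characters that are illegal inside IRIs and ignores `@context` entirely, so it should be read as an outline of the idea, not a drop-in converter.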
