Automatic extraction and structuring of cultural heritage analysis process documentation from audio and text files
Abstract
Heritage science (HS) is an interdisciplinary field where collective knowledge emerges through an ongoing interplay between material objects and a wide range of research approaches spanning both the humanities and the experimental sciences. In this domain, data management challenges are compounded by the strong heterogeneity of the documentary sources, analytical data, and processes mobilized for condition reporting, analysis, monitoring, or conservation purposes. Provenance metadata and paradata are essential for ensuring data reliability. Such documentation provides invaluable information on acquisition contexts and subsequent reuse possibilities. However, producing it rigorously is time-consuming, as the required information is diverse, context-dependent, and increasingly difficult to recover as time passes. In light of the massive daily data production in this field, developing methods to streamline data enrichment procedures is a clear priority. To address the risk of losing large amounts of undocumented data, the METAREVE project proposes a lightweight solution to help HS communities extract the key descriptive elements needed for a minimal understanding of the data. Based on Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), it takes the form of a web application that automatically documents scientific activities related to cultural heritage, drawing on common outputs such as expert reports, research articles, or even audio recordings of in situ acquisition processes. This approach has been implemented within the digital ecosystem developed by the EquipEx+ ESPADON project.