Developing a Dataset and Benchmark for Poetry Generation in Low-Resource Languages
Abstract
This study presents the development of a comprehensive dataset and benchmark for poetry generation in low-resource languages, addressing a critical gap in natural language processing (NLP) and creative AI. While transformer-based models have demonstrated remarkable capabilities in text generation, their application to low-resource languages remains underexplored. This research curates a diverse corpus of poetic texts across various forms and themes, encompassing the cultural nuances and linguistic features unique to these languages. We outline the methodologies employed in dataset collection, including the selection of representative poetic forms and the involvement of native speakers to ensure authenticity and richness. Additionally, we establish a set of evaluation metrics tailored to assessing the quality of generated poetry, focusing on thematic coherence, stylistic diversity, and adherence to poetic structures. Preliminary experiments with transformer architectures reveal both the potential and the challenges of generating poetry in low-resource contexts, highlighting how dataset quality shapes model performance. The findings underscore the need for further research in this domain and advocate for inclusive, representative datasets that can enhance the creative capabilities of AI across diverse linguistic landscapes. This work not only advances NLP but also fosters cultural preservation and appreciation through the lens of generative poetry.
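To make the evaluation axes named above concrete, the sketch below shows two simple automatic proxies one might compute over generated poems: a distinct-n ratio as a rough indicator of stylistic diversity, and a stanza-length check as a rough indicator of adherence to a fixed poetic structure. This is an illustrative sketch only; the specific metrics, thresholds, and function names (distinct_n, stanza_adherence, lines_per_stanza) are assumptions for exposition and are not drawn from the paper's actual benchmark.

```python
# Illustrative sketch (assumptions, not the paper's metrics): two cheap
# automatic proxies for the abstract's evaluation axes.
#   - distinct_n: ratio of unique n-grams to total n-grams (stylistic diversity proxy)
#   - stanza_adherence: fraction of stanzas matching an expected line count
#     (structural-adherence proxy for forms with fixed stanza lengths)

def distinct_n(lines: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a poem's lines."""
    ngrams = []
    for line in lines:
        tokens = line.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def stanza_adherence(poem: str, lines_per_stanza: int = 4) -> float:
    """Fraction of blank-line-separated stanzas that have the expected number of lines."""
    stanzas = [s.strip().splitlines() for s in poem.split("\n\n") if s.strip()]
    if not stanzas:
        return 0.0
    return sum(len(s) == lines_per_stanza for s in stanzas) / len(stanzas)


if __name__ == "__main__":
    # Hypothetical generated poem: one quatrain.
    poem = "the river keeps its silence\nunder a paper moon\nthe river keeps its promise\nto vanish into noon"
    print(f"distinct-2: {distinct_n(poem.splitlines(), n=2):.2f}")
    print(f"stanza adherence (4 lines): {stanza_adherence(poem, lines_per_stanza=4):.2f}")
```

Surface-level checks like these would complement, not replace, the human judgments of thematic coherence that native-speaker involvement in the dataset is meant to support.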