Developing a Dataset and Benchmark for Poetry Generation in Low-Resource Languages
Abstract
This study presents the development of a comprehensive dataset and benchmark for poetry generation in low-resource languages, addressing a critical gap in natural language processing (NLP) and creative AI. While transformer-based models have demonstrated remarkable capabilities in text generation, their application to low-resource languages remains underexplored. This research curates a diverse corpus of poetic texts across various forms and themes, encompassing the cultural nuances and linguistic features unique to these languages. We outline the methodologies employed in dataset collection, including the selection of representative poetic forms and the involvement of native speakers to ensure authenticity and richness. Additionally, we establish a set of evaluation metrics tailored to assessing the quality of generated poetry, focusing on thematic coherence, stylistic diversity, and adherence to poetic structures. Preliminary experiments with transformer architectures reveal both the potential and the challenges of generating poetry in low-resource contexts, highlighting how dataset quality shapes model performance. The findings underscore the need for further research in this domain and advocate for inclusive, representative datasets that can enhance the creative capabilities of AI across diverse linguistic landscapes. This work not only advances NLP but also fosters cultural preservation and appreciation through the lens of generative poetry.
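To make the evaluation axes named above concrete, the sketch below shows two simple automatic proxies one might compute over generated poems: a distinct-n ratio as a rough indicator of stylistic diversity, and a stanza-length check as a rough indicator of adherence to a fixed poetic structure. This is an illustrative sketch only; the specific metrics, thresholds, and function names (distinct_n, stanza_adherence, lines_per_stanza) are assumptions for exposition and are not drawn from the paper's actual benchmark.

```python
# Illustrative sketch (assumptions, not the paper's metrics): two cheap
# automatic proxies for the abstract's evaluation axes.
#   - distinct_n: ratio of unique n-grams to total n-grams (stylistic diversity proxy)
#   - stanza_adherence: fraction of stanzas matching an expected line count
#     (structural-adherence proxy for forms with fixed stanza lengths)

def distinct_n(lines: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a poem's lines."""
    ngrams = []
    for line in lines:
        tokens = line.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def stanza_adherence(poem: str, lines_per_stanza: int = 4) -> float:
    """Fraction of blank-line-separated stanzas that have the expected number of lines."""
    stanzas = [s.strip().splitlines() for s in poem.split("\n\n") if s.strip()]
    if not stanzas:
        return 0.0
    return sum(len(s) == lines_per_stanza for s in stanzas) / len(stanzas)


if __name__ == "__main__":
    # Hypothetical generated poem: one quatrain.
    poem = "the river keeps its silence\nunder a paper moon\nthe river keeps its promise\nto vanish into noon"
    print(f"distinct-2: {distinct_n(poem.splitlines(), n=2):.2f}")
    print(f"stanza adherence (4 lines): {stanza_adherence(poem, lines_per_stanza=4):.2f}")
```

Surface-level checks like these would complement, not replace, the human judgments of thematic coherence that native-speaker involvement in the dataset is meant to support.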