BantuLM: Enhancing Cross-Lingual Learning in the Bantu Language Family

Abstract

This paper outlines methods for improving Natural Language Processing for the Bantu language family. We trained a Large Language Model based on Bidirectional Encoder Representations from Transformers (BERT) to understand 18 Bantu languages. Specifically, we pre-trained the model on an unsupervised corpus obtained through pseudo-labeling. This pre-training task aims to capture the latent structures of these languages through an attention mechanism that enables a deeper understanding of context. We then conducted experiments on five downstream tasks: Language Identification, Sentiment Analysis, News Classification, Named Entity Recognition, and Text Summarization. Finally, we tested the effectiveness of multilingual pre-training on a few closely related languages, rather than leveraging a vast amount of data from many languages that are not necessarily related. In particular, we conducted experiments on unseen languages belonging to the Bantu family and found that the model understands them better owing to their similarity to the languages used for pre-training.
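As a rough illustration of the kind of pre-training described above (not the authors' actual code), the sketch below shows masked-language-model pre-training of a small BERT-style encoder on a multilingual corpus using Hugging Face transformers. The corpus file name, tokenizer path, and hyperparameters are assumptions made for the example.

```python
# Illustrative sketch: masked-language-model pre-training of a BERT-style encoder
# on a multilingual Bantu corpus. Paths and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed: a plain-text file with one sentence per line, mixing the 18 Bantu languages.
dataset = load_dataset("text", data_files={"train": "bantu_corpus.txt"})

# Assumed: a WordPiece tokenizer previously trained on the same corpus and saved locally.
tokenizer = BertTokenizerFast.from_pretrained("bantu-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# A small BERT-style configuration; the paper's exact model sizes are not specified here.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
)
model = BertForMaskedLM(config)

# Standard masked-language-modeling objective: 15% of tokens are randomly masked.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bantulm-mlm",
        per_device_train_batch_size=32,
        num_train_epochs=3,
    ),
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```

The resulting encoder could then be fine-tuned on each downstream task (e.g. sequence classification for Language Identification or News Classification) in the usual way.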
