Afaan Oromo News Text Classification Using Deep Learning

Abstract

The growth of the internet has significantly increased the availability and accessibility of Afaan Oromo texts online. As the volume of these resources grows, so does the demand for more effective methods to find, filter, and organize them. Automatic text classification offers a viable solution. Text classification, also known as text categorization, is the process of assigning predefined labels to text documents. This study uses deep learning algorithms with word embeddings to classify Afaan Oromo news texts. Because feature extraction from news articles is often complex, deep learning provides a more effective approach than traditional methods. Earlier approaches typically relied on the bag-of-words model, which represents a text as a set of isolated words and ignores word order, an important factor in news classification. Although these earlier models had relatively low time complexity, they failed to capture the context and semantic relationships between words, and their accuracy declined significantly as the number of features and classes increased. The study uses a dataset of 6,110 newly collected and annotated news articles for model training. In addition, about 1.73 million unannotated words (1,731,856) were scraped from the Afaan Oromo news domain to build a pre-trained word embedding model. The data were prepared with standard text preprocessing steps: normalization, tokenization, cleaning, and stop-word removal. For word representation, the Word2Vec embedding model, which predicts probabilistic word contexts, was selected because it gave higher accuracy than FastText and the other embedding approaches considered. Finally, the performance of the developed models was evaluated and compared. The CNN model achieved the highest accuracy (98.4%) with a precision of 98.4%, while the LSTM and BiLSTM models attained accuracies of 95% and 97.28%, with corresponding precisions of 94% and 97.36%, respectively.
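
The preprocessing steps named in the abstract, and the reason a bag-of-words representation loses word order, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tokenization rule, the example phrases, and the three-word stop list are all hypothetical placeholders chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, illustrative stop-word list; NOT the list used in the study.
STOP_WORDS = {"fi", "kan", "akka"}


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, tokenize, clean, and remove stop words (assumed rules)."""
    # Normalization: lowercase and map the curly apostrophe to a plain one.
    normalized = text.lower().replace("\u2019", "'")
    # Tokenization and cleaning: keep runs of Latin letters, allowing
    # word-internal apostrophes; digits and punctuation are dropped.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)
    # Stop-word removal.
    return [t for t in tokens if t not in stop_words]


def bag_of_words(tokens):
    """Count tokens; the position of each token is discarded."""
    return Counter(tokens)


# Two token sequences that differ only in word order.
a = preprocess("Oduu FI ispoortii, 2024!")
b = preprocess("ispoortii oduu")

same_sequence = (a == b)                           # order is preserved in the lists
same_bag = (bag_of_words(a) == bag_of_words(b))    # order is lost in the counts
```

Here `a` and `b` differ as sequences but map to the same bag of words, which is the limitation that motivates sequence-aware models over word embeddings.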
