Multiclass Text Classifications of Sindhi Newspaper Articles

Abstract

The classification of newspaper articles into predefined categories is a key challenge in Natural Language Processing (NLP), particularly for underexplored languages like Sindhi, which present unique linguistic complexities. This study developed a custom-curated Sindhi newspaper dataset containing 6,156 articles categorized into entertainment, sports, and technology. The dataset underwent rigorous preprocessing, including tokenization and normalization, to enhance model performance. Four deep learning models, namely Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and a hybrid CNN-LSTM model, were trained using optimized hyperparameters and evaluated using metrics such as accuracy, precision, and recall. Each model was trained using an 80-20 train-test split, and early stopping was employed to mitigate overfitting. The CNN and the hybrid CNN-LSTM models achieved the highest accuracy, 96%, effectively capturing spatial and sequential patterns. The LSTM followed closely at 95.85%, while the RNN lagged at 67%, highlighting its limitations with long-term dependencies. These results underline the potential of hybrid architectures and advanced sequence models for text classification tasks in low-resource languages like Sindhi. Source Code: https://github.com/rajavavek/Multiclass-Classification-of-Sindhi-Newspaper-Article.
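
The evaluation protocol named in the abstract (an 80-20 train-test split, early stopping, and accuracy, precision, and recall) can be illustrated with a minimal, library-free Python sketch. It is not taken from the linked repository: the function names, the random seed, the patience value, and the toy labels are all assumptions made for illustration, and the abstract does not say how the split was shuffled or which early-stopping criterion was used.

```python
import random

# The three categories named in the abstract.
CLASSES = ["entertainment", "sports", "technology"]


def split_indices(n_items, test_fraction=0.2, seed=0):
    """Shuffle item indices and hold out test_fraction of them for testing.

    Assumption: a plain shuffled split; the seed is arbitrary.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (patience=3 is an assumed value).
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


def evaluate(y_true, y_pred, classes=CLASSES):
    """Return overall accuracy and per-class (precision, recall)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    report = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[c] = (precision, recall)
    return accuracy, report


if __name__ == "__main__":
    train_idx, test_idx = split_indices(10)
    print(len(train_idx), len(test_idx))

    # Toy predictions, not results from the study.
    y_true = ["sports", "sports", "technology", "entertainment"]
    y_pred = ["sports", "technology", "technology", "entertainment"]
    print(evaluate(y_true, y_pred))
```

Reporting precision and recall per class alongside accuracy shows which category a model confuses, which a single accuracy figure cannot.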
