Ekantipur-15Y: A Longitudinal Benchmark Corpus and Semantic Analysis of Nepali News (2010 - 2025)

Diwash Mainali
Utsav Mainali

Abstract

This paper introduces Ekantipur-15Y, a large-scale longitudinal corpus of Nepali news articles spanning 2010 to 2025. Because Nepali is a low-resource language, the lack of a clean, temporally diverse dataset has been a barrier to the development of robust Natural Language Processing (NLP) models. We collected and cleaned 109,704 unique articles, totaling approximately 14.3 million tokens, from Ekantipur. The corpus is validated using Zipf's law, which confirms its linguistic integrity, and Heaps' law, which shows that the vocabulary keeps growing without plateauing. Furthermore, semantic analysis detects major historical events in Nepal, including the 2015 earthquake and the COVID-19 pandemic, which supports the accuracy of the dataset. Finally, we establish a baseline for text classification, in which a Linear Support Vector Machine (SVM) achieves an accuracy of 74.50%, significantly outperforming Naive Bayes and Logistic Regression.
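
The Zipf's law and Heaps' law checks mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (fit_line, zipf_exponent, heaps_parameters), the sampling step, and the assumption that the corpus is already tokenized into a list of strings are all hypothetical choices made for illustration. The sketch fits a straight line in log-log space, which is one common way to estimate the Zipf exponent s in f(r) ≈ C / r^s and the Heaps parameters K and β in V(n) ≈ K · n^β.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def zipf_exponent(tokens):
    """Estimate s in f(r) ~ C / r**s from the rank-frequency curve."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_rank = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freq = [math.log(f) for f in freqs]
    slope, _ = fit_line(log_rank, log_freq)
    return -slope  # the fitted slope is negative for a Zipfian corpus


def heaps_parameters(tokens, step=1000):
    """Estimate (K, beta) in V(n) ~ K * n**beta.

    Vocabulary size is sampled every `step` tokens (an assumed value);
    the token list must be long enough to yield at least two samples.
    """
    seen = set()
    log_n, log_v = [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            log_n.append(math.log(i))
            log_v.append(math.log(len(seen)))
    beta, intercept = fit_line(log_n, log_v)
    return math.exp(intercept), beta
```

Under these assumptions, a Zipf exponent near 1 is what natural-language text typically shows, and a Heaps exponent β strictly between 0 and 1 corresponds to a vocabulary that keeps growing, at a decreasing rate, as more text is read.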

Version published to 10.21203/rs.3.rs-8630749/v1 on Research Square
Mar 3, 2026
