Robust Yorùbá Named Entity Recognition through Simple Mixed Training

Abstract

Yorùbá named entity recognition (NER) is sensitive to missing tone marks and to in-line English, both common in Nigerian social and news text. Using the Yorùbá split of MasakhaNER 2.0, we quantify these effects and present a minimal fix. A standard xlm-roberta-base fine-tune scores F1 = 0.832 on the clean test set, falls to 0.584 when diacritics are stripped, and remains stable under light code-switching (0.834), all measured with seqeval. We then train on a 50–50 mix of the original training set and a de-diacritised copy, keeping the BIO tags intact. This lifts the no-diacritics test F1 to 0.842 while keeping the clean score at 0.854 and the code-switch score at 0.857. Per-entity analysis shows the largest gains for DATE and PER. These results align with prior work on Yorùbá diacritic restoration and with evidence that Yorùbá–English code-switching is frequent in Nigeria. We release an end-to-end notebook, hosted on GitHub, to support reuse. The method is simple, cheap, and effective, and can serve as a baseline for responsible NLP on low-resource African languages.
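The released notebook contains the full pipeline; as a rough illustration of the mixing step described above, the sketch below assumes de-diacritisation is done by Unicode NFD decomposition followed by removal of combining marks (the paper's exact normalisation may differ), and the example sentence and its tags are hypothetical.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, underdots) via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Stripping diacritics never changes the token count, so each BIO tag
# can be copied over to the de-diacritised copy unchanged.
tokens = ["Ọjọ́", "Àìkú", "ni", "wọ́n", "dé", "Èkó"]     # illustrative sentence
tags   = ["B-DATE", "I-DATE", "O", "O", "O", "B-LOC"]   # illustrative labels
plain  = [strip_diacritics(t) for t in tokens]           # ['Ojo', 'Aiku', ...]

# 50–50 mix: original training sentences plus their de-diacritised copies.
mixed_tokens = [tokens, plain]
mixed_tags   = [tags, tags]
```

Because the transformation is token-preserving, the mixed set doubles the training data at essentially no annotation cost, which is what makes the fix cheap.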
