Tenyidie Named Entity Recognition - Corpus creation and Machine/Deep Learning applications
Abstract
The Tenyidie language, also known as the Angami language, is a low-resource language belonging to the Tibeto-Burman language family and is considered a major language of Nagaland in the north-eastern part of India. Among the many Natural Language Processing (NLP) tasks, named entity recognition (NER) is an important one: it identifies named entities such as persons, organizations, and locations, and it has applications in areas such as content classification for news providers, recommendation systems, and sentiment analysis. To the best of the authors' knowledge, this is the first attempt at building NER for the Tenyidie language. The main aim of this research is to develop and evaluate an NER-annotated corpus for the Tenyidie language. In this work, an NER-annotated dataset of 10,000 sentences (211,364 tokens) for the Tenyidie language has been created, comprising 5,208 named entities (699 person, 2,089 organization, and 2,420 location entities). This paper also applies Machine Learning and Deep Learning techniques to the created NER dataset. For deep learning, we explored different word embedding methods, namely word2vec, GloVe, fastText, and BERT. In our experiments, we achieved the best F1-score using the BERT-BASE (cased) model. The main contributions of this research are the creation of an NER-annotated dataset for the Tenyidie language and the evaluation of that dataset using different learning techniques such as CRF and BLSTM, as well as the state-of-the-art BERT model.
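As an illustration of how an F1-score for NER can be computed over the three entity classes named in the abstract, the sketch below scores predicted tag sequences against gold ones at the entity level. The abstract does not state the tagging scheme or the exact evaluation protocol, so the BIO scheme, the `PER`/`ORG`/`LOC` labels, the exact-match criterion, and the function names `extract_spans` and `entity_f1` are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((etype, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            # An "I-" tag with no matching open span is treated as a new span.
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact type and boundary match."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy tag sequences only; these are not drawn from the Tenyidie corpus.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```

In this toy case the person entity is found and the location entity is missed, so precision is 1, recall is 0.5, and the score is 2/3. Exact-match scoring is stricter than token-level accuracy, which is one reason it is a common choice for NER, but the paper may use a different protocol.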