Word Segmentation of Ancient Tamil Text extracted from inscriptions

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The absence of word boundaries between words in scriptio continuo languages hinder the development of NLP models for such languages. The objective of this research is to facilitate the building of NLP models for scriptio continuo languages by designing a word segmentation model for predicting word boundaries between letters in sentences, focusing particularly on ancient Tamil scripts. We have utilized a N-gram Naive Bayes model to predict the existence of word boundaries between two characters in a scriptio continuo text. We trained and assessed the model on a dataset of ancient Tamil writing, achieving an accuracy of 91.28%. Efficiently segmenting ancient Tamil texts not only helps preserve and comprehend historical manuscripts, but it also enables advancements in automated text segmentation. This model will assist archaeologists in constructing NLP models utilizing ancient Tamil, allowing for the extraction of significant information from ancient Tamil manuscripts without the need for a language expert. Additional research may be undertaken to examine more effective techniques for word segmentation with better performance, managing scripts from several centuries and developing models for additional languages.

Article activity feed