Word segmentation of ancient Tamil text extracted from inscriptions

S. Sandeep
S. Sanjith
Bharadwaj Sudarsan

Abstract

Scriptio continua scripts do not mark word boundaries, which hinders the development of NLP models for texts written in them. This research aims to make such models easier to build by designing a word segmentation model that predicts word boundaries between characters in a sentence, focusing on ancient Tamil script. We use an n-gram Naive Bayes model to predict whether a word boundary exists between two adjacent characters in a scriptio continua text. We trained and evaluated the model on a dataset of ancient Tamil writing, achieving an accuracy of 91.28%. Segmenting ancient Tamil texts efficiently helps preserve and interpret historical manuscripts and also supports progress in automated text segmentation. The model is intended to assist archeologists in building NLP models for ancient Tamil, allowing significant information to be extracted from ancient Tamil manuscripts without the need for a language expert. Future research could examine segmentation techniques with better performance, the handling of scripts from different centuries, and models for additional scripts.
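
The abstract does not give implementation details, so the following is only a minimal sketch of how an n-gram Naive Bayes boundary classifier could be set up, not the authors' method. The feature design (left context, right context, and the character pair around the gap), the Laplace smoothing, and all names (`gap_features`, `to_examples`, `BoundaryNB`) are assumptions made for illustration. It treats each Python code point as one character; real Tamil text would likely need grouping into grapheme clusters first. The training string is a toy example, not data from the paper.

```python
import math
from collections import Counter


def gap_features(chars, i, n=2):
    """Hypothetical n-gram features for the gap between chars[i] and chars[i + 1]."""
    left = "".join(chars[max(0, i - n + 1): i + 1])   # up to n chars ending at i
    right = "".join(chars[i + 1: i + 1 + n])          # up to n chars starting at i + 1
    return [("L", left), ("R", right), ("LR", chars[i] + chars[i + 1])]


def to_examples(segmented, n=2):
    """Turn a space-segmented sentence into (features, boundary_label) pairs."""
    chars, labels = [], []
    for word in segmented.split():
        for j, ch in enumerate(word):
            chars.append(ch)
            labels.append(j == len(word) - 1)  # True if a boundary follows this char
    # The gap after the final character is not a decision point, so skip it.
    return [(gap_features(chars, i, n), labels[i]) for i in range(len(chars) - 1)]


class BoundaryNB:
    """Naive Bayes over gap features with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.n = 2
        self.class_counts = Counter()
        self.feat_counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def fit(self, sentences, n=2):
        self.n = n
        for sentence in sentences:
            for feats, label in to_examples(sentence, n):
                self.class_counts[label] += 1
                for f in feats:
                    self.feat_counts[label][f] += 1
                    self.vocab.add(f)
        return self

    def _log_score(self, feats, label):
        total = sum(self.class_counts.values())
        # Smoothed class prior, so an unseen class never yields log(0).
        score = math.log((self.class_counts[label] + 1) / (total + 2))
        denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
        for f in feats:
            score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
        return score

    def segment(self, text):
        """Insert a space wherever the boundary class scores higher."""
        chars = list(text)
        out = []
        for i, ch in enumerate(chars):
            out.append(ch)
            if i < len(chars) - 1:
                feats = gap_features(chars, i, self.n)
                if self._log_score(feats, True) > self._log_score(feats, False):
                    out.append(" ")
        return "".join(out)


# Toy usage: learn from one segmented string, then segment unspaced text.
model = BoundaryNB().fit(["ab ab ab ab"])
print(model.segment("abab"))  # ab ab
```

Framing segmentation as a yes/no decision at every gap between characters keeps the classifier small and lets accuracy be measured per gap, which is one plausible reading of the accuracy figure reported above.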

Version published to 10.1038/s40494-025-01612-2: Mar 31, 2025
Version published to 10.21203/rs.3.rs-4901928/v1 on Research Square: Sep 17, 2024
