A Pilot Study on Multilingual Detection of Irregular Migration Discourse on X and Telegram Using Transformer-Based Models

Dimitrios Taranis
Gerasimos Razis
Ioannis Anagnostopoulos

Abstract

The rise of Online Social Networks has reshaped global discourse, enabling real-time conversations on complex issues such as irregular migration. Yet the informal, multilingual, and often noisy nature of content on platforms like X (formerly Twitter) and Telegram presents significant challenges for reliable automated analysis. This study presents an exploratory multilingual natural language processing (NLP) framework for detecting irregular migration discourse across five languages. Conceived as a pilot study addressing extreme data scarcity in sensitive migration contexts, this work evaluates transformer-based models on a curated multilingual corpus. It provides an initial baseline for monitoring informal migration narratives on X and Telegram. We evaluate a broad range of approaches, including traditional machine learning classifiers, SetFit sentence-embedding models, fine-tuned multilingual BERT (mBERT) transformers, and a Large Language Model (GPT-4o). The results show that GPT-4o achieves the highest performance overall (F1-score: 0.84), with scores reaching 0.89 in French and 0.88 in Greek. While mBERT excels in English, SetFit outperforms mBERT in low-resource settings, specifically in Arabic (0.79 vs. 0.70) and Greek (0.88 vs. 0.81). The findings highlight the effectiveness of transformer-based and large-language-model approaches, particularly in low-resource or linguistically heterogeneous environments. Overall, the proposed framework provides an initial, compact benchmark for multilingual detection of irregular migration discourse under extreme, low-resource conditions. The results should be viewed as exploratory indicators of model behavior on this synthetic, small-scale corpus, not as statistically generalizable evidence or deployment-ready tools. 
In this context, “multilingual” refers to robustness across different linguistic realizations of identical migration narratives under translation, rather than coverage of organically diverse multilingual public discourse.
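The per-language F1 comparison reported in the abstract can be illustrated with a minimal sketch. It assumes a binary labelling scheme (1 = irregular migration discourse, 0 = other) and F1 computed on the positive class; the abstract does not state the exact label set or averaging scheme, so both are assumptions. The function names `f1_score` and `f1_by_language` and the toy records are hypothetical and are not data or code from the study.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive=1):
    """F1 on the positive class, from raw true/false positive counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined, so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_language(records):
    """Group (language, gold, predicted) triples by language and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for lang, gold, pred in records:
        grouped[lang][0].append(gold)
        grouped[lang][1].append(pred)
    return {lang: f1_score(gold, pred) for lang, (gold, pred) in grouped.items()}


# Toy, made-up records for illustration only (assumption: binary labels).
toy_records = [
    ("en", 1, 1), ("en", 0, 0), ("en", 1, 0), ("en", 0, 0),
    ("el", 1, 1), ("el", 1, 1), ("el", 0, 1), ("el", 0, 0),
]

scores = f1_by_language(toy_records)
for lang, score in sorted(scores.items()):
    print(f"{lang}: F1 = {score:.2f}")
```

Scoring each language separately, rather than pooling all posts, is what makes comparisons such as SetFit versus mBERT in Arabic or Greek visible; on a corpus this small, each per-language score rests on few examples and should be read with that in mind.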

Version published to 10.3390/electronics15020281
Jan 8, 2026
Version published to 10.20944/preprints202511.1193.v1
Nov 18, 2025

Related articles

Large Language Models for Continual Relation Extraction

This article has 3 authors:
1. Sefika Efeoglu
2. Adrian Paschke
3. Sonja Schimmler
This article has no evaluations
Latest version Jan 6, 2026

Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research

This article has 10 authors:
1. Derguene Mbaye
2. Tatiana D. P. Mbengue
3. Madoune R. Seye
4. Moussa Diallo
5. Mamadou L. Ndiaye
6. Dimitri S. Adjanohoun
7. Djiby Sow
8. Cheikh S. Wade
9. Jean-Claude B. Munyaka
10. Jerome Chenal
This article has no evaluations
Latest version Jan 15, 2026

Part-of-Speech Tagging for the Kangri Language Using CRF and BiLSTM Models: A Comprehensive Comparative Study

This article has 1 author:
1. Prateek Kaushal
This article has no evaluations
Latest version Jan 6, 2026
