A Pilot Study on Multilingual Detection of Irregular Migration Discourse on X and Telegram Using Transformer-Based Models

Abstract

The rise of online social networks has reshaped global discourse, enabling real-time conversations on complex issues such as irregular migration. Yet the informal, multilingual, and often noisy nature of content on platforms like X (formerly Twitter) and Telegram presents significant challenges for reliable automated analysis. This study presents an exploratory multilingual natural language processing (NLP) framework for detecting irregular migration discourse across five languages. Conceived as a pilot study addressing extreme data scarcity in sensitive migration contexts, this work evaluates transformer-based models on a curated multilingual corpus and provides an initial baseline for monitoring informal migration narratives on X and Telegram. We evaluate a broad range of approaches, including traditional machine learning classifiers, SetFit sentence-embedding models, fine-tuned multilingual BERT (mBERT) transformers, and a large language model (GPT-4o). The results show that GPT-4o achieves the highest performance overall (F1-score: 0.84), with scores reaching 0.89 in French and 0.88 in Greek. While mBERT excels in English, SetFit outperforms mBERT in low-resource settings, specifically in Arabic (0.79 vs. 0.70) and Greek (0.88 vs. 0.81). The findings highlight the effectiveness of transformer-based and large-language-model approaches, particularly in low-resource or linguistically heterogeneous environments. Overall, the proposed framework provides an initial, compact benchmark for multilingual detection of irregular migration discourse under extreme low-resource conditions. The results should be viewed as exploratory indicators of model behavior on this synthetic, small-scale corpus, not as statistically generalizable evidence or deployment-ready tools.
In this context, “multilingual” refers to robustness across different linguistic realizations of identical migration narratives under translation, rather than coverage of organically diverse multilingual public discourse.
