Multilingual Detection of Irregular Migration Discourse on X and Telegram Using Transformer-based Models
Abstract
The rise of online social networks has reshaped global discourse, enabling real-time conversations on complex issues such as irregular migration. Yet the informal, multilingual, and often noisy nature of content on platforms like X (formerly Twitter) and Telegram poses significant challenges for reliable automated analysis. This study extends previous work by introducing an expanded multilingual NLP framework for detecting irregular migration discourse at scale. The dataset is enriched to cover five languages (English, French, Greek, Turkish, and Arabic) and newly incorporates Telegram messages, while rule-based annotation is performed using TF-IDF–enhanced multilingual keyword lists. We evaluate a broad range of approaches, including traditional machine learning classifiers, SetFit sentence-embedding models, fine-tuned mBERT transformers, and a large language model (GPT-4o). The results show that GPT-4o achieves the highest performance, with F1-scores reaching 0.91 in French and 0.90 in Greek, while SetFit outperforms mBERT in specific multilingual settings. These findings highlight the effectiveness of transformer-based and large-language-model approaches, particularly in low-resource or linguistically heterogeneous environments. Overall, the proposed framework demonstrates strong potential for multilingual monitoring of migration-related discourse, offering practical value for digital policy, early-warning mechanisms, and crisis informatics.
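As an illustration of the TF-IDF–based keyword-list enrichment the abstract mentions, the sketch below ranks candidate keywords across a small toy corpus by their aggregate TF-IDF weight. This is a minimal, hypothetical reconstruction, not the authors' actual pipeline: the example documents, the smoothed IDF formula, and the function name `tfidf_keywords` are all assumptions made for illustration, and a real pipeline would additionally apply per-language tokenization and stopword filtering before ranking.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=5):
    """Rank terms of a tokenized corpus by summed TF-IDF weight.

    `docs` is a list of tokenized documents (lists of lowercase tokens).
    Returns the `top_k` highest-weighted terms, a simple proxy for the
    keyword-list enrichment step described in the abstract.  Note that
    without stopword removal, frequent function words can still rank
    highly; a production pipeline would filter them per language.
    """
    n = len(docs)
    # Document frequency: how many documents each term appears in.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # Smoothed IDF (as used e.g. by scikit-learn's TfidfVectorizer).
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(top_k)]

# Hypothetical toy corpus standing in for migration-related posts.
docs = [
    "smugglers offered a boat crossing to the island".split(),
    "boat crossing delayed by coast guard patrol".split(),
    "weather forecast for the weekend".split(),
]
print(tfidf_keywords(docs))
```

Terms such as "boat" and "crossing", which recur across the migration-related documents, receive higher aggregate weights than terms confined to a single post, which is the property the keyword-list enrichment exploits.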