A Cross-Modal Attention Framework for Detecting Hidden Online Gambling Promotion in Multilingual Multimodal Watermarked Advertising Images

Abstract

Online gambling operators increasingly evade regulation by concealing promotional content within watermarked advertising images distributed across social media and compromised domains. Traditional text-centric monitoring fails in these scenarios, particularly in multilingual environments where visual obfuscation masks critical semantic cues. This paper proposes a robust hybrid multimodal framework that explicitly models fine-grained interactions between OCR-extracted text and visual structure. Our architecture pairs a Vision Transformer (ViT) for spatial feature encoding with XLM-RoBERTa for cross-lingual semantic representation, integrated via a text-guided cross-modal attention (CMA) mechanism. This allows the model to attend to specific image regions conditioned on extracted textual tokens, effectively uncovering hidden promotional signals. Evaluated on a newly curated dataset of 4,485 manually verified multilingual screenshots, the framework achieves an accuracy of 0.9947 and an F1-score of 0.9947, consistently outperforming late-fusion baselines while matching the strongest unimodal configuration. Our findings show that while visual cues provide strong complementary discriminative signals, CMA ensures robustness against OCR noise and linguistic variation. This study provides a scalable, high-precision solution for cross-border regulatory monitoring in adversarial digital ecosystems.
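To make the fusion step concrete, below is a minimal sketch of a text-guided cross-modal attention module under standard multi-head attention assumptions: OCR text embeddings serve as queries over ViT patch embeddings. The class name, dimensions, and mean-pooling strategy are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TextGuidedCrossModalAttention(nn.Module):
    """Sketch of text-guided CMA: text tokens query visual patches.

    Assumes 768-d features from XLM-RoBERTa (text) and a ViT (vision);
    actual dimensions and fusion details may differ in the paper.
    """

    def __init__(self, text_dim=768, vis_dim=768, hidden_dim=768, num_heads=8):
        super().__init__()
        # Project both modalities into a shared attention space.
        self.q_proj = nn.Linear(text_dim, hidden_dim)   # queries from OCR text tokens
        self.kv_proj = nn.Linear(vis_dim, hidden_dim)   # keys/values from ViT patches
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_tokens, vis_patches):
        # text_tokens: (B, T, text_dim) from XLM-RoBERTa
        # vis_patches: (B, P, vis_dim) from the ViT encoder
        q = self.q_proj(text_tokens)
        kv = self.kv_proj(vis_patches)
        # Each textual token attends to image regions, surfacing visual
        # evidence tied to promotional keywords despite obfuscation.
        attended, _ = self.attn(q, kv, kv)
        fused = self.norm(q + attended)   # residual fusion over text queries
        return fused.mean(dim=1)          # pooled joint representation

# Hypothetical usage: feed the pooled vector to a binary classifier head.
# cma = TextGuidedCrossModalAttention()
# pooled = cma(text_feats, patch_feats)       # (B, 768)
# logits = nn.Linear(768, 2)(pooled)          # gambling-promotion vs. benign
```

Making the text side the query is the natural reading of "text-guided": the extracted tokens decide which image regions matter, rather than fusing modality-level vectors late, which is what the late-fusion baselines do.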