A Cross-Modal Attention Framework for Detecting Hidden Online Gambling Promotion in Multilingual Multimodal Watermarked Advertising Images

Abstract

Online gambling operators increasingly evade regulation by concealing promotional content within watermarked advertising images distributed across social media and compromised domains. Traditional text-centric monitoring fails in these scenarios, particularly in multilingual environments where visual obfuscation masks critical semantic cues. This paper proposes a robust hybrid multimodal framework that explicitly models fine-grained interactions between OCR-extracted text and visual structure. Our architecture pairs a Vision Transformer (ViT) for spatial feature encoding with XLM-RoBERTa for cross-lingual semantic representation, integrated via a text-guided cross-modal attention (CMA) mechanism. This allows the model to attend to specific image regions conditioned on extracted textual tokens, effectively uncovering hidden promotional signals. Evaluated on a newly curated dataset of 4,485 manually verified multilingual screenshots, the framework achieves an accuracy of 0.9947 and an F1-score of 0.9947, consistently outperforming late-fusion baselines while matching the strongest unimodal configuration. Our findings show that while visual cues provide strong complementary discriminative signals, CMA ensures robustness against OCR noise and linguistic variation. This study provides a scalable, high-precision solution for cross-border regulatory monitoring in adversarial digital ecosystems.
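To make the fusion step concrete, below is a minimal sketch of a text-guided cross-modal attention module under standard multi-head attention assumptions: OCR text embeddings serve as queries over ViT patch embeddings. The class name, dimensions, and mean-pooling strategy are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TextGuidedCrossModalAttention(nn.Module):
    """Sketch of text-guided CMA: text tokens query visual patches.

    Assumes 768-d features from XLM-RoBERTa (text) and a ViT (vision);
    actual dimensions and fusion details may differ in the paper.
    """

    def __init__(self, text_dim=768, vis_dim=768, hidden_dim=768, num_heads=8):
        super().__init__()
        # Project both modalities into a shared attention space.
        self.q_proj = nn.Linear(text_dim, hidden_dim)   # queries from OCR text tokens
        self.kv_proj = nn.Linear(vis_dim, hidden_dim)   # keys/values from ViT patches
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_tokens, vis_patches):
        # text_tokens: (B, T, text_dim) from XLM-RoBERTa
        # vis_patches: (B, P, vis_dim) from the ViT encoder
        q = self.q_proj(text_tokens)
        kv = self.kv_proj(vis_patches)
        # Each textual token attends to image regions, surfacing visual
        # evidence tied to promotional keywords despite obfuscation.
        attended, _ = self.attn(q, kv, kv)
        fused = self.norm(q + attended)   # residual fusion over text queries
        return fused.mean(dim=1)          # pooled joint representation

# Hypothetical usage: feed the pooled vector to a binary classifier head.
# cma = TextGuidedCrossModalAttention()
# pooled = cma(text_feats, patch_feats)       # (B, 768)
# logits = nn.Linear(768, 2)(pooled)          # gambling-promotion vs. benign
```

Making the text side the query is the natural reading of "text-guided": the extracted tokens decide which image regions matter, rather than fusing modality-level vectors late, which is what the late-fusion baselines do.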