A Multimodal Phishing Website Detection System Using Explainable Artificial Intelligence Technologies

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The purpose of the present study is to improve the efficiency of phishing web resource detection through multimodal analysis and using methods of explainable artificial intelligence. We propose a late-fusion architecture in which independent specialized models process four modalities and are combined using weighted voting. The first branch uses CatBoost for URL features and metadata; the second uses CNN1D for the symbolic level of URL representation; the third uses a transformer based on pre-trained CodeBERT for the HTML code of the homepage; and the fourth uses EfficientNet-B7 for page screenshot analysis. SHAP, Grad-CAM, and attention matrices are used for interpreting decisions; a local LLM generates a consolidated textual explanation. A prototype system based on a microservice architecture with integration into SOC has been developed. This integration enables streaming processing and reproducible validation. Computational experiments using our own updated dataset and the public MTLP dataset show high performance: F1 score of up to 0.989 on our own dataset and 0.953 on MTLP; multimodal fusion consistently outperforms single-modal baseline models. The practical significance of this approach for zero-day detection and false-positive reduction by aligning features across modalities and their explainability was demonstrated. All limitations and operational aspects (data drift, adversarial resilience, LLM latency) of the proposed prototype were presented. Authors outlined an area for further research.

Article activity feed