Enhance Random Forest Classifier for high accuracy URL phishing detection using lexical structure

Aliyu Ibrahim Sulaiman
Ibrahim Abdullahi Aliyu
Nidhi Tyagi

Abstract

Phishing is one of the most critical threats in cybersecurity. It uses deceptive URLs to mislead users into revealing sensitive information such as login credentials, financial information, and sensitive organizational data, which attackers then exploit for malicious purposes. Blacklist-based, whitelist-based, and heuristic approaches have reduced such attacks, but they struggle with real-time and deceptive URLs. This study presents an Enhanced Random Forest, a machine learning based framework that uses extensive lexical features to detect phishing with high accuracy. The system extracts features from the URL itself, such as digit frequency, number of dots, subdomains, suspicious keywords, and special characters, to identify phishing. It avoids relying on external resources such as domain information and webpage content, which enables lightweight, real-time detection. The system is trained on a comprehensive dataset of both phishing and legitimate URLs, preprocessed to balance the classes, encode the labels, and handle missing values. This enables the system to generalize to previously unseen behaviour that may arise in the future. Hyperparameters are tuned to minimize overfitting and improve generalization. In the experimental evaluation, the enhanced ensemble model outperforms other machine learning classifiers, achieving high accuracy and balanced recall with a reduced false positive rate. The features further highlight influential lexical indicators, which enhances the interpretability of classification decisions. The results indicate that ensemble machine learning combined with strong lexical feature engineering provides a computationally efficient and scalable solution that can be deployed in cybersecurity environments.
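
The lexical features named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_lexical_features`, the keyword list, the special-character set, and the naive subdomain count (which assumes the registered domain is the last two host labels) are all assumptions chosen for illustration. It uses only the Python standard library and never contacts the network.

```python
from urllib.parse import urlsplit

# Hypothetical keyword list and character set; the abstract does not specify them.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")
SPECIAL_CHARACTERS = "@-_=?&%"


def extract_lexical_features(url: str) -> dict:
    """Compute simple lexical features from the URL string alone."""
    # Prefix scheme-less URLs with "//" so urlsplit still parses the host part.
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    lowered = url.lower()
    digit_count = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "digit_count": digit_count,
        "digit_frequency": digit_count / len(url) if url else 0.0,
        "dot_count": url.count("."),
        # Naive assumption: the last two host labels form the registered domain.
        "subdomain_count": max(host.count(".") - 1, 0),
        # Number of distinct suspicious keywords that appear anywhere in the URL.
        "keyword_count": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
        "special_char_count": sum(url.count(ch) for ch in SPECIAL_CHARACTERS),
    }


if __name__ == "__main__":
    example = "http://paypal.com.secure-login.example.net/verify?id=123"
    for name, value in extract_lexical_features(example).items():
        print(f"{name}: {value}")
```

Feature dictionaries of this kind could then be converted to numeric vectors and passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. The abstract does not state which library or hyperparameters the study used.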

Version published to 10.21203/rs.3.rs-9053171/v1 on Research Square
Mar 20, 2026

Related articles

High-Performance Phishing Email Detection Using Hybrid Machine Learning and Deep Learning Approaches

This article has 3 authors:
1. Mohamed Khayati
2. Driss Ait Omar
3. Mohamed Baslam
This article has no evaluations
Latest version Apr 7, 2026

Dual-Input Fusion Deep Learning Framework for URL-Based Phishing Detection

This article has 3 authors:
1. Muhammad Ibrahim Isah
2. Nasir Muhammad Auwa
3. Ruchi Holker
This article has no evaluations
Latest version Apr 13, 2026

Confidence-Aware Pseudo-Labeling via Unsupervised Ensemble Consensus for Fraud Detection Contribution

This article has 4 authors:
1. Daniel Agyekum Amakye
2. Joseph Dadzie
3. Nana Yaw Duodu
4. Albert Mainu Tawiah
This article has no evaluations
Latest version Apr 16, 2026
