Enhance Random Forest Classifier for high accuracy URL phishing detection using lexical structure
Abstract
Phishing is one of the most critical threats in cybersecurity: attackers use URLs to mislead users into revealing sensitive information such as login credentials, financial information, and sensitive organizational data. Blacklist-based, whitelist-based, and heuristic approaches have reduced such attacks, but they struggle with real-time and deceptive URLs. This study presents Enhanced Random Forest, a machine-learning framework that uses extensive lexical features to detect phishing with high accuracy. The system extracts features from the URL itself, such as digit frequency, number of dots, subdomains, suspicious keywords, and special characters, and avoids relying on external services such as domain information and webpage content, which enables lightweight, real-time detection. The system is trained on a comprehensive dataset of both phishing and legitimate URLs, preprocessed to balance the classes, encode the labels, and handle missing values. This enables the system to handle previously unseen behaviour that may arise in the future. Hyperparameters are tuned to minimize overfitting and improve generalization. In the experimental evaluation, the enhanced ensemble model outperforms other machine learning classifiers, achieving high accuracy, balanced recall, and a reduced false positive rate. The lexical indicators further enhance the interpretability of classification decisions. The results validate that combining ensemble machine learning with strong lexical feature engineering yields a computationally efficient and scalable solution that can be deployed in cybersecurity environments.
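
The lexical features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `lexical_features`, the keyword list, the set of special characters, and the "labels minus two" subdomain heuristic are all assumptions made here for illustration, since the abstract does not specify them. The sketch uses only the URL string, in line with the stated goal of avoiding external lookups.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword list; the abstract does not enumerate its keywords.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update", "bank")
# Hypothetical set of special characters to count.
SPECIAL_CHARS = "@-_?=&%"

_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def lexical_features(url: str) -> dict:
    """Compute simple lexical features from the URL string alone."""
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    parts = urlsplit(url if _SCHEME.match(url) else "//" + url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    lowered = url.lower()
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count": url.count("."),
        # Naive assumption: the last two labels are the registered domain,
        # which is wrong for suffixes such as "co.uk".
        "subdomain_count": max(len(labels) - 2, 0),
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
        "special_char_count": sum(url.count(ch) for ch in SPECIAL_CHARS),
    }


if __name__ == "__main__":
    sample = "http://secure-login.example.com.verify-account.info/update?id=123"
    print(lexical_features(sample))
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to an ensemble classifier such as a random forest; the tuning and preprocessing steps described in the abstract are outside the scope of this sketch.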