Enhance Random Forest Classifier for high accuracy URL phishing detection using lexical structure

preprint OA: closed CC-BY-4.0
Research Square, Research Article

Aliyu Ibrahim Sulaiman, Ibrahim Abdullahi Aliyu, Nidhi Tyagi

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9053171/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Phishing is one of the most critical threats in cybersecurity. Attackers use URLs to mislead users into revealing sensitive information such as login credentials, financial information, and sensitive organizational data. Blacklist, whitelist, and heuristic approaches have reduced such attacks, but they struggle with real-time and deceptive URLs.
This study presents an Enhanced Random Forest, a machine-learning framework that uses extensive lexical features to detect phishing with high accuracy. The system extracts features from the URL, such as digit frequency, number of dots, subdomains, suspicious keywords, and special characters, and avoids relying on external services such as domain lookups and webpage content, which enables lightweight, real-time detection. The system is trained on a comprehensive dataset of phishing and legitimate URLs that is preprocessed to balance the classes, encode the labels, and handle missing values. This helps the model generalize to behaviour that was not seen during training. Hyperparameters are tuned to minimize overfitting and improve generalization. In the experimental evaluation, the enhanced ensemble model outperforms other machine learning classifiers, achieving high accuracy and balanced recall while reducing the false positive rate. Feature importance analysis further shows which lexical indicators drive the classification decisions, which improves interpretability. The results indicate that ensemble machine learning combined with strong lexical feature engineering provides a computationally efficient and scalable solution that can be deployed in cybersecurity environments.

Keywords: Phishing, Cybersecurity, Machine Learning, Random Forest, Lexical Structure

1. Introduction

Phishing remains among the top cyber threats. It uses deceptive web technologies and social engineering to obtain unauthorized access to sensitive information, such as company financial records, individual account information, credentials, and other confidential data.
The growth of electronic services, such as e-government, e-commerce, online transactions, and online banking, has driven attackers to automate phishing with more sophisticated technologies that are difficult to detect, in order to impersonate services and steal the data of individuals, groups, or organizations. According to reports, attackers embed malicious URLs in emails, SMS (smishing), adverts, cloned websites, and online campaigns to impersonate legitimate services, which increases the success rate of the attack [1]. In addition, dynamically generated URLs and zero-day phishing URLs have made traditional detection methods inefficient, because those systems must be regularly updated with every newly identified malicious URL to remain effective and prevent data theft [2]. Heuristic approaches appear to offer a solution by detecting malicious URLs through suspicious keywords. However, when attackers use deceptive techniques such as URL shortening, character substitution, or domain spoofing, heuristics become ineffective and produce high false positive rates [3].

Machine learning (ML) provides solutions that learn automatically from varied patterns and adapt to tactical changes in attacks. Lexical feature analysis in particular examines the structural pattern of a URL, such as suspicious keywords, URL length, presence of an IP address, special characters, use of HTTP/HTTPS, and digit frequency, to judge whether the URL is legitimate, without DNS resolution or webpage content [4]. This makes the approach computationally efficient and well suited to real-time deployment.
Random Forest is an ensemble learning model that performs strongly on cybersecurity classification tasks because it handles high-dimensional feature spaces, reduces overfitting, and improves prediction stability [5]. Recent studies show that an optimized Random Forest combined with carefully engineered lexical features improves phishing detection accuracy compared with single classifiers such as Naïve Bayes or Logistic Regression [6]. Most single classifiers rely on few hyperparameters, which leaves a need to optimize system performance. This research therefore proposes an enhanced Random Forest classifier for URL-based phishing detection that combines rich lexical feature extraction with systematic hyperparameter tuning to improve the interpretability, accuracy, and robustness of URL classification. The proposed model aims to offer a scalable, lightweight, and highly accurate solution suitable for browser security tools, enterprise security infrastructure, and email filtering systems.

The remainder of the paper is structured as follows: Section 2 reviews related work; Section 3 describes the methodology, dataset, and system architecture; Section 4 discusses the experimental results; and Section 5 concludes the research.

2. Related Works

Phishing is one of the most serious threats in cybersecurity. It tricks users into revealing sensitive information such as account login credentials and organizational financial details. As technology advances, attackers adopt new deceptive techniques against which traditional security measures cannot provide effective and timely protection.
As a result, a wide range of phishing detection methods has been proposed, broadly divided into heuristic, machine learning, and blacklist-based approaches [7], [8].

2.1 Blacklist and Rule-Based Phishing Detection

Whitelist and blacklist mechanisms were the earliest techniques; they compare a URL against databases of trusted or malicious websites. These methods work well for legitimate and phishing websites that have already been reported, but they cannot detect zero-day and newly created phishing attacks [9]. Empirical research has shown that phishing URLs have a limited lifespan, which highlights the limitations of whitelist and blacklist approaches against dynamic phishing URLs [10]. As a result, these techniques become ineffective and cause delays in validating legitimate URLs.

To improve detection, heuristic-driven and rule-based techniques were introduced, which check the structure and features of a URL against manually written rules. Such systems use features like special characters, IP addresses, and the presence of suspicious words to determine whether a URL is legitimate or phishing. However, attackers can easily evade these rules by altering the URL, which makes the approach inflexible and can lead to high false positive rates [8].

2.2 Machine Learning-Based Phishing URL Detection

Machine learning is widely used in phishing detection because it overcomes the limitations of blacklist, heuristic-driven, and rule-based approaches. Supervised learning algorithms such as Support Vector Machines (SVM), Decision Trees, Logistic Regression, and Naïve Bayes have yielded encouraging results [11], [12].
These techniques outperform the earlier approaches because they learn complex patterns from the available data and generalize to newly generated URLs [11].

2.3 Random Forest for Phishing Detection

Ensemble learning methods such as Random Forest have proven to be effective classifiers for phishing detection. Random Forest combines the predictions of multiple decision trees, which improves classification accuracy and reduces overfitting [13]. According to several studies, Random Forest has consistently outperformed individual classifiers such as Naïve Bayes and Support Vector Machines on URL-based phishing datasets [14]. The approach suits this system because of its robustness and its scalability in handling high-dimensional features.

2.4 Lexical Feature-Based URL Analysis

Lexical feature-based analysis extracts discriminative characteristics from the URL string without depending on webpage content or host information. Examples of lexical features include the number of dots, URL length, suspicious words, special characters, IP addresses, and subdomains [15]. Research has demonstrated that pairing lexical features with an ensemble classifier such as Random Forest tends to achieve high accuracy [16]. To raise accuracy further, advanced lexical features were introduced, including domain entropy and typosquatting indicators. These features have greatly strengthened detection performance against complex phishing URLs [17].

2.5 Deep Learning and Hybrid Detection Techniques

Deep learning models, including Convolutional Neural Networks (CNNs), are among the most recent techniques used for URL phishing detection. These models eliminate manual feature engineering by automatically learning hierarchical representations from raw URLs [18], [19].
Limited datasets hinder the effectiveness of deep learning models, which require large amounts of well-labelled data, high computational resources, and long training times. This constrains their use for real-time detection and in resource-limited environments. Hybrid techniques that combine content-based, host-based, and lexical features have been proposed to improve detection accuracy. These systems have highly complex architectures and rely more heavily on external data sources, which in practice prevents lightweight deployment [20].

Table 2.0 Summary of research gaps

| Study Approach | Methodology Used | Key Strengths | Identified Limitations (Gap) | Contribution of the Proposed Study |
|---|---|---|---|---|
| Blacklist-based systems | URL matching against known phishing databases | Simple and fast detection for known threats | Ineffective against zero-day attacks, delayed updates, and short URL lifespan | Eliminates reliance on blacklists by using machine learning based real-time URL classification |
| Heuristic / rule-based methods | Manually crafted URL rules and patterns | Easy to implement; interpretable | Rigid rules; high false positives; easily bypassed | Uses data-driven learning instead of static rules |
| Classical ML models (NB, SVM, LR) | Supervised learning with handcrafted features | Better generalization than heuristics | Lower accuracy; sensitive to feature selection | Employs an ensemble Random Forest model for improved robustness |
| Basic Random Forest classifiers | Ensemble of decision trees | High accuracy; reduced overfitting | Limited feature optimization; default hyperparameters | Introduces enhanced Random Forest with tuned hyperparameters |
| Lexical feature-based approaches | URL string analysis only | Lightweight; fast; content independent | Feature sets are often limited; they ignore advanced obfuscation | Uses comprehensive lexical features, including entropy and suspicious keyword patterns |
| Content-based and host-based systems | HTML parsing, DNS, and WHOIS analysis | High detection accuracy | High latency; dependency on external services | Maintains a lightweight design without content or host dependencies |
| Deep learning approaches (CNN, DNN) | Automatic feature learning from raw URLs | High accuracy | Computationally expensive; poor real-time performance | Achieves comparable accuracy with lower computational cost |
| Hybrid phishing detection models | Combination of lexical, content, and host features | Strong performance | Increased complexity and deployment overhead | Proposes a scalable and simple architecture |
| Typosquatting and entropy-based detection | Lexical similarity and randomness analysis | Effective against domain spoofing | Not widely integrated into ML pipelines | Integrates advanced lexical indicators within the Random Forest framework |

2.7 Research Gap and Motivation

Despite extensive research on the problem, several gaps remain. Most techniques rely on host-based and content-based features to distinguish legitimate URLs, which introduces latency and reduces applicability to real-time detection. Despite their strength, deep learning techniques are computationally expensive and difficult to deploy in lightweight environments [21]. Furthermore, the lexical features used with Random Forest in prior work are limited, with little attention to optimization and feature engineering [19]. To address these challenges, the study proposes an enhanced Random Forest classifier that uses a wide and advanced range of lexical features to achieve high accuracy in separating phishing from legitimate URLs while keeping computational overhead minimal.

3. Proposed System

URL phishing is among the most prevalent forms of phishing attack, and many models and solutions have been developed to tackle the threat.
Domain-based, heuristic, and content-based approaches are among the solutions proposed to defend against it. This study employs a machine-learning approach that relies exclusively on lexical structure extracted from URLs, together with an enhanced Random Forest algorithm, which helps the model achieve high accuracy with low computational overhead.

3.1 Overall System Architecture

To support effective, real-time, and scalable URL classification, a modular, sequential architecture is proposed for the phishing detection system. The system focuses solely on the URL string, which eliminates dependence on external services and reduces latency compared with host-based and content-based URL detection systems. The stages in the system workflow are as follows: Each stage works independently, but the stages are connected to serve real-world cybersecurity applications such as email filters, browser extensions, intrusion prevention systems (IPS), and intrusion detection systems (IDS).

3.2 Dataset Description

The study combines legitimate and phishing URLs obtained from public repositories. Each URL is labelled as legitimate (0) or phishing (1). To support accurate modelling, the dataset is large and covers a variety of URL patterns and tactics. To prevent bias and ensure robust evaluation, the dataset is shuffled and divided into training and testing sets using a standard ratio, which helps the system learn from complex phishing URLs.

3.3 Lexical Feature

The feature engineering process relies on lexical features extracted directly from the URL. The system does not access metadata, webpage content, or DNS records.
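As a rough illustration of URL-only feature extraction, the sketch below computes a few of the count-based lexical features named in this paper directly from the URL string, with no network access. The function name `extract_lexical_features`, the keyword tuple, and the naive subdomain rule are assumptions made for illustration; they are not the authors' implementation.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Illustrative keyword list (assumption); the paper gives "secure", "verify", "login" as examples.
SUSPICIOUS_KEYWORDS = ("secure", "verify", "login")

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_lexical_features(url: str) -> dict:
    """Compute simple lexical features from the URL string alone."""
    # urlsplit only recognises the host when a scheme or a leading '//' is present.
    parts = urlsplit(url if _SCHEME_RE.match(url) else "//" + url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    ip_host = host_is_ip(host)
    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_at": url.count("@"),
        "num_equal": url.count("="),
        "num_slash": url.count("/"),
        "num_question": url.count("?"),
        "num_ampersand": url.count("&"),
        "num_percent": url.count("%"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "uses_https": int(parts.scheme.lower() == "https"),
        "has_ip_host": int(ip_host),
        # Naive rule (assumption): labels beyond "domain.tld" count as subdomains.
        # It ignores multi-part suffixes such as "co.uk".
        "num_subdomains": 0 if ip_host else max(len(labels) - 2, 0),
        "num_suspicious_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = extract_lexical_features("https://www.secure-login-bank.com/verify")
print(features)
```

A production extractor would likely use a public-suffix list for subdomain counting and handle malformed URLs explicitly; both are left out here to keep the sketch short.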
3.3.1 Lexical Features Extracted

The features extracted from the URLs are as follows:

Table 3.1 Features extracted

| Feature extracted | Description |
|---|---|
| Length of URL | Number of characters in the domain/URL |
| Subdomain count | Number of subdomains in the URL |
| Dots | Number of dots used |
| Entropy measures | Randomness of the domain name |
| HTTP and HTTPS | Indicates site security by checking whether the URL uses HTTP or HTTPS |
| Use of IP address | Binary indicator for IP-based URLs |
| Typosquatting | Checks for imitation of popular brand names |
| Suspicious keywords | Checks for words such as secure, verify, login, etc. |
| Frequency of digits in the URL | Number of digits in the domain |
| Use of special characters | Presence of special characters such as ?, =, /, -, etc. |

Table 3.2 Sample of lexical features extracted from URL

| | url_length | num_dots | num_hyphens | num_at | num_equal | num_slash | num_question | num_ampersand | num_percent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| 1 | 77 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| 2 | 126 | 4 | 1 | 0 | 3 | 5 | 1 | 2 | 0 |
| 3 | 18 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 4 | 55 | 2 | 2 | 0 | 0 | 5 | 0 | 0 | 0 |

3.3.2 Feature Selection Justifications

Phishing URLs often exhibit features such as excessive use of special characters, high digit frequency, many subdomains and dots, abnormal structural patterns, and suspicious keywords. Lexical features capture these characteristics effectively, providing robust discrimination while keeping detection computationally efficient.

3.4 Data Preprocessing

Before training, the extracted features are preprocessed to guarantee the consistency and quality of the data. The steps are as follows:

Label Encoding: phishing URLs are labelled 1 and legitimate URLs 0.
Handling Missing Values: missing values are replaced with default values.
Feature Scaling: numeric features are normalized using standard scaling.
Class Balancing: imbalanced classes are handled using oversampling techniques.

Together, these preprocessing steps improve model performance by resolving missing values, class imbalance, and label inconsistencies, and by producing a well-scaled feature dataset that is ready for training.

3.5 Enhanced Random Forest Classification Model

3.5.1 Overview of Random Forest

Random Forest is a strong ensemble learning method that is widely used for its classification performance. It builds multiple decision trees on random subsets of the data and features and determines the final output by majority voting, which improves prediction accuracy while reducing model variance on complex data [22].

3.5.2 Enhancement Model Strategy

To improve performance and accuracy over a standard Random Forest implementation, the model is enhanced through the following:

Optimizing the number of trees
Controlling tree depth to avoid overfitting
Optimizing the number of features sampled at each split
Class-balanced learning

This strategy helps the model capture the complex patterns used in phishing without incurring significant processing and computational cost. The following table indicates the key hyperparameters used in the model.

3.6 Description of Algorithm

Algorithm: Enhanced Random Forest-Based Phishing URL Detection
Input: Active URL or dataset of URLs
Output: Phishing or legitimate URL classification

1. Load the URL dataset labelled phishing and legitimate.
2. Extract the lexical features from each URL.
3. Apply preprocessing to normalize the dataset.
4. Divide the dataset into training and testing sets.
5. Train the Random Forest classifier on the extracted features.
6. Predict the class of user-provided or unseen URLs.
7. Display the classification result.

3.7 Evaluation Strategy of the Model

The proposed system is evaluated using well-established performance metrics that are extensively used in cybersecurity research. The metrics are as follows:

Table 3.3 Summary of Model Evaluation Strategy

| Metric | Description |
|---|---|
| Accuracy | Measures the overall correctness of classification |
| Precision | Proportion of URLs flagged as phishing that are actually phishing; high precision reduces false alarms |
| Recall | Proportion of actual phishing URLs that the model detects |
| F1-score | Harmonic mean that balances precision and recall |
| Confusion Matrix | Provides a detailed breakdown for error analysis |
| ROC-AUC | Reflects the model's ability to rank phishing URLs above legitimate ones (class separability) |

3.8 Methodological Advantages

The proposed approach provides several advantages, which include:

High accuracy through ensemble learning.
Easy integration into existing security systems.
Robustness against zero-day phishing attacks.
Lightweight, content-independent architecture.
Low computational overhead suitable for real-time use.

Taken together, the methodology provides a scalable, efficient, and effective URL phishing detection system that uses lexical features with an enhanced Random Forest classifier. The study proposes a systematic way to address the limitations identified in the literature by providing a system suitable for real-world deployment.

4. Results and Discussion

4.1 Descriptive Analysis of the Dataset

The original dataset was collected from different sources and contains only two columns: the URL and a label indicating whether the URL is legitimate.
Because the sources differ, their labels differ too: some use 'legitimate' or 'good' for legitimate URLs and 'phishing' or 'bad' for illegitimate URLs. After cleaning and preprocessing, legitimate URLs were encoded as 0 and illegitimate URLs as 1. The final processed dataset contained 560,776 URLs: 398,639 legitimate URLs (71.1%) and 162,137 phishing URLs (28.9%). The structural characteristics of the URLs differ between the two classes. Legitimate URLs average 45.8 characters and phishing URLs average 63.6 characters, an increase of 38.9% in length for phishing URLs. This suggests that attackers tend to build long, complex URLs to make them deceptive.

Table 4.1 Dataset Characteristics

| Characteristic | Value |
|---|---|
| Total Samples | 560,776 |
| Legitimate URLs | 398,639 (71.1%) |
| Phishing URLs | 162,137 (28.9%) |
| Avg. Length (Legitimate) | 45.8 characters |
| Avg. Length (Phishing) | 63.6 characters |

4.2 Evaluation of the Baseline Model

A Random Forest classifier was chosen as the baseline model because, as an ensemble, it captures non-linear relationships between features and reduces the overfitting risk of single decision trees. The model was trained on 80% of the dataset with 100 estimators, and the remaining 20% was set aside for testing. Stratified sampling was used in the train-test split so that both subsets reflect the real-world class distribution and no additional bias is introduced. The classification report shows different performance across the classes. Legitimate URLs (class 0) achieved a precision of 0.94, a recall of 0.96, and an F1-score of 0.95, which indicates a low false positive rate. Phishing URLs (class 1) achieved a precision of 0.91, a recall of 0.85, and an F1-score of 0.87. Overall, the model achieved 93.01% accuracy.
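The baseline setup described above (80/20 stratified split, 100 estimators, per-class classification report) could be sketched with scikit-learn as follows. The data here is synthetic, generated with `make_classification` as a stand-in for the lexical feature matrix, and the `random_state` values and feature counts are arbitrary assumptions, so the printed scores will not match the paper's figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lexical feature matrix (NOT the paper's dataset).
# The class weights roughly mimic the reported 71/29 legitimate/phishing split.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=8,
    weights=[0.71, 0.29],
    random_state=42,
)

# 80% training, 20% testing; stratify keeps the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Baseline Random Forest with 100 trees, as stated in the text.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"accuracy: {accuracy:.4f}")
print(classification_report(
    y_test, y_pred, target_names=["legitimate (0)", "phishing (1)"]
))
```

Replacing `X` and `y` with the extracted lexical features and encoded labels would give the same workflow on real data; any class balancing should be applied to the training portion only so that the test set keeps the real-world distribution.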
Table 4.2 Summary of Baseline Model

| Class | F1-Score | Precision | Recall |
|---|---|---|---|
| Legitimate (0) | 0.95 | 0.94 | 0.96 |
| Phishing (1) | 0.87 | 0.91 | 0.85 |
| Weighted Average | 0.93 | 0.93 | 0.93 |

The high recall for the legitimate class (0.96) shows that the model keeps false positives low, which means that legitimate user traffic is rarely blocked. However, about 15% of phishing URLs evaded detection, probably those that used advanced evasion techniques or compromised legitimate domains.

Performance of the Enhanced Random Forest Model

The Random Forest model with 100 estimators attained high accuracy and generalized well. Bagging and random feature selection at each split reduce overfitting by design, which is why the model works well on data it has not seen before. The improved feature engineering process produced a full set of URL characteristics: structural features (such as URL length and special character counts), protocol indicators (such as the presence of HTTPS), domain analysis (such as TLD detection and subdomain counting), special character counts (such as dots and hyphens), and semantic features (such as suspicious keyword detection and brand typosquatting indicators). This multi-dimensional feature space lets the model find many different phishing patterns.

In a phishing simulation, the model correctly flagged a fake phishing URL that resembled a banking portal (https://www.secure-login-bank.com/verify) with a phishing confidence of 87.0%. The suspicious keywords in the URL, the hyphenated domain structure, and the path depth all contributed to this high-confidence classification. The model also correctly classified "openai.com/verify" as legitimate with 53.6% confidence, despite the presence of a verification-related keyword.
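The entropy measure and the brand typosquatting indicator listed among the features can be illustrated with two small functions. This is a minimal sketch under assumptions: the brand list `KNOWN_BRANDS` and the helper `min_brand_distance` are hypothetical, and the paper does not specify how its own versions of these features are computed.

```python
import math
from collections import Counter

# Hypothetical brand list for illustration only.
KNOWN_BRANDS = ("google", "paypal", "amazon")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; higher values look more random."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                  # delete ch_a
                current[j - 1] + 1,               # insert ch_b
                previous[j - 1] + (ch_a != ch_b), # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def min_brand_distance(domain_label: str) -> int:
    """Smallest edit distance between a domain label and any known brand."""
    label = domain_label.lower()
    return min(levenshtein(label, brand) for brand in KNOWN_BRANDS)


print(shannon_entropy("google"), shannon_entropy("x7qk2vz9"))
print(min_brand_distance("gooogle"))
```

A small but non-zero distance to a brand name is the signal of interest: a distance of 0 is the brand itself, while 1 or 2 suggests a look-alike domain.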
4.3 Feature Importance and Interpretability Analysis

The 54 lexical and semantic features engineered from the dataset worked effectively on unseen data. URL length was very important, which is consistent with the difference in average length between phishing and legitimate URLs. Counts of special characters, especially dots, hyphens, and slashes, helped to tell phishing URLs apart, because phishing URLs often use elaborate subdomain hierarchies or hidden paths. Domain entropy, which measures how random a domain name is, helped detect Algorithmically Generated Domains (AGDs), a mechanism botnets frequently use to manage their infrastructure. Typosquatting distance was another important engineered feature, detecting domains that imitate other domains. Domains within a Levenshtein distance of 1 or 2 from well-known brand names, such as "gooogle.com", were strongly associated with the phishing label. The count of suspicious terms was also important for identifying high-risk keywords that signal intent.

4.4 Discussion of Findings

The results of this study show that a Random Forest classifier using only lexical features can detect phishing in real time with high accuracy (93%). This performance is possible without the latency of third-party reputation queries or page-content scraping, which makes the approach suitable for high-traffic networks. However, the analysis also reveals a "grey area" of classification. The 15% false negative rate suggests that attackers are shifting to shorter, "cleaner" URLs without obvious heuristic triggers. The probabilistic analysis also shows that legitimate websites that use security-related words in their URLs risk being falsely flagged. The 11-point recall gap between the legitimate class (0.96) and the phishing class (0.85) suggests that certain phishing URLs successfully mimic legitimate characteristics.
This finding highlights the need for ongoing model updates to keep pace with evolving phishing strategies. Future work should focus on improving accuracy and closing the recall gap between legitimate and phishing URLs. Possible directions include using WHOIS registration data to analyze registered domains, and character-level embeddings with deep learning architectures such as CNNs or LSTMs to capture semantic patterns in the URL string that manual feature engineering might miss. In conclusion, the results in this chapter suggest that machine learning, specifically Random Forest classification with careful feature engineering, is a good technique for detecting phishing URLs. The 93% accuracy achieved indicates that the model is suitable for deployment in real-world applications. Despite certain limitations and planned future enhancements, this study advances current efforts to safeguard users from phishing attempts through intelligent automated detection systems.

Conclusion

In the digital era, phishing has become one of the most damaging cyber threats to the reputation of individuals, groups, governments, and organizations. Attackers use fake websites, malicious URLs, and deceptive emails to illegally access financial records, login credentials, and sensitive organizational information. They use online platforms, e-commerce, and social media to target victims, and the attack is launched when a victim clicks on the malicious link provided. Traditional approaches such as whitelists and blacklists are ineffective against real-time and zero-day phishing URLs because they depend on manually updating records whenever a malicious URL is identified.
Similarly, heuristic-based techniques, which use suspicious keywords to detect malicious URLs, generate high false positive rates when dealing with deceptive links that attackers have altered. Machine learning addresses these limitations by learning from complex patterns, detecting in real time, and adapting to tactical changes in attacks, and it has therefore gained interest as a practical solution to phishing. Lexical feature analysis in particular, which uses URL structures such as digit density, URL length, suspicious keywords, IP addresses, HTTP/HTTPS presence, special characters, and number of subdomains, provides a high-performing, lightweight, real-time detection system that does not depend on external resources such as web content or domain queries. To provide an efficient, scalable, and accurate phishing detection system, this study introduced an enhanced Random Forest classifier that delivers an optimized solution with high classification power when combined with carefully engineered lexical features. The study aims to provide a reliable solution by improving detection performance through feature importance analysis and by reducing overfitting while preserving the interpretability of the model. The study contributes a practical and affordable URL phishing detection technique: an ensemble-learning framework capable of real-time operation that can be deployed in browser extensions, email filters, and enterprise solutions to protect security infrastructure against URL phishing attacks.

Declarations

Ethics Approval

This study does not involve human participants, animal subjects, or sensitive personal data. Accordingly, formal ethics approval from an Institutional Review Board (IRB) or Ethics Committee was not required.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author Contribution
A.I.S. and I.A.A. wrote the main manuscript text and performed the data analysis, feature engineering, and implementation of the Random Forest model. A.I.S. conducted the experiments and prepared the figures and tables. N.T. supervised the research work, validated the methodology, and provided critical revisions to the manuscript. All authors reviewed and approved the final manuscript.

Data Availability
The dataset is available at https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls

References
N. Abdelhamid, A. Ayesh, and F. Thabtah, “Phishing detection based Associative Classification data mining,” Expert Syst. Appl., vol. 41, no. 13, pp. 5948–5959, Oct. 2014, doi: 10.1016/j.eswa.2014.03.019.
N. Abdelhamid, F. Thabtah, and H. Abdel-jaber, “Phishing detection: A recent intelligent machine learning comparison based on models content and features,” in 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China: IEEE, Jul. 2017, pp. 72–77. doi: 10.1109/ISI.2017.8004877.
W. Guo, Q. Wang, H. Yue, H. Sun, and R. Q. Hu, “Efficient Phishing URL Detection Using Graph-based Machine Learning and Loopy Belief Propagation,” Jan. 12, 2025, arXiv:2501.06912. doi: 10.48550/arXiv.2501.06912.
A. K. Dutta, “Detecting phishing websites using machine learning technique,” PLOS ONE, vol. 16, no. 10, p. e0258361, Oct. 2021, doi: 10.1371/journal.pone.0258361.
“Website Phishing Detection Using Machine Learning Techniques,” J. Stat. Appl. Probab., vol. 13, no. 1, pp. 119–129, Jan. 2024, doi: 10.18576/jsap/130108.
Y. Kumar and B. Subba, “A lightweight machine learning based security framework for detecting phishing attacks,” in 2021 International Conference on COMmunication Systems & NETworkS (COMSNETS), Bangalore, India: IEEE, Jan. 2021, pp. 184–188. doi: 10.1109/COMSNETS51098.2021.9352828.
A. A. Albishri and M. M. Dessouky, “A Comparative Analysis of Machine Learning Techniques for URL Phishing Detection,” Eng. Technol. Appl. Sci. Res., vol. 14, no. 6, pp. 18495–18501, Dec. 2024, doi: 10.48084/etasr.8920.
J. Vega, D. Shevchyk, and Y. Cheng, “A Literature Survey of Phishing and Its Countermeasures”.
S. Marchal, G. Armano, T. Grondahl, K. Saari, N. Singh, and N. Asokan, “Off-the-Hook: An Efficient and Usable Client-Side Phishing Prevention Application,” IEEE Trans. Comput., vol. 66, no. 10, pp. 1717–1733, Oct. 2017, doi: 10.1109/TC.2017.2703808.
A. Oest et al., “PhishTime: Continuous Longitudinal Measurement of the Effectiveness of Anti-phishing Blacklists”.
N. F. Almujahid, M. A. Haq, and M. Alshehri, “Comparative evaluation of machine learning algorithms for phishing site detection,” PeerJ Comput. Sci., vol. 10, p. e2131, Jun. 2024, doi: 10.7717/peerj-cs.2131.
M. Abutaha, M. Ababneh, K. Mahmoud, and S. A.-H. Baddar, “URL Phishing Detection using Machine Learning Techniques based on URLs Lexical Analysis,” in 2021 12th International Conference on Information and Communication Systems (ICICS), Valencia, Spain: IEEE, May 2021, pp. 147–152. doi: 10.1109/ICICS52457.2021.9464539.
R. Verma and K. Dyer, “On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers,” in Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, San Antonio, Texas, USA: ACM, Mar. 2015, pp. 111–122. doi: 10.1145/2699026.2699115.
X. Zhou and R. M. Verma, “Phishing Sites Detection from a Web Developer’s Perspective Using Machine Learning”.
O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Syst. Appl., vol. 117, pp. 345–357, Mar. 2019, doi: 10.1016/j.eswa.2018.09.029.
B. Banik and A. Sarma, “Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques,” in Machine Learning, Image Processing, Network Security and Data Sciences, A. Bhattacharjee, S. Kr. Borgohain, B. Soni, G. Verma, and X.-Z. Gao, Eds., Communications in Computer and Information Science, vol. 1241. Singapore: Springer Singapore, 2020, pp. 93–105. doi: 10.1007/978-981-15-6318-8_9.
A. Moubayed, M. Injadat, A. Shami, and H. Lutfiyya, “DNS Typo-squatting Domain Detection: A Data Analytics & Machine Learning Based Approach,” in 2018 IEEE Global Communications Conference (GLOBECOM), Dec. 2018, pp. 1–7. doi: 10.1109/GLOCOM.2018.8647679.
Q. E. U. Haq, M. H. Faheem, and I. Ahmad, “Detecting Phishing URLs Based on a Deep Learning Approach to Prevent Cyber-Attacks,” Appl. Sci., vol. 14, no. 22, p. 10086, Nov. 2024, doi: 10.3390/app142210086.
A. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J.-P. Niyigena, “An Effective Phishing Detection Model Based on Character Level Convolutional Neural Network from URL,” Electronics, vol. 9, no. 9, p. 1514, Sep. 2020, doi: 10.3390/electronics9091514.
S. Hamadouche, O. Boudraa, and M. Gasmi, “Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models,” ICST Trans. Scalable Inf. Syst., Apr. 2024, doi: 10.4108/eetsis.4421.
N. Innab et al., “Phishing Attacks Detection Using Ensemble Machine Learning Algorithms,” Comput. Mater. Contin., vol. 80, no. 1, pp. 1325–1345, 2024, doi: 10.32604/cmc.2024.051778.
J. Chen, X. Wang, and F. Lei, “Data-driven multinomial random forest: a new random forest variant with strong consistency,” J. Big Data, vol. 11, no. 1, p. 34, Feb. 2024, doi: 10.1186/s40537-023-00874-6.

Additional Declarations
No competing interests reported.
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-9053171","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":604741649,"identity":"ac717a5b-8c5f-449f-9379-84f7f37f3e7c","order_by":0,"name":"Aliyu Ibrahim Sulaiman","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/ElEQVRIiWNgGAWjYBACAwbGBgkQg5+ZsfHBByCDjZ1YLZLtzc2GM0BamAlqYWAAazE4c7xNmgfEIqTFnP1w442fe+rkGW4kNkjb/Nomz8fMwPjhYw5uLZY9ic2WPc/YDBtnJDYY5/bdNmxjZmCWnLkNj8MOJLZJ8BzgYWyWSGxIzu25zQjUwsbMi0/L+Ydtkn8OSNi3AbUctuy5bU9Yy41EoK8PGCT28BxsbGb4cTuRCC0Pm61lDiQkz2BvbGbsbbid3MbM2IzfL+fTH958c6DOdv9h9uc/fvy5bTu/vfngh494tKACxjYw2UCsehD4Q4riUTAKRsEoGCkAANmKVqQ3v6HdAAAAAElFTkSuQmCC","orcid":"","institution":"Shobhit University","correspondingAuthor":true,"prefix":"","firstName":"Aliyu","middleName":"Ibrahim","lastName":"Sulaiman","suffix":""},{"id":604741650,"identity":"c6b9bd29-ecde-4f28-9aa3-6903b28b9dba","order_by":1,"name":"Ibrahim Abdullahi Aliyu","email":"","orcid":"","institution":"Shobhit 
University","correspondingAuthor":false,"prefix":"","firstName":"Ibrahim","middleName":"Abdullahi","lastName":"Aliyu","suffix":""},{"id":604741651,"identity":"bafda1ee-6f3e-457c-89fe-f233b6b93a3b","order_by":2,"name":"Nidhi Tyagi","email":"","orcid":"","institution":"Shobhit University","correspondingAuthor":false,"prefix":"","firstName":"Nidhi","middleName":"","lastName":"Tyagi","suffix":""}],"badges":[],"createdAt":"2026-03-06 18:23:20","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-9053171/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9053171/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":105022128,"identity":"26226075-698a-4ae9-9c13-f785423e27a7","added_by":"auto","created_at":"2026-03-20 03:13:41","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":565997,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eFig 3.1 system workflow\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-9053171/v1/67b207f1e72dcd529ea1dab1.png"},{"id":105022127,"identity":"fe259e73-eacf-455f-a6ac-26fe2c35c918","added_by":"auto","created_at":"2026-03-20 03:13:41","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":225194,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eFig 3.2 Data Preprocessing\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-9053171/v1/25712b0f562aad522012227f.png"},{"id":105022129,"identity":"10a40470-e79e-44c3-a345-f4d34230a14a","added_by":"auto","created_at":"2026-03-20 03:13:41","extension":"png","order_by":3,"title":"Figure 
3","display":"","copyAsset":false,"role":"figure","size":45780,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eFig 3.2 Key Hyperparameters\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-9053171/v1/141f84cf99aa485077b0a5ed.png"},{"id":105022130,"identity":"971f3ffa-5e23-4b7c-b492-3da839536d84","added_by":"auto","created_at":"2026-03-20 03:13:46","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2066331,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9053171/v1/d325d738-e157-4898-8499-e2ddebd4b5b9.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Enhance Random Forest Classifier for high accuracy URL phishing detection using lexical structure","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003ePhishing as the one of the deadly threat in the world of cybersecurity has remain among the top cyber threat, which uses deceptive web technologies and social engineering techniques to gain unauthorized details and sensitive information, such company financial records, individual account information, credentials, and other related confidential data. The advancement in technology and electronic services, such as e-government, e-commerce, online transactions, and online banking, has fueled the phishing to automate the technique in using technologies that are more sophisticated, which are difficult to detect to impassionate and steal individuals\u0026rsquo;, groups\u0026rsquo;, or organizational data. 
According to reports, attackers embed malicious URLs into emails, SMS (smishing), adverts, cloned websites, and online campaigns to impersonate legitimate services, which increases the success rate of the attack [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In addition, the emergence of dynamically generated URLs and zero-day phishing URLs has made traditional detection methods inefficient, since such systems must be regularly updated with newly identified malicious URLs to remain effective and prevent data theft [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe heuristic approach appears to be a solution, detecting malicious URLs using suspicious keywords. However, when attackers employ deceptive techniques in the URL, such as shortening, character substitution, or domain spoofing, the technique proves ineffective and produces high false positive rates [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. The emergence of machine learning (ML) has provided sophisticated solutions capable of learning automatically from varied patterns and adapting to tactical changes in attacks. Lexical feature analysis in particular examines the structural pattern of URLs, such as suspicious keywords, URL length, presence of an IP address, special characters, presence of HTTP/HTTPS, and digit frequency, to determine the validity of the URL without the need for DNS resolution or webpage content [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. 
This makes the approach more appropriate for real-time deployment and more computationally efficient.\u003c/p\u003e \u003cp\u003eRandom Forest is an ensemble learning model that performs strongly on cybersecurity classification tasks because it handles high-dimensional feature spaces, reduces overfitting, and improves prediction stability [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Recent studies show that an optimized Random Forest, when carefully combined with engineered lexical features, improves phishing detection accuracy compared to single classifiers like Na\u0026iuml;ve Bayes or Logistic Regression [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Most of these single classifiers rely on few hyperparameters, which leaves room for optimizing system performance.\u003c/p\u003e \u003cp\u003eAccordingly, this research proposes an enhanced Random Forest classifier for URL-based phishing detection that combines enriched lexical feature extraction with systematic hyperparameter tuning to improve the interpretability, accuracy, and robustness of URL classification. The proposed model aims to offer a scalable, lightweight, and highly accurate solution suitable for use as a browser security tool, in enterprise security infrastructures, and in email filtering systems.\u003c/p\u003e \u003cp\u003eThe remainder of the paper is structured as follows: Section 2 reviews related work, Section 3 discusses the methodology, dataset, and system architecture,\u003c/p\u003e \u003cp\u003eSection 4 discusses the experimental results, and Section 5 concludes the research.\u003c/p\u003e"},{"header":"2. 
Related Works","content":"\u003cp\u003ePhishing is one of the most significant threats in cybersecurity; it tricks users into revealing sensitive information such as account credentials and organizational financial details. As technology advances, attackers adopt new deceptive techniques against which traditional security measures no longer provide effective and timely protection. As a result, a wide range of phishing detection methods has been proposed, broadly categorized into heuristic, machine learning, and blacklist-based approaches [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Blacklist and Rule-Based Phishing Detection\u003c/h2\u003e \u003cp\u003eWhitelist and blacklist mechanisms were the earliest techniques, comparing URLs against databases of malicious or trusted websites. These methods work effectively for identifying legitimate and phishing websites that have already been reported, but they are unable to detect zero-day and newly created phishing attacks [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Empirical research has shown that phishing URLs have a limited lifespan, which highlights the limitations of whitelist and blacklist approaches in detecting dynamic phishing URLs [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. As a result, the technique becomes ineffective and causes delays in validating legitimate URLs.\u003c/p\u003e \u003cp\u003eTo improve the efficiency of phishing detection, heuristic-driven and rule-based techniques were introduced, which manually check the structure and features of the URL. 
The system uses features like special characters, IP addresses, and the presence of suspicious words in the URL to determine whether it is a legitimate or phishing site. However, attackers can easily get around this approach by altering the URL; the rules are inflexible, which can lead to high false positive rates [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Machine Learning-Based Phishing URL Detection\u003c/h2\u003e \u003cp\u003eMachine learning is an emerging technology that is widely used in phishing detection. It overcomes the limitations of traditional, heuristic-driven, and rule-based approaches. Supervised learning algorithms such as Support Vector Machines (SVM), Decision Trees, Logistic Regression, and Na\u0026iuml;ve Bayes have yielded encouraging results [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. These techniques outperform the other approaches because they learn complex patterns from the available data and generalize to newly generated URLs [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Random Forest for Phishing Detection\u003c/h2\u003e \u003cp\u003eEnsemble learning methods like Random Forest have proven to be among the more effective machine learning classifiers for phishing detection. Random Forest combines the predictions of multiple decision trees, which improves classification accuracy and reduces overfitting [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
According to several studies, Random Forest has consistently outperformed other individual classifiers, such as Na\u0026iuml;ve Bayes and Support Vector Machines, on URL-based phishing datasets [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. The approach suits this system because of its robustness, capacity, and scalability in handling high-dimensional features.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Lexical Feature-Based URL Analysis\u003c/h2\u003e \u003cp\u003eLexical feature-based analysis extracts discriminative characteristics from URL strings without depending on webpage content or information about the host. Examples of lexical features include the number of dots in the URL, URL length, suspicious words, the presence of special characters, IP addresses, and subdomains [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Research has demonstrated that pairing lexical features with an ensemble classifier like Random Forest tends to achieve high accuracy [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eTo achieve higher accuracy, advanced lexical features were introduced, including domain entropy and typosquatting indicators used to authenticate the URL. These features have greatly strengthened performance against complex phishing URLs [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Deep Learning and Hybrid Detection Techniques\u003c/h2\u003e \u003cp\u003eDeep learning models, including Convolutional Neural Networks (CNNs), are among the most recent techniques used in URL phishing detection. 
The models eliminate manual feature engineering by automatically learning hierarchical representations from raw URLs [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. However, limited datasets hinder the effectiveness of deep learning models, which require large amounts of well-labelled data, high computational resources, and longer training times; this constrains their use for real-time detection in resource-limited environments.\u003c/p\u003e \u003cp\u003eHybrid techniques that combine content-based, host-based, and lexical features are another proposed way to improve phishing detection accuracy. These systems have highly complex architectures and rely on external data sources, which in practice limits lightweight deployment [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2.0\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummary of research gaps\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStudy Approach\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eMethodology Used\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eKey Strengths\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIdentified Limitations (Gap)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eContribution of the Proposed Study\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBlacklist-based systems\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eURL matching against known phishing databases\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSimple and fast detection for known threats\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIneffective against zero-day attacks, delayed updates, and short URL lifespan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEliminates reliance on blacklists by using machine learning based real time URL\u003c/p\u003e \u003cp\u003eclassification\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHeuristic / Rule- based methods\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eManually crafted URL rules and\u003c/p\u003e \u003cp\u003epatterns\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEasy to implement;\u003c/p\u003e \u003cp\u003einterpretable\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRigid rules; high false positives;\u003c/p\u003e \u003cp\u003eeasily bypassed\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eUses data-driven learning instead\u003c/p\u003e \u003cp\u003eof static 
rules\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClassical ML models (NB, SVM, LR)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSupervised learning with handcrafted features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBetter generalization than heuristics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLower accuracy; sensitive to feature selection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEmploys an ensemble Random Forest model for improved\u003c/p\u003e \u003cp\u003erobustness\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBasic Random Forest classifiers\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEnsemble of decision trees\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHigh accuracy; reduced overfitting\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLimited feature optimization; default hyperparameters\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eIntroduces enhanced Random Forest with tuned\u003c/p\u003e \u003cp\u003ehyperparameters\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLexical feature- based approaches\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eURL string analysis only\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLightweight; fast; content independent\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFeature sets are often limited; they ignore advanced obfuscation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eUses comprehensive lexical features, including entropy and suspicious keyword\u003c/p\u003e \u003cp\u003epatterns\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eContent-based and host-based systems\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHTML parsing, DNS, and WHOIS analysis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHigh detection accuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHigh latency; dependency on external services\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMaintains a lightweight design without content or host\u003c/p\u003e \u003cp\u003edependencies\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeep learning approaches (CNN, DNN)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAutomatic feature learning from raw URLs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eHigh accuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eComputationally expensive; poor real-time performance\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAchieves comparable accuracy with lower computational\u003c/p\u003e \u003cp\u003ecost\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHybrid phishing detection models\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCombination of lexical, content, and host features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStrong performance\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIncreased complexity and deployment overhead\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eProposes a scalable and simple architecture\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTypo squatting and entropy- based detection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLexical similarity and randomness analysis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEffective against domain spoofing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNot widely integrated into ML pipelines\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eIntegrates advanced lexical indicators within the Random Forest\u003c/p\u003e \u003cp\u003eframework\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.7 Research Gap and 
Motivation\u003c/h2\u003e \u003cp\u003eDespite extensive research on the problem, several gaps remain. Most techniques rely on host-based and content-based features to classify URLs, which reduces applicability to real-time detection and introduces latency. Despite the strength of deep learning in addressing the problem, these techniques are computationally expensive and challenging to deploy in lightweight environments [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Furthermore, the lexical features used with Random Forest in prior work are limited, and feature engineering and optimization have received little attention, which restricts the stability of the resulting solutions [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eTo address these challenges, the study proposes an enhanced Random Forest classifier that uses a wide range of advanced lexical features to achieve high accuracy in distinguishing phishing from legitimate URLs while maintaining minimal computational overhead.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Proposed System","content":"\u003cp\u003eURL phishing is among the most widespread forms of phishing attack, even though many models and solutions have been developed to tackle it. Domain-based, heuristic, and content-based approaches are among the solutions proposed to defend against it. 
The study employs a machine learning approach that relies exclusively on lexical features extracted from the URL, combined with an enhanced Random Forest algorithm, which helps the model achieve high accuracy with low computational overhead.\u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Overall System Architecture\u003c/h2\u003e \u003cp\u003eTo support effective, real-time, and scalable URL classification, a modular, sequential architecture is proposed for the phishing detection system. The system focuses solely on the URL string, which eliminates the dependence on external services and reduces latency compared to host-based and content-based URL detection systems.\u003c/p\u003e \u003cp\u003eThe stages in the system workflow are as follows:\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe stages work independently but are connected in sequence, so the system can be used in real-world cybersecurity applications such as email filters, browser extensions, intrusion prevention systems (IPS), and intrusion detection systems (IDS).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Dataset Description\u003c/h2\u003e \u003cp\u003eThe study combines legitimate and phishing URLs obtained from public repositories. Each URL in the dataset is labeled as legitimate (0) or phishing (1). 
To support accurate modelling, the dataset is large and covers a wide variety of URL patterns and phishing tactics.\u003c/p\u003e \u003cp\u003eTo prevent bias and ensure robust evaluation, the dataset is shuffled and divided into training and testing sets using a standard ratio, which helps the model learn from complex phishing URLs.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Lexical Feature\u003c/h2\u003e \u003cp\u003eThe proposed feature engineering process relies on lexical features, which are extracted directly from the URL string. The system does not access metadata, webpage content, or DNS records.\u003c/p\u003e \u003cdiv id=\"Sec13\" class=\"Section3\"\u003e \u003ch2\u003e3.3.1 Lexical Features Extracted\u003c/h2\u003e \u003cp\u003eThe features extracted from the URLs are as follows:\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3.1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFeatures extracted\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFeature extracted\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLength of URL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eThe number of characters used in the domain/URL\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSubdomain count\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNumber of subdomains used in the URL\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDots\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNumber of dots used\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEntropy measures\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeasures the randomness of the domain name\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHTTP and HTTPS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIndicates site security by checking whether the URL uses HTTP or HTTPS\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUse of IP Address\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBinary indicator for IP-based URLs\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTyposquatting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChecks for popular brand names or close misspellings of them\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSuspicious keywords\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChecks for words such as secure, verify, login, etc.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFrequency of Digits in 
the\u003c/p\u003e \u003cp\u003eURL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCounts the number of digits in the domain\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eUse of special characters\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChecks for the presence of special characters such as ?, =, /, -, etc.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3.2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSample of lexical features extracted from URL\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"10\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" 
colnum=\"10\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003e0\u003c/b\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eurl_length\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003enum_dots\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003enum_hyphens\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003enum_at\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003enum_equal\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003enum_slash\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003enum_question\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003enum_ampersand\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003enum_percent\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e37\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c10\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e 
\u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e77\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e126\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e 
\u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c8\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e \u003ch2\u003e3.3.2 Feature Selection Justifications\u003c/h2\u003e \u003cp\u003ePhishing URLs often exhibit characteristics such as excessive use of special characters, a high digit frequency, a large number of subdomains and dots, abnormal structural patterns, and suspicious keywords. The lexical features capture these characteristics, providing robust discrimination while keeping the detection process computationally efficient.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Data Preprocessing\u003c/h2\u003e \u003cp\u003eBefore the model is trained, the extracted features are preprocessed to guarantee the consistency and quality of the data. 
The steps are as follows:\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eLabel Encoding\u003c/b\u003e: Each URL is labelled 1 if it is phishing and 0 if it is legitimate.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eHandling Missing Values\u003c/b\u003e: Missing values are replaced with default values.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eFeature Scaling\u003c/b\u003e: The numeric features are normalized to a standard scale.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eClass Balancing\u003c/b\u003e: Imbalanced classes are handled using oversampling techniques.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eTogether, these preprocessing steps improve model performance by handling missing values, correcting class imbalance, encoding labels consistently, and producing a scaled feature set that is ready for training.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Enhanced Random Forest Classification Model\u003c/h2\u003e \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e \u003ch2\u003e3.5.1 Overview of Random Forest\u003c/h2\u003e \u003cp\u003eRandom Forest is an ensemble learning method that is widely used because of its strong classification performance. 
By building multiple decision trees on random subsets of the data and features, and using majority voting to determine the final output, the classifier improves prediction accuracy while reducing model variance in complex classification tasks [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section3\"\u003e \u003ch2\u003e3.5.2 Enhancement Model Strategy\u003c/h2\u003e \u003cp\u003eTo improve accuracy over a standard Random Forest implementation, the model is enhanced through the following:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eOptimizing the number of trees\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eControlling tree depth to avoid overfitting\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eSampling a subset of features at each split\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eClass-balanced learning\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eThese strategies help the model capture the complex patterns used in phishing URLs without significant processing and computational cost. 
The following table indicates the key hyperparameters used in the model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e3.6 Description of Algorithm\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eAlgorithm\u003c/strong\u003e \u003cp\u003e \u003cb\u003eEnhanced Random Forest-Based Phishing URL Detection\u003c/b\u003e\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eInput\u003c/strong\u003e \u003cp\u003eActive URL or dataset of URLs\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eOutput\u003c/strong\u003e \u003cp\u003ePhishing or legitimate URL classification\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eLoad the URL dataset labeled as phishing or legitimate.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eExtract the lexical features from each URL.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eApply the preprocessing steps to normalize the dataset.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eDivide the dataset into training and testing sets.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eTrain the Random Forest classifier on the extracted features.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003ePredict the class of user-provided or unseen URLs.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eDisplay the result of the classification.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e3.7 Evaluation Strategy of the model\u003c/h2\u003e \u003cp\u003eThe proposed system is evaluated using well-established performance metrics that are widely used in cybersecurity research. 
The metrics evaluated are as follows:\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3.3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummary of Model Evaluation Strategy\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetric\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMeasures the overall correctness of the classification\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eProportion of URLs flagged as phishing that are actually phishing; high precision reduces false alarms\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eProportion of actual phishing URLs that the model detects\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eHarmonic mean of precision and recall, balancing the two\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eConfusion Matrix\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eProvides a detailed breakdown of correct and incorrect predictions for error analysis\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eROC-AUC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eReflects the model's ability to rank phishing URLs above legitimate ones, indicating class separability\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e3.8 Methodological Advantages\u003c/h2\u003e \u003cp\u003eThe proposed approach offers several advantages, including:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eHigh accuracy through ensemble learning.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eEasy to integrate into existing security systems.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eRobust against zero-day phishing attacks.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eLightweight, content-independent architecture.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eLow computational overhead, suitable for real-time use.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eOverall, the methodology provides a scalable, efficient, and effective URL phishing detection system that combines lexical features with an enhanced 
random forest classifier. The study proposes a systematic way to address the limitations identified in the literature by providing a system suited to real-world deployment.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Results and Discussion","content":"\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Descriptive Analysis of the Dataset\u003c/h2\u003e \u003cp\u003eThe original dataset was collected from different sources and contains only two columns: the URL and a label indicating whether the URL is legitimate. Because the sources differ, the labels were inconsistent, with some using \u0026lsquo;legitimate\u0026rsquo; or \u0026lsquo;good\u0026rsquo; for legitimate URLs and \u0026lsquo;phishing\u0026rsquo; or \u0026lsquo;bad\u0026rsquo; for illegitimate URLs. After cleaning and preprocessing, legitimate URLs were labelled 0 and illegitimate URLs 1. The final processed dataset contained 560,776 URLs, of which 398,639 were legitimate (71.1%) and 162,137 were phishing (28.9%).\u003c/p\u003e \u003cp\u003eThe structural characteristics of the URLs differed between the two classes. In the dataset, legitimate URLs had an average length of 45.8 characters, compared with 63.6 characters for phishing URLs, an increase of 38.9%. 
This suggests that attackers tend to build long, complex URLs in order to make them deceptive.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4.1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDataset Characteristics\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCharacteristic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eValue\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTotal Samples\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e560,776\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLegitimate URLs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e398,639 (71.1%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePhishing URLs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e162,137 (28.9%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAvg. 
Length (Legitimate)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e45.8 characters\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAvg. Length (Phishing)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e63.6 characters\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Evaluation of the Baseline Model\u003c/h2\u003e \u003cp\u003eRandom Forest can capture non-linear relationships between features and, as an ensemble, reduces the overfitting risk of single decision trees, which is why it was chosen as the baseline model. The model was trained on 80% of the dataset, starting with 100 estimators. The remaining 20% was set aside for testing. Stratified sampling was used during the train-test split to make sure that model training reflected the real-world class distribution and did not introduce additional bias. The classification report revealed different performance across the two classes. The legitimate URLs (class 0) achieved a precision of 0.94, a recall of 0.96, and an F1-score of 0.95, indicating a low false positive rate. 
The phishing URLs (class 1) achieved a precision of 0.91, a recall of 0.85, and an F1-score of 0.87. Overall, the model achieved 93.01% accuracy.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4.2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummary of Baseline Model\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF1-Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLegitimate (0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePhishing (1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWeighted Average\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe high recall for the legitimate class (0.96) shows that the model rarely flags legitimate URLs as phishing, which means that real user traffic is seldom blocked. However, about 15% of phishing URLs, probably those that used advanced evasion techniques or compromised legitimate domains, were able to get past detection.\u003c/p\u003e \u003cp\u003ePerformance of the Enhanced Random Forest Model\u003c/p\u003e \u003cp\u003eThe Random Forest model with 100 estimators achieved strong accuracy and generalizability. The model's design inherently reduces overfitting by using bagging and random feature selection at each split, which is why it works well with data it has not seen before. 
The improved feature engineering process captured a full set of URL characteristics, such as structural features (like URL length and special character counts), protocol indicators (like the presence of HTTPS), domain analysis (like TLD detection and subdomain counting), special character counts (like dots and hyphens), and semantic features (like suspicious keyword detection and brand typosquatting indicators). This multi-dimensional feature space let the model find many different phishing patterns.\u003c/p\u003e \u003cp\u003eA phishing simulation was conducted on the model, and the model correctly flagged a fake phishing URL that looked like a banking portal (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.secure-login-bank.com/verify\u003c/span\u003e\u003cspan address=\"https://www.secure-login-bank.com/verify\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) with a phishing confidence of 87.0%. The suspicious keywords in the URL, the hyphenated domain structure, and the path depth all played a role in this high-confidence classification. The model correctly classified \u0026ldquo;openai.com/verify\u0026rdquo; as legitimate with 53.6% confidence, despite the presence of a verification-related keyword.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Feature Importance and Interpretability Analysis\u003c/h2\u003e \u003cp\u003eThe 54 lexical and semantic features engineered from the dataset generalized well to unseen data. The length of the URL was very important, which is consistent with the difference in average lengths between phishing and legitimate URLs. Counting special characters, especially dots, hyphens, and slashes, made it easier to tell phishing URLs apart. 
This is because they often use sophisticated subdomain hierarchies or hidden paths. Domain entropy, which measures how random a domain name is, was helpful for detecting Algorithmically Generated Domains (AGDs), a frequent mechanism that botnets use to manage their infrastructure. Typosquatting distance was another important engineered feature, since it could detect domains that imitated other domains. URLs within a Levenshtein distance of 1 or 2 from well-known brand names, such as \"gooogle.com,\" were strongly associated with the phishing label. The number of suspicious terms was also important, as it captured high-risk keywords for intent analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Discussion of Findings\u003c/h2\u003e \u003cp\u003eThe results of this study show that a Random Forest classifier can detect phishing in real time with high accuracy (93%) using only lexical features. Real-time operation is possible because the approach avoids the latency that comes with third-party reputation queries or scraping page content. This makes it suitable for use in networks with a lot of traffic. However, the analysis also shows the \"grey area\" of classification. The 15% false negative rate suggests that attackers are shifting to shorter, \"cleaner\" URLs that do not have obvious heuristic triggers. The probabilistic analysis also shows that real websites that use security-related words in their URLs are at risk of being falsely flagged. The 11% recall gap between legitimate and phishing URLs suggests that certain phishing URLs successfully mimic legitimate characteristics. This finding highlights the necessity of ongoing model updates to counteract the evolution of phishing strategies.\u003c/p\u003e \u003cp\u003eFuture work should focus on improving accuracy and closing the recall gap between legitimate and phishing URLs. 
Possible approaches include using WHOIS registration data to analyze registered domains, and character-level embeddings with deep learning architectures such as CNNs or LSTMs to better capture semantic patterns in the URL string that manual feature engineering might miss.\u003c/p\u003e \u003cp\u003eIn conclusion, the results in this chapter suggest that machine learning, such as Random Forest classification with careful feature engineering, is an effective technique for detecting phishing URLs. The 93% accuracy achieved by the model suggests that it is suitable for deployment in real-world applications. This study contributes to ongoing efforts to safeguard users from phishing attempts via intelligent automated detection systems, despite certain limitations and planned future enhancements.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe evolution of the digital era has made phishing one of the leading cyber threats that damage the reputation of individuals, groups, governments, and organizations. Attackers use fake websites, malicious URLs, and deceptive emails to illegally access financial records, login credentials, and sensitive organizational information. They use online platforms, e-commerce, and social media to target victims, who are compromised by clicking on the malicious links provided in phishing campaigns. Traditional approaches such as whitelists and blacklists are ineffective against real-time and zero-day phishing URL attacks because they depend on manually updating records whenever a malicious URL is identified. Similarly, heuristic-based techniques, which use suspicious keywords to detect malicious URLs, generate high false positive rates when dealing with deceptive links that were altered by the attackers. 
Machine learning addresses these limitations by learning from complex patterns, detecting threats in real time, and adapting to changes in attack tactics, and it has therefore gained growing interest as a practical solution to phishing. Lexical feature analysis in particular, which uses URL lexical structures such as digit density, URL length, suspicious keywords, IP addresses, HTTP/HTTPS presence, special characters, and number of subdomains, provides a high-performing, lightweight, real-time detection system that is independent of external resources such as web content or domain queries. To provide an efficient, scalable, and accurate phishing detection system, this study introduced an enhanced random forest classifier that offers high classification power when combined with carefully engineered lexical features. The study aims to provide a reliable solution by improving detection performance through feature importance analysis and by reducing overfitting while preserving the interpretability of the model. The study provides a useful and affordable URL phishing detection technique and contributes an ensemble-learning framework capable of real-time operation, which can be deployed in browser extensions, email filters, and enterprise solutions to protect security infrastructure against URL phishing attacks.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eEthics Approval\u003c/h2\u003e \u003cp\u003eThis study does not involve human participants, animal subjects, or sensitive personal data. Accordingly, formal ethics approval from an Institutional Review Board (IRB) or Ethics Committee was not required. Not applicable.\u003c/p\u003e \u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. 
Not applicable.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eA.I.S. and I.A.A. wrote the main manuscript text and performed the data analysis, feature engineering, and implementation of the Random Forest model. A.I.S. conducted the experiments and prepared the figures and tables. N.T. supervised the research work, validated the methodology, and provided critical revisions to the manuscript. All authors reviewed and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe dataset is available at https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eN. Abdelhamid, A. Ayesh, and F. Thabtah, \u0026ldquo;Phishing detection based Associative Classification data mining,\u0026rdquo; \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e, vol. 41, no. 13, pp. 5948\u0026ndash;5959, Oct. 2014, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.eswa.2014.03.019\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2014.03.019\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eN. Abdelhamid, F. Thabtah, and H. Abdel-jaber, \u0026ldquo;Phishing detection: A recent intelligent machine learning comparison based on models content and features,\u0026rdquo; in \u003cem\u003e2017 IEEE International Conference on Intelligence and Security Informatics (ISI)\u003c/em\u003e, Beijing, China: IEEE, Jul. 2017, pp. 72\u0026ndash;77. doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ISI.2017.8004877\u003c/span\u003e\u003cspan address=\"10.1109/ISI.2017.8004877\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eW. Guo, Q. Wang, H. Yue, H. 
Sun, and R. Q. Hu, \u0026ldquo;Efficient Phishing URL Detection Using Graph-based Machine Learning and Loopy Belief Propagation,\u0026rdquo; Jan. 12, 2025, \u003cem\u003earXiv\u003c/em\u003e: arXiv:2501.06912. doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2501.06912\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2501.06912\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eA. K. Dutta, \u0026ldquo;Detecting phishing websites using machine learning technique,\u0026rdquo; \u003cem\u003ePLOS ONE\u003c/em\u003e, vol. 16, no. 10, p. e0258361, Oct. 2021, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1371/journal.pone.0258361\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0258361\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003e\u0026ldquo;Website Phishing Detection Using Machine Learning Techniques,\u0026rdquo; \u003cem\u003eJ. Stat. Appl. Probab.\u003c/em\u003e, vol. 13, no. 1, pp. 119\u0026ndash;129, Jan. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18576/jsap/130108\u003c/span\u003e\u003cspan address=\"10.18576/jsap/130108\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eY. Kumar and B. Subba, \u0026ldquo;A lightweight machine learning based security framework for detecting phishing attacks,\u0026rdquo; in 2021 \u003cem\u003eInternational Conference on COMmunication Systems \u0026amp; NETworkS (COMSNETS)\u003c/em\u003e, Bangalore, India: IEEE, Jan. 2021, pp. 184\u0026ndash;188. 
doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/COMSNETS51098.2021.9352828\u003c/span\u003e\u003cspan address=\"10.1109/COMSNETS51098.2021.9352828\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eA. A. Albishri and M. M. Dessouky, \u0026ldquo;A Comparative Analysis of Machine Learning Techniques for URL Phishing Detection,\u0026rdquo; \u003cem\u003eEng. Technol. Appl. Sci. Res.\u003c/em\u003e, vol. 14, no. 6, pp. 18495\u0026ndash;18501, Dec. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48084/etasr.8920\u003c/span\u003e\u003cspan address=\"10.48084/etasr.8920\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJ. Vega, D. Shevchyk, and Y. Cheng, \u0026ldquo;A Literature Survey of Phishing and Its Countermeasures\u0026rdquo;.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eS. Marchal, G. Armano, T. Grondahl, K. Saari, N. Singh, and N. Asokan, \u0026ldquo;Off-the-Hook: An Efficient and Usable Client-Side Phishing Prevention Application,\u0026rdquo; \u003cem\u003eIEEE Trans. Comput.\u003c/em\u003e, vol. 66, no. 10, pp. 1717\u0026ndash;1733, Oct. 2017, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/TC.2017.2703808\u003c/span\u003e\u003cspan address=\"10.1109/TC.2017.2703808\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eA. Oest \u003cem\u003eet al.\u003c/em\u003e, \u0026ldquo;PhishTime: Continuous Longitudinal Measurement of the Effectiveness of Anti-phishing Blacklists\u0026rdquo;.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eN. F. Almujahid, M. A. Haq, and M. 
Alshehri, \u0026ldquo;Comparative evaluation of machine learning algorithms for phishing site detection,\u0026rdquo; \u003cem\u003ePeerJ Comput. Sci.\u003c/em\u003e, vol. 10, p. e2131, Jun. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.7717/peerj-cs.2131\u003c/span\u003e\u003cspan address=\"10.7717/peerj-cs.2131\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eM. Abutaha, M. Ababneh, K. Mahmoud, and S. A.-H. Baddar, \u0026ldquo;URL Phishing Detection using Machine Learning Techniques based on URLs Lexical Analysis,\u0026rdquo; in \u003cem\u003e2021 12th International Conference on Information and Communication Systems (ICICS)\u003c/em\u003e, Valencia, Spain: IEEE, May 2021, pp. 147\u0026ndash;152. doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICICS52457.2021.9464539\u003c/span\u003e\u003cspan address=\"10.1109/ICICS52457.2021.9464539\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eR. Verma and K. Dyer, \u0026ldquo;On the Character of Phishing URLs: Accurate and Robust Statistical Learning Classifiers,\u0026rdquo; in \u003cem\u003eProceedings of the 5th ACM Conference on Data and Application Security and Privacy\u003c/em\u003e, San Antonio, Texas, USA: ACM, Mar. 2015, pp. 111\u0026ndash;122. doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/2699026.2699115\u003c/span\u003e\u003cspan address=\"10.1145/2699026.2699115\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eX. Zhou and R. M. 
Verma, \u0026ldquo;Phishing Sites Detection from a Web Developer\u0026rsquo;s Perspective Using Machine Learning\u0026rdquo;.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eO. K. Sahingoz, E. Buber, O. Demir, and B. Diri, \u0026ldquo;Machine learning based phishing detection from URLs,\u0026rdquo; \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e, vol. 117, pp. 345\u0026ndash;357, Mar. 2019, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.eswa.2018.09.029\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2018.09.029\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eB. Banik and A. Sarma, \u0026ldquo;Lexical Feature Based Feature Selection and Phishing URL Classification Using Machine Learning Techniques,\u0026rdquo; in \u003cem\u003eMachine Learning, Image Processing, Network Security and Data Sciences\u003c/em\u003e, vol. 1241, A. Bhattacharjee, S. Kr. Borgohain, B. Soni, G. Verma, and X.-Z. Gao, Eds., in Communications in Computer and Information Science, vol. 1241., Singapore: Springer Singapore, 2020, pp. 93\u0026ndash;105. doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/978-981-15-6318-8_9\u003c/span\u003e\u003cspan address=\"10.1007/978-981-15-6318-8_9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eA. Moubayed, M. Injadat, A. Shami, and H. Lutfiyya, \u0026ldquo;DNS Typo-squatting Domain Detection: A Data Analytics \u0026amp; Machine Learning Based Approach,\u0026rdquo; in 2018 \u003cem\u003eIEEE Global Communications Conference (GLOBECOM)\u003c/em\u003e, Dec. 2018, pp. 1\u0026ndash;7. 
doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/GLOCOM.2018.8647679\u003c/span\u003e\u003cspan address=\"10.1109/GLOCOM.2018.8647679\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQ. E. U. Haq, M. H. Faheem, and I. Ahmad, \u0026ldquo;Detecting Phishing URLs Based on a Deep Learning Approach to Prevent Cyber-Attacks,\u0026rdquo; \u003cem\u003eAppl. Sci.\u003c/em\u003e, vol. 14, no. 22, p. 10086, Nov. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/app142210086\u003c/span\u003e\u003cspan address=\"10.3390/app142210086\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eA. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J.-P. Niyigena, \u0026ldquo;An Effective Phishing Detection Model Based on Character Level Convolutional Neural Network from URL,\u0026rdquo; \u003cem\u003eElectronics\u003c/em\u003e, vol. 9, no. 9, p. 1514, Sep. 2020, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/electronics9091514\u003c/span\u003e\u003cspan address=\"10.3390/electronics9091514\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eS. Hamadouche, O. Boudraa, and M. Gasmi, \u0026ldquo;Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models,\u0026rdquo; \u003cem\u003eICST Trans. Scalable Inf. Syst.\u003c/em\u003e, Apr. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.4108/eetsis.4421\u003c/span\u003e\u003cspan address=\"10.4108/eetsis.4421\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eN. 
Innab \u003cem\u003eet al.\u003c/em\u003e, \u0026ldquo;Phishing Attacks Detection Using Ensemble Machine Learning Algorithms,\u0026rdquo; \u003cem\u003eComput. Mater. Contin.\u003c/em\u003e, vol. 80, no. 1, pp. 1325\u0026ndash;1345, 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.32604/cmc.2024.051778\u003c/span\u003e\u003cspan address=\"10.32604/cmc.2024.051778\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJ. Chen, X. Wang, and F. Lei, \u0026ldquo;Data-driven multinomial random forest: a new random forest variant with strong consistency,\u0026rdquo; \u003cem\u003eJ. Big Data\u003c/em\u003e, vol. 11, no. 1, p. 34, Feb. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s40537-023-00874-6\u003c/span\u003e\u003cspan address=\"10.1186/s40537-023-00874-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Phishing, Cybersecurity, Machine Learning, Random 
Forest, Lexical Structure","lastPublishedDoi":"10.21203/rs.3.rs-9053171/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9053171/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003ePhishing is one of the most critical threats in cybersecurity; it uses URLs to mislead users into revealing sensitive information such as login credentials, financial information, and sensitive organizational data for the attackers' malicious intent. Blacklist, whitelist-based, and heuristic approaches have reduced such attacks but struggle when dealing with real-time and deceptive URLs. This study presents Enhanced Random Forest, a machine-learning-based framework that uses extensive lexical features for a phishing detection mechanism with high accuracy. The system extracts features from the URL, such as digit frequency, number of dots, subdomains, suspicious keywords, and special characters, to identify phishing, while avoiding reliance on external services such as domain lookups and webpage content, enabling lightweight and real-time detection. The system is trained on a comprehensive dataset of both phishing and legitimate URLs, preprocessed to balance the classes, encode the labels, and handle missing values. This enables the system to generalize to unseen behaviour that may arise in the future. The process is optimized by tuning hyperparameters to minimize overfitting and enhance generalization. In the experimental evaluation, the enhanced ensemble model outperforms other machine learning classifiers, achieving high accuracy, balanced recall, and a reduced false positive rate. Analysis of the most influential lexical indicators further enhances the interpretability of classification decisions. 
The results validate that ensemble machine learning combined with strong lexical feature engineering provides a computationally efficient and scalable solution that can be deployed in cybersecurity environments.\u003c/p\u003e","manuscriptTitle":"Enhance Random Forest Classifier for high accuracy URL phishing detection using lexical structure","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-20 03:13:33","doi":"10.21203/rs.3.rs-9053171/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"fffdb94d-2270-41d9-a8b6-45b78d089a18","owner":[],"postedDate":"March 20th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-03-20T03:13:33+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-20 03:13:33","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9053171","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9053171","identity":"rs-9053171","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

License: CC-BY-4.0