Decoding Gender: A Machine Learning Approach for Classifying Indian Names with Advanced Feature Extraction

doi:10.21203/rs.3.rs-5897194/v1

2025 · doi:10.21203/rs.3.rs-5897194/v1

preprint OA: closed

Research Article

Sudeep D. Ghate, Saishma H, Dhanush Ghate D, Adithya M, Anjusha Alex, and 2 more

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Classifying gender based on Indian names poses a unique challenge due to the nation's immense cultural, linguistic, and regional diversity. Existing methods often struggle to address the complexities of naming conventions shaped by religious, familial, and linguistic influences, resulting in inconsistent and inaccurate classifications.
To address these challenges, this study developed a culturally diverse dataset of 31.3 lakh (about 3.13 million) male and female names and leveraged advanced machine learning (ML) and deep learning (DL) techniques for gender classification. These names were sourced from Indian electoral data, synthetic names generated using custom scripts, and publicly available names from websites to ensure diversity. Twelve ML models were evaluated, and the top four were prioritized for detailed analysis: Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and XGBoost. CNN emerged as the best-performing model, achieving the highest accuracy (96%) and the fastest prediction time (5.61 seconds), highlighting its efficiency and ability to generalize across diverse naming conventions. LSTM and GRU also demonstrated strong performance, achieving accuracies of 95% and 93% respectively, with LSTM offering higher precision but significantly longer prediction times (50 seconds). XGBoost, a traditional ML model, achieved an accuracy of 86% but struggled with female name classification, indicating potential biases in feature representation. All models effectively captured complex naming patterns, though challenges such as the misclassification of unisex names and the underrepresentation of North-East Indian names in the dataset highlighted areas for improvement. This study underscores the advantages of deep learning models, particularly CNN, in leveraging hierarchical and sequential patterns in names for robust gender classification. However, limitations in dataset diversity and model generalizability indicate the need for further refinement. These findings contribute to advancing automated gender classification systems, offering practical applications in healthcare, marketing, and social sciences. Future work should focus on enhancing computational efficiency, expanding datasets to improve cultural inclusivity, and addressing biases to ensure equitable ML innovations.
Keywords: Artificial Intelligence and Machine Learning, Gender classification, Indian names, Machine learning, Feature extraction, Deep learning, Dataset diversity

Introduction

India's cultural diversity creates significant challenges in gender classification based on names due to the country's diverse languages, religions, and naming conventions. Names in India are influenced by regional, religious, and familial factors reflecting distinct linguistic and cultural traditions. For instance, the same name may have different gender associations across regions or religions, adding complexity to automated classification systems (Tripathi & Faruqui, 2011; Sharma, 2005). Accurately determining gender from names is essential across various fields such as demography, healthcare, and marketing. In demography, gender classification supports accurate data analysis and policy development. Across these fields, effective gender classification supports tasks like data segmentation, gender-specific research, and personalized services. Given India’s cultural and linguistic diversity, developing robust methods for gender classification becomes imperative (DeTienne et al., 2007).

Machine learning (ML) and natural language processing (NLP) offer promising solutions by learning patterns from labeled datasets, capturing complex cultural and linguistic nuances. Unlike traditional statistical methods that often fail to account for India's naming intricacies, ML models leverage large datasets to better adapt to cultural diversity. They also help reduce biases inherent in models trained primarily on Western datasets, creating more inclusive systems (Hu et al., 2021; Ghosh et al., 2023). For instance, ML models can distinguish subtle patterns in suffixes, prefixes, or phonetics that align with gender-specific naming conventions.
Despite progress, existing models frequently fall short in accuracy and inclusivity due to limited representation of Indian names in training datasets. Most prior research has either relied on Western-centric datasets or applied statistical methods, which overlook the cultural and linguistic diversity of Indian names, resulting in lower accuracy. This study addresses these gaps by developing predictive models using advanced ML techniques to improve gender classification from names. A key component of our approach is the creation of a comprehensive labeled dataset, ensuring representation across India's varied cultural and linguistic contexts. By exclusively focusing on Indian names, this study enhances gender classification for a culturally complex demographic and sets the stage for integrating diverse cultural perspectives into ML applications. While acknowledging that gender identity is not strictly binary, this study approaches gender inference as a binary classification task for practical applications such as policy development and data analytics (Radhakrishnan, 2011). Our research aims to advance automated gender classification systems by improving accuracy and cultural sensitivity. This contributes to more effective data analysis and informed decision-making across domains like healthcare, marketing, and social sciences, particularly in contexts requiring gender-segmented analysis.

Despite advancements in machine learning, several open questions remain regarding the classification of gender from Indian names. For instance, how effectively can machine learning models address the linguistic and cultural complexities inherent in Indian names? Can training on a culturally representative dataset mitigate biases and enhance accuracy across India’s diverse naming conventions? And to what extent do such models generalize across regions with distinct linguistic influences?
We hypothesized that a comprehensive, labeled dataset of Indian names, combined with advanced machine learning techniques, could significantly improve the accuracy and inclusivity of gender classification. To explore this, we developed a culturally diverse dataset of Indian names and trained machine learning models capable of capturing intricate patterns unique to Indian naming conventions. Our findings offer valuable insights into the design of culturally sensitive machine learning systems, with practical implications in areas such as human-computer interaction, marketing intelligence, and social analytics.

Related works

Many studies have employed ML strategies for named entity recognition and gender classification, highlighting their effectiveness across various languages and datasets (Table 1). Traditional methods, such as Naive Bayes, have been widely used. For example, a Naive Bayes system achieved 77.2% accuracy in classifying Kannada names (Amarappa & Sathyanarayana, 2015). Logistic regression has also been explored for name-based gender classification, achieving 83.7% accuracy on Indonesian names, which improved to 98.6% when combined with a Convolutional Neural Network (CNN) analyzing profile photos (Manik et al., 2019). Deep learning approaches have shown significant promise; for example, embedding Chinese names with the Pinyin approach using the BERT model achieved 95% accuracy, outperforming Naive Bayes and Gradient Boosting methods (Sun et al., 2021). Similarly, dual Long Short-Term Memory (LSTM) models effectively classified genders in a dataset of 21 million first names (Hu et al., 2021), while neural networks like MLPs and Gated Recurrent Units (GRU) consistently achieved over 90% accuracy on Brazilian names (Rego & Silva, 2021). A study on Bangladeshi names achieved a peak accuracy of 92.16% using Conv1D models, addressing challenges posed by unisex names and proposing future directions for improvement (Kabir et al., 2022).
Table 1. Comprehensive summary of related works on name-gender classification

Article | Origin of Names | Language | Processing Methods | Dataset and Results
Amarappa and Sathyanarayana (2015) | NA | Kannada | Naive Bayes classifier | 10-fold cross-validation accuracy of 77.2% for classifying Kannada names.
Jia and Zhao (2019) | Chinese | Chinese | Logistic regression on written and spoken names, BERT with Pinyin embedding | Accuracy of 93%.
Manik et al. (2019) | Indonesian | Indonesian | Logistic regression on names, CNNs on profile photos | Accuracy of 83.7% for names, 98.6% for names and photos.
Hu et al. (2021) | USA | English | Character-based machine learning models such as LSTM and BERT. | A dataset of 21 million unique first names from SSA and YAHOO data. LSTM and BERT-based different models obtained approximately 87% and 88% accuracy, respectively.
Rego and Silva (2021) | Brazilian | English | DNN-based models, including MLP, RNN, GRU, CNN, and Bi-LSTM. | The dataset consists of 100,787 Brazilian names, of which 54.82% are female names and 45.18% are male, based on 2010 CENSO data. CNN, Bi-LSTM, RNN, MLP, and GRU-based models achieved 92%, 95%, 93%, 86%, and 94% accuracy, respectively.
Panchenko et al. (2014) | Russian | English | Statistical models based on one type of features: endings, character trigrams, and dictionary. | A dataset of 100,000 Russian full names from Facebook. Accuracy of up to 96% is obtained from a combined model: endings + 3-grams + dictionary.
Tripathi and Faruqui (2011) | Indian | English | Morphological features and n-gram suffixes with SVM classification | A dataset of 2000 Indian names in Gujarati, Tamil, Telugu, Hindi, Urdu, and Bengali. The training dataset had 890 female and 1110 male names. This model achieved the maximum F1 score at 94.9%.
Kabir et al. (2022) | Bangladesh | Bangla | Conv1D, LSTM, BiLSTM, and Stacked LSTM models were used for gender recognition, with tokenized characters converted into numerical embeddings and optimized through hyperparameter tuning. | A dataset of 2,030 unique Bangladeshi names. The Conv1D model achieved the highest accuracy at 91.18%.

Several studies have explored Indian names specifically to classify them according to gender. Morphological traits and n-gram suffixes improved SVM-based classification (Tripathi & Faruqui, 2011). A character-level BiLSTM model with a conditional random field achieved 94% accuracy in segmenting Indian author names, demonstrating the effectiveness of deep learning for handling diverse naming conventions (Santosh et al., 2020). Simpler models, such as the SimpleText model, have also performed well, achieving 94.67% accuracy on a dataset of 84,899 names, comparable to more complex architectures like LSTMs (Ghosh, 2021). These studies underscore the potential of machine learning for name-based gender classification while highlighting challenges such as linguistic diversity, unisex names, and cultural nuances. Building on these insights, our study focuses on leveraging advanced machine learning techniques with a comprehensive dataset of Indian names to address these complexities effectively.

Methodology

Data Collection and Preparation

The electoral roll data was manually downloaded from the Election Commission of India (ECI) website (https://voters.eci.gov.in/download-eroll, last accessed on Nov 5, 2024). Each state’s data was available either in English or a regional language, with preference given to English files. The PDF files contained structured information for each individual, including name, relative’s name, address, and gender. However, as the name data was often in image format, we utilized optical character recognition (OCR) with Pytesseract (Lefèvre et al., 2016) and PyPDF2 (Fenniak et al., 2022) for text extraction. The text extraction process involved converting each page of the PDF to an image, enhancing image quality (contrast adjustments, noise reduction), and applying OCR.
Irrelevant pages were skipped, and regular expressions were employed to identify name fields and gender-related information. Names extracted from regional language PDFs were translated to English using Google Translate to ensure uniformity. A total of 150–200 files from each state were processed based on the population share of the state. Names with missing or ambiguous gender information were excluded. The results were stored in dictionaries, and the final dataset was saved as CSV files for subsequent analysis. The details of names extracted from each state are given in Supplementary Table S1.

Challenges included regional language barriers and inconsistencies in electoral roll maintenance across states. Regional language entries pose a significant barrier to uniformity in electoral rolls. For instance, while names in South Indian states (except Kerala), Delhi, Jammu and Kashmir, and the northeastern states were often available in English, those in Assam, Tripura, Gujarat, Odisha, Punjab, and Kerala were recorded exclusively in regional languages. These discrepancies complicated the data extraction process. Additionally, gender entries in some records used non-standard terms like "boy" or "girl," requiring manual resolution to maintain consistency.

Dataset Augmentation

To enrich the dataset, additional male and female names were extracted from various websites using the Beautiful Soup Python library (Richardson, 2007) for web scraping. This process emphasized culturally and religiously diverse names to reflect the demographic variety of the population. The sources and counts of names from different websites are detailed in Supplementary Table S2. Further, insights from the literature (Singh KS, 1996; Sharma, 2005) guided the generation of a comprehensive list of first names and surnames across states, ensuring representation of religions, cultures, tribes, and communities.
Custom Python scripts generated random combinations of first and last names, including suffixes and tribal names where applicable. Details of synthetically generated names are provided in Supplementary Table S3. To facilitate broader usability, the scripts developed for text extraction, name synthesis, and data processing were consolidated into a standalone Python package. The package’s current version and documentation are available at https://pypi.org/project/indigen/.

Data Cleaning and Preprocessing

The raw data from electoral name files underwent a comprehensive cleaning process to ensure uniformity and relevance for further analysis. Punctuation, honorifics, and titles (e.g., "Mr.," "Mrs.," "Miss," "Ms.") were removed, and initials or words shorter than three characters were eliminated. Names were standardized by removing special characters, converting them to lowercase, and trimming extraneous spaces. These steps ensured that only relevant and valid names remained for further analysis. An iterative filtering process was applied to exclude names with fewer than four characters, more than three words, or overly complex structures. Additionally, uncleaned entries (e.g., short or overly complex names) were stored separately for further review or reprocessing. Rows with missing values were dropped, and gender labels were standardized by converting them to lowercase and replacing variations like 'woman', 'women', 'man', and 'men' with 'female' and 'male'. Instances of names containing specific substrings were analyzed to check for contextual inconsistencies in gender labeling. For names where a female-related substring was present but not as the first word, gender labels were updated to 'Female' based on cultural conventions of the state after manual verification. Names classified as artifacts (e.g., containing repeated letters such as 'aaa' or special characters) were identified and removed.
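The cleaning rules described above can be illustrated with a minimal sketch. It is not the authors' script: the `HONORIFICS` and `GENDER_MAP` collections and the function names are hypothetical, and the thresholds are simply the ones quoted in the text (tokens shorter than three characters dropped; names with fewer than four characters, more than three words, or a tripled letter set aside).

```python
import re

# Hypothetical lists for illustration; the paper gives these only as examples.
HONORIFICS = {"mr", "mrs", "miss", "ms"}
GENDER_MAP = {"woman": "female", "women": "female", "man": "male", "men": "male"}


def clean_name(raw):
    """Return a cleaned, lowercase name, or None if the entry should be set aside."""
    # Replace anything that is not a letter or whitespace with a space, then lowercase.
    text = re.sub(r"[^A-Za-z\s]", " ", raw).lower()
    # Drop honorifics and tokens shorter than three characters (e.g. initials).
    words = [w for w in text.split() if w not in HONORIFICS and len(w) >= 3]
    name = " ".join(words)
    # Set aside names that are too short, have too many words,
    # or contain the same character three times in a row (e.g. 'aaa').
    if len(name) < 4 or len(words) > 3 or re.search(r"(.)\1\1", name):
        return None
    return name


def clean_gender(label):
    """Lowercase a gender label and map common variants to 'female' or 'male'."""
    label = label.strip().lower()
    return GENDER_MAP.get(label, label)
```

Under these assumptions, `clean_name("Mrs. Lakshmi K. Devi")` yields `"lakshmi devi"`, while an artifact such as `"Aaaa##"` is rejected; rejected entries would be stored separately for review rather than discarded.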
Periods, single-letter entries, and unwanted characters such as '?', '*', or '#' were also cleaned using regex-based transformations. Finally, all names were converted to title case for uniformity.

A custom script was developed to clean Hindi names from states whose electoral data was in Hindi and that were incorrectly translated into English by Google Translate. To manage this, we utilized the NLTK library (https://www.nltk.org/; NLTK Project, 2001–2024). Using the NLTK library, a set of valid English words was created and cross-checked against translated names. A custom exclusion list of valid Indian names was also applied to identify and remove inconsistencies while avoiding the removal of legitimate Indian names. Names incorrectly flagged as English words but unrelated to valid translations were identified and exported for manual verification and cleaning. After merging the website and synthetic names, duplicate removal was performed by grouping on name and gender to obtain the final non-electoral dataset, which was then merged with cleaned data from the electoral site to form the final dataset for machine learning.

Feature engineering

A comprehensive feature engineering strategy was employed to transform the raw text data (names) into meaningful representations for machine learning models. For traditional machine learning algorithms such as KNN, CatBoost, Decision Tree, Logistic Regression, MLP, Random Forest, and XGBoost, vectorization techniques like Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) (Sammut and Webb, 2011) and character bigrams (Cavnar et al., 1992) were utilized to capture both frequency-based and sequential patterns in the data. To address the issue of high dimensionality, Singular Value Decomposition (SVD) (Zhang, 2015) was applied to these vectorized features, reducing them to 100 components each.
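To make the character-bigram and TF-IDF vectorization concrete, here is a small pure-Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the smoothed IDF formula is one common variant (the text does not say which weighting was used), and the SVD reduction to 100 components is omitted because it would normally be done with a numerical library.

```python
import math
from collections import Counter


def char_bigrams(name):
    """Split a name into overlapping two-character sequences."""
    s = name.lower().replace(" ", "_")  # mark word boundaries explicitly
    return [s[i:i + 2] for i in range(len(s) - 1)]


def tfidf_bigram_vectors(names):
    """Return (vocabulary, rows), with one TF-IDF weight vector per name."""
    counts = [Counter(char_bigrams(n)) for n in names]
    vocab = sorted(set().union(*counts))
    n_docs = len(names)
    # Document frequency: the number of names that contain each bigram.
    df = Counter(bg for c in counts for bg in c)
    # Smoothed inverse document frequency (an assumed variant).
    idf = {bg: math.log((1 + n_docs) / (1 + df[bg])) + 1 for bg in vocab}
    # Term frequency (raw count) times IDF; absent bigrams get weight 0.
    rows = [[c[bg] * idf[bg] for bg in vocab] for c in counts]
    return vocab, rows


vocab, rows = tfidf_bigram_vectors(["asha", "usha"])
```

In this toy example the bigrams shared by both names (`sh`, `ha`) receive a lower weight than the ones that distinguish them (`as`, `us`), which is the property that lets TF-IDF highlight distinctive name components.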
Additionally, manual features such as name length, vowel counts, and phonetic representation lengths (derived using the Double Metaphone algorithm) were incorporated to highlight linguistic and cultural trends. These features were normalized and combined into a unified feature matrix. Finally, Principal Component Analysis (PCA) was applied to the combined matrix, retaining 95% of the variance for efficiency. For models leveraging BERT-based feature engineering, tokenized names were processed through a pre-trained DistilBERT transformer model to generate embeddings that captured semantic and contextual information. These embeddings were reduced to 128 components using PCA and subsequently integrated with the traditional and manual features to provide a robust input for machine learning algorithms. For deep learning models such as LSTM, CNN, and GRU, a distinct feature engineering approach was adopted to capture sequential character-level patterns. Names were tokenized at the character level, and the resulting sequences were padded to a uniform maximum length to ensure consistency across inputs. In addition to the sequential data, auxiliary features like name length were extracted and processed separately. The padded sequences, along with the auxiliary features, were fed into the models, enabling them to capture both sequential patterns and supplementary linguistic information. The target variable, gender, was label-encoded to ensure compatibility with neural network architectures. The processed features were then split into training and testing datasets to facilitate model evaluation and performance benchmarking. The final feature matrix, comprising dimensionality-reduced embeddings, manually engineered attributes, and interaction terms, provided a balanced and enriched representation of the dataset. 
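The character-level input preparation for the neural models can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, padding on the right and the 1/0 label mapping are assumptions, and characters not seen when building the index are mapped to the padding id.

```python
def build_char_index(names):
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_and_pad(names, char_index, max_len):
    """Encode names as integer sequences, truncated or right-padded to max_len."""
    sequences = []
    for name in names:
        # Unknown characters fall back to 0, the same id used for padding.
        ids = [char_index.get(ch, 0) for ch in name.lower()][:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences


names = ["Asha", "Ravi Kumar"]
genders = ["female", "male"]

char_index = build_char_index(names)
max_len = max(len(n) for n in names)
X_seq = encode_and_pad(names, char_index, max_len)  # sequential input
X_aux = [[len(n)] for n in names]                   # auxiliary feature: name length
y = [1 if g == "female" else 0 for g in genders]    # label encoding (arbitrary mapping)
```

Every row of `X_seq` has the same length, so the sequences can be fed to an embedding layer, while `X_aux` is kept as a separate input that a model can concatenate with the sequence branch later.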
The final feature matrix served as a unified input for machine learning models, allowing them to effectively leverage both linguistic and statistical nuances for enhanced classification performance.

Model selection

Twelve models were adapted to classify gender based on names: KNN, CatBoost (Prokhorenkova et al., 2018), CNN (LeCun et al., 1998), Decision Tree (Ryzin, 1986), GRU (Cho, 2014), Logistic Regression (Trendowicz et al., 2014), Logistic Regression with BERT, LSTM (Hochreiter & Schmidhuber, 1997), MLP, Random Forest (Breiman, 2001), XGBoost (Chen, 2016), and XGBoost with BERT. These models were selected for their proven effectiveness across various classification tasks, ensuring suitability for our specific application. K-Nearest Neighbors (KNN) was chosen for its simplicity and interpretability, providing a foundational comparison point. CatBoost, known for handling categorical data efficiently, was selected for its strength in preventing overfitting. CNNs excel in capturing spatial hierarchies within name sequences, while GRU and LSTM models were chosen for their ability to handle sequential dependencies in names. Logistic Regression served as a basic model for comparison, and combining it with BERT (Kenton, 2019) provided enhanced contextual understanding. Random Forest and XGBoost were included for their ensemble learning capabilities and speed, with XGBoost with BERT combining the strengths of both to capture deeper patterns in names. Each model was selected to address the unique complexities of name-based gender classification, ensuring high accuracy and cultural sensitivity.

Hyperparameter tuning

Hyperparameter tuning for all models was conducted using Bayesian Optimization via Optuna, with a custom implementation employing a 5-fold cross-validation approach over 30 trials.
For GRU, LSTM, and CNN models, the tuning process included 20 epochs of training with early stopping, configured with a patience of 3 to halt training if no improvement in performance was observed. Ten percent of the original data was used for this initial screening. The dataset was split into training and evaluation sets using a stratified split, with 70% of the data allocated for training and 30% for evaluation. Validation accuracy from each fold served as the optimization metric.

Model architecture

The architectures for the LSTM, GRU, and CNN models were designed to process sequential data alongside additional features. Each model utilized an embedding layer to process input sequences of characters, followed by either a recurrent or convolutional layer. The GRU model employed a Bidirectional GRU layer, where the number of units and dropout rates were tuned during the hyperparameter optimization process. The output from the GRU layer was then concatenated with an additional input feature and passed through a dense layer with ReLU activation and L2 regularization, with the final output produced by a sigmoid activation function for binary classification. The CNN model followed a similar input structure, using an embedding layer followed by a 1D convolutional layer with tunable filters and kernel sizes. After pooling and flattening the feature maps, these were concatenated with the additional input feature, followed by dense layers with ReLU activation, dropout for regularization, and L2 regularization, before outputting the final binary classification result via a sigmoid activation. The LSTM model mirrored the GRU model, using a Bidirectional LSTM layer instead, with similarly tuned dropout rates and recurrent dropout. The output from the LSTM layer was concatenated with the additional feature and passed through a dense layer with ReLU activation and L2 regularization, concluding with a sigmoid activation.
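The early-stopping rule used during tuning (at most 20 epochs, patience of 3) can be illustrated with a short, framework-free sketch. The function name is hypothetical, and monitoring validation loss is an assumption, since the text only says training halts when no improvement in performance is observed.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # Patience exhausted: stop here and keep the best epoch seen so far.
                return epoch, best_epoch
    # Training ran to the end of the epoch budget without triggering the rule.
    return len(val_losses), best_epoch


losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stop_short, best_short = early_stopping_epoch(losses, patience=3)
stop_long, best_long = early_stopping_epoch(losses, patience=8)
```

With a patience of 3 the run above stops at epoch 6 and never reaches the later improvement at epoch 7, whereas a larger patience lets training continue. This trade-off is one reason a short patience suits quick screening and a longer one suits a final training run.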
The LSTM, GRU, and CNN architectures were optimized to handle sequential data and additional features, allowing the models to learn effectively from complex patterns.

Final training, validation and testing

Once the optimal hyperparameters were determined, each model was initially trained on a 10% subset of the training dataset using these settings. After tuning, the models were evaluated on the test dataset to assess their generalization performance. Metrics such as accuracy, classification reports, and confusion matrices were calculated to quantify the reliability and effectiveness of the models. From the list of 12 models, the best 4 were selected based on accuracy and other performance metrics, and these models were retrained using the full training dataset. This final training phase used up to 100 epochs with an early-stopping patience of 8. For XGBoost, since epochs cannot be used, hyperparameter tuning was performed again with the full dataset across 50 trials and the best parameters were used for the final model. The final models were saved as .h5 files and testing was performed on the full test dataset. For the final evaluation, the time taken for prediction, confusion matrix, precision, recall, F1-score, and support were calculated for the selected models. Additionally, training vs. validation loss and training vs. validation accuracy were plotted to visualize the convergence and performance trends.

Hardware specifications

The experiments were performed on a system with 32GB of RAM and an NVIDIA RTX 3060 GPU with 12GB of memory, providing sufficient computational resources for the training and evaluation of the deep learning models. The models were trained using TensorFlow (Abadi et al., 2016) and the experiments were executed within Jupyter Notebook to facilitate iterative development and evaluation.
Scripts are available in the accompanying GitHub repository for reproducibility and further exploration. The following libraries were utilized in the analysis: Python 3.10, TensorFlow, pandas, matplotlib (Hunter et al., 2007), seaborn (Waskom M et al., 2020), SciPy (Virtanen P et al., 2020), and numpy.

Data Access and Ethical Considerations

We adhered to all applicable guidelines and permissions governing the use of electronic records, ensuring compliance with data protection policies. Data extraction was limited to non-identifiable information (names and general demographic indicators) and conducted responsibly following approval from the Institutional Scientific and Ethical Board (NU/CEC/2024/723 study number CEC/275 dt. 15.11.2024). To safeguard privacy, all data was anonymized upon extraction, retaining only names for model training. Sensitive information such as addresses or personal identifiers was not collected. Following model training, the raw electoral roll data was permanently deleted and only aggregated results and statistical summaries were retained. Access to the data was restricted to authorized study personnel, and all procedures complied with Institutional Ethical Committee guidelines and the Information Technology Act of India, emphasizing our commitment to data protection and privacy.

Results

A flowchart illustrating the process of data collection, cleaning and classification of gender based on names has been provided (Fig. 1). The names were extracted from electoral rolls, additional names were gathered by web scraping from various name websites, and synthetically generated names were also added to enrich the dataset. Table 2 summarizes the dataset used for name-gender classification. It shows the number of male and female names in each stage of data collection and processing, starting with raw electoral data and ending with the final test set.
The dataset includes both real-world data from electoral rolls and websites, as well as synthetically generated names to balance the dataset. The combined dataset for name-gender classification includes 15,74,573 male names and 15,56,468 female names, totaling 31,31,042 names, after incorporating cleaned electoral data, names from websites, and synthetically generated names. Data cleaning was conducted to eliminate irrelevant characters and maintain the integrity of the names retained for analysis. Advanced feature engineering techniques mentioned in Table 3, such as Bag-of-Words (BoW), Word2Vec, Character n-grams, and Term Frequency-Inverse Document Frequency (TF-IDF), transformed raw text data into significant representations for machine learning applications. Dimensionality reduction methods, namely SVD on the vectorized features and PCA on the BERT embeddings, were applied to reduce the feature space and enhance model performance.

Utilizing a balanced dataset of names classified by male and female gender, we systematically investigated various deep learning architectures and machine learning algorithms mentioned in Table 4, including CNN, GRU, LSTM, and XGBoost, to determine the most effective model for this classification task. Ten percent of the dataset was used for training and testing during the initial screening. Graphs showing training and validation accuracy and loss over iterations were generated during the hyperparameter tuning process for all models. These figures (Fig. S1 and Fig. S2) illustrate the convergence behavior and performance trends. The top-performing models were selected based on their validation performance and retrained on the entire dataset (training + validation) to enhance generalization. The final performance was then evaluated on a separate test dataset, and the model with the best test set performance was chosen for name-gender classification.
Table 2 Distribution of male and female names in the name-gender classification dataset
Dataset | Male names | Female names | Total
Electoral data before cleaning | 12,57,740 | 21,22,284 | 33,80,024
Electoral data after cleaning | 10,87,130 | 9,45,976 | 20,33,075
Websites | 59,766 | 53,781 | 1,13,547
Synthetic (before duplicate removal) | 10,57,074 | 12,13,053 | 22,70,127
Synthetic (after duplicate removal) | 4,87,443 | 6,10,492 | 10,97,967
Final combined and cleaned training data | 15,74,573 | 15,56,468 | 31,31,042
Final test dataset | 1,23,569 | 1,17,300 | 2,40,859

Table 3 Feature engineering techniques used for name-gender classification: advantages and trade-offs
Technique | Description | Advantages for Name Classification | Limitations | Role in This Work
Bag-of-Words (BoW) | Converts names into vectors of word counts or frequencies. | Simple, fast, and effective for capturing frequent name patterns. | Ignores order, context, and phonetics, limiting nuanced recognition. | Captured basic frequency-based name patterns.
TF-IDF | Weighs names by term importance in the dataset. | Highlights distinctive name components and rare patterns. | Ignores sequence, semantic relationships, and name structure. | Enhanced feature representation for name uniqueness.
Character Bigrams | Splits names into overlapping two-character sequences. | Captures character-level patterns in prefixes, suffixes, and roots. | High dimensionality; sparse representations can lead to overfitting. | Identified morphological trends like prefixes/suffixes.
Singular Value Decomposition (SVD) | Reduces dimensionality of BoW, TF-IDF, and n-gram features. | Improves efficiency while retaining key patterns in names. | Loss of interpretability and fine-grained details. | Reduced vectorized features to 100 components.
Additional Features | Length of names, vowel counts, and phonetic lengths. | Encodes intuitive patterns, e.g., short male names or vowel-rich female names. | Limited predictive power if used in isolation. | Highlighted linguistic trends in names.
Double Metaphone | Encodes names into phonetic representations. | Groups similar-sounding names. | Can oversimplify phonetic variations in complex names. | Captured phonetic similarity trends in gendered names.
Principal Component Analysis (PCA) | Reduces dimensions of combined feature matrices. | Improves computational efficiency; focuses on variance. | May lose semantic and contextual nuances. | Retained 95% variance for efficiency.
DistilBERT Embeddings | Contextual semantic embeddings from pre-trained transformer. | Captures cultural, regional, and contextual name patterns. | Computationally intensive; may need fine-tuning for specific datasets. | Provided robust, contextual name representations.
Tokenized Character Sequences | Converts names into uniform-length character sequences. | Preserves order, enabling models to learn sequential patterns. | Sequence padding/truncation can affect input consistency. | Input for neural models like CNN, GRU, and LSTM.
Label Encoding | Encodes gender (target) numerically for models. | Simple and compatible with machine learning models. | Doesn't capture complex target relationships (e.g., non-binary genders). | Encoded binary gender labels.

Table 4 Strengths and weaknesses of models used in this study
Model | Strengths | Weaknesses
CNN | Excellent for learning local patterns in character sequences. | Requires large amounts of labeled data and may struggle with long-range dependencies.
Decision Tree | Highly interpretable and easy to visualize. | Prone to overfitting, especially with small datasets.
GRU | Efficient at capturing sequential dependencies. | Struggles with learning very long-term dependencies.
KNN | Simple and interpretable, effective for small datasets. | Computationally expensive and less suitable for large datasets.
Logistic Regression | Simple, interpretable, and fast for binary classification. | Assumes linear relationships, struggles with complex or nonlinear patterns.
Logistic Regression with BERT | Combines simplicity with contextualized feature extraction. | Computationally expensive due to embedding generation.
LSTM | Captures long-range dependencies in character sequences. | Complex, slow to train, and resource-intensive.
MLP | Versatile and effective for structured inputs like feature matrices. | Prone to overfitting without regularization.
Random Forest | Robust and reduces overfitting by averaging multiple decision trees. | Less interpretable than single decision trees.
XGBoost | Highly efficient, scalable, and fast with boosting techniques. | Sensitive to hyperparameter tuning, which can affect performance.
XGBoost with BERT | Combines contextual feature extraction with powerful boosting. | Resource-intensive and requires careful optimization.

Justification for choosing the model

Twelve algorithms were selected to represent a diverse range of techniques. Table 4 summarizes the key strengths and weaknesses of these machine learning models, providing a comparative overview. The selection included parametric and non-parametric approaches, linear and non-linear models, and ensemble methods. This diversity makes the set suitable for handling various types of data and classification problems. Random Forest is an ensemble method that excels at handling large datasets with many features, reducing overfitting and improving generalization for both classification and regression tasks (Iranzad & Liu, 2024). For high-dimensional classification tasks, Support Vector Classification (SVC) is effective at modeling complex nonlinear relationships and finding clear decision boundaries (Zou et al., 2023). K-Nearest Neighbors (KNN) classifies each data point by the majority class of its nearest neighbors, which makes it suitable for tasks like image classification and recommendation systems (Syriopoulos et al., 2023). Gaussian Naive Bayes (GaussianNB) assumes feature independence and a Gaussian distribution, making it efficient for high-dimensional data and tasks like spam filtering and sentiment analysis (Rehman et al., 2019; Tüfekci & Bektaş, 2022).
The Decision Tree Classifier partitions the feature space to create human-readable decision rules but may overfit noisy data (Rehman et al., 2019). The MLP Classifier, a neural network model, can learn complex nonlinear patterns, making it suitable for tasks like image classification and natural language processing (D’Silva & Sharma, 2022). XGBoost, an ensemble method combining decision trees via gradient boosting, is known for high performance with large datasets (Ghosh, 2021). CatBoost, in turn, handles categorical features effectively by using a dedicated encoding algorithm (Qureshi & Sabih, 2021). GRU and LSTM networks are RNNs designed for sequence prediction, with GRUs excelling in time-series and NLP tasks and LSTMs capturing long-range dependencies for applications like speech recognition and text generation. Lastly, 1D Convolutional Neural Networks (1D CNNs) are specialized for processing sequential data and are particularly effective in time-series analysis and signal processing, where they learn spatial hierarchies of features.

Hyperparameter tuning

Optimal hyperparameters were identified through a comprehensive tuning process over 30 epochs, so that each model could reach its best performance. The hyperparameter ranges explored during tuning are detailed in Table 5. The final configurations for each machine learning model, that is, the parameters that produced the most accurate results, were determined through an extensive hyperparameter search (Table 6). Key hyperparameters such as neighbour count, regularization strength, learning rate, batch size, dropout rate, and L2 regularization were extensively tuned. CatBoost reached a score of 0.7997 with L2 leaf regularization (‘l2_leaf_reg’) set to 5.84 and a learning rate of 0.0959. MLP achieved 0.8054 by configuring hidden layers and regularization parameters to enhance generalization.
Log-BERT and XGBoost both achieved a leading score of 0.8912; for XGBoost, this was obtained with a regularization gamma of 1.3797 and a learning rate of 0.2290, demonstrating robustness in handling complex datasets. The LSTM and GRU models achieved 88.5% and 88.1% validation accuracy, respectively, with carefully tuned batch sizes, units, regularization, and learning rates (0.0033 for the LSTM; the GRU settings are detailed below). The CNN achieved a validation accuracy of 84.6% with optimal hyperparameters that included a batch size of 32, a dropout rate of 20%, an L2 regularization value of 7.45e-6, and a learning rate of 0.00162. With its best hyperparameters, the GRU model reached 88.1% validation accuracy: a batch size of 32, 128 GRU units, a 20% dropout rate, 58 dense units, an L2 regularization of 1.53e-6, and a learning rate of 0.00093. These parameters enhance the GRU’s ability to capture sequential dependencies. Together, these results demonstrate the importance of selecting appropriate hyperparameters for achieving optimal model performance across algorithms.

Table 5 Hyperparameters used for model optimization with range and description
Model | Hyperparameter | Range/Values | Description
Logistic Regression | C, solver | [0.001, 0.01, 0.1], ['saga', 'lbfgs'] | Regularization strength; optimization algorithm.
Logistic Regression | max_iter, penalty | [1000, 2000, 3000], ['l2', 'l1'] | Max iterations; regularization type.
Random Forest | n_estimators, max_depth | [100, 200], [None, 10, 20] | Number of trees; tree depth.
KNN | n_neighbors | [3, 5, 7] | Number of nearest neighbors.
Decision Tree | max_depth, min_samples_split | [None, 10, 20], [2, 5] | Tree depth; min samples for splitting.
MLP | hidden_layer_sizes, alpha | [(50,), (100,), (50, 50)], [0.0001, 0.001] | Hidden layers; regularization strength.
MLP | max_iter, learning_rate_init | [1000], [0.0001, 0.001] | Max iterations; initial learning rate.
MLP | activation, early_stopping, n_iter_no_change | ['relu'], [True], [10] | Activation function; early stopping settings.
XGBoost | max_depth, learning_rate | (3, 10), (0.01, 0.3) | Tree depth; step size shrinkage.
XGBoost | n_estimators, gamma | (50, 500), (0, 5) | Boosting rounds; min loss for splits.
XGBoost | min_child_weight, subsample, colsample_bytree | (1, 10), (0.5, 1.0), (0.5, 1.0) | Min child weight; sample fractions.
CatBoost | iterations, depth | [100, 200], [3, 5, 7] | Boosting iterations; tree depth.
CatBoost | learning_rate | [0.01, 0.1] | Step size shrinkage.
GRU | gru_units, dense_units | [32, 128] (step 16), [10, 64] (step 16) | GRU units; dense layer size.
GRU | dropout_rate, recurrent_dropout_rate | [0.1, 0.5] (step 0.05), [0.0, 0.4] (step 0.05) | Regularization dropouts.
GRU | learning_rate, batch_size, l2_reg | [1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform) | Optimization settings.
CNN | filters, kernel_size | [64, 256] (step 32), [3, 7] (step 2) | Conv1D filters; kernel size.
CNN | dense_units, dropout_rate | [16, 128] (step 16), [0.1, 0.5] (step 0.05) | Dense layer size; regularization dropout.
CNN | learning_rate, batch_size, l2_reg | [1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform) | Optimization settings.
LSTM | lstm_units, dense_units | [32, 128] (step 16), [10, 64] (step 16) | LSTM units; dense layer size.
LSTM | dropout_rate, recurrent_dropout_rate | [0.1, 0.5] (step 0.05), [0.0, 0.4] (step 0.05) | Regularization dropouts.
LSTM | learning_rate, batch_size, l2_reg | [1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform) | Optimization settings.
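The GRU search space in Table 5 mixes stepped integer ranges, stepped dropout ranges, and log-uniform ranges. The sketch below shows one way such a space could be sampled in plain Python. Table 6 reports that the study tuned its models with Bayesian optimization; the code here draws independent random samples instead and is only meant to make the range notation concrete. The function names (`sample_loguniform`, `sample_stepped`, `sample_gru_config`) and the use of Python's `random` module are assumptions, not part of the original work.

```python
import math
import random


def sample_loguniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_stepped(rng: random.Random, low, high, step):
    """Draw from the grid low, low + step, low + 2*step, ... not exceeding high."""
    n_steps = math.floor((high - low) / step + 1e-9)
    return low + step * rng.randint(0, n_steps)


def sample_gru_config(rng: random.Random) -> dict:
    """Sample one GRU configuration from the ranges listed in Table 5."""
    return {
        "gru_units": int(sample_stepped(rng, 32, 128, 16)),
        "dense_units": int(sample_stepped(rng, 10, 64, 16)),
        "dropout_rate": round(sample_stepped(rng, 0.1, 0.5, 0.05), 2),
        "recurrent_dropout_rate": round(sample_stepped(rng, 0.0, 0.4, 0.05), 2),
        "learning_rate": sample_loguniform(rng, 1e-4, 1e-2),
        "batch_size": rng.choice([32, 64, 128]),
        "l2_reg": sample_loguniform(rng, 1e-6, 1e-3),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
    for _ in range(3):
        print(sample_gru_config(rng))
```

Under this reading of the notation, a range such as [10, 64] with step 16 yields the grid 10, 26, 42, 58. Log-uniform sampling spreads draws evenly across orders of magnitude, which is why it is commonly preferred for learning rates and regularization strengths.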
Table 6 Best hyperparameters for each model after hyperparameter tuning by Bayesian optimization
Model | Best Hyperparameters
CatBoost | border_count: 113.572, depth: 10.763, iterations: 377.451, l2_leaf_reg: 5.840, learning_rate: 0.096
CNN | batch_size: 128, filters: 256, kernel_size: 7, dense_units: 112, dropout_rate: 0.2, l2_reg: 4.995035980196417e-06, learning_rate: 0.0004955676611313104
Decision Tree | ccp_alpha: 0.005, criterion: 0.742, max_depth: 18.752, min_samples_leaf: 46.240, min_samples_split: 11.054, splitter: 0.198
GRU | batch_size: 32, gru_units: 64, dropout_rate: 0.2, recurrent_dropout_rate: 0.0, dense_units: 16, l2_reg: 1e-4, learning_rate: 0.0003
KNN | n_neighbors: 19, p: 1, weights: 'distance'
Logistic Regression | C: 2.124, penalty: 0.020, solver: 0.970
Logistic Regression with BERT | C: 1.777, penalty: 0.720, solver: 0.128
LSTM | batch_size: 128, lstm_units: 112, dropout_rate: 0.25, recurrent_dropout_rate: 0.1, dense_units: 26, l2_reg: 0.0004702818029667774, learning_rate: 0.0015
MLP | activation: 0.769, alpha: 0.078, batch_size: 159.662, hidden_layer_sizes: 30.245, learning_rate: 0.886, learning_rate_init: 0.064, solver: 0.042
Random Forest | criterion: 0.034, max_depth: 14.093, max_features: 0.455, min_samples_leaf: 6.962, min_samples_split: 4.493, n_estimators: 102.007
XGBoost | colsample_bytree: 0.846, gamma: 1.380, learning_rate: 0.229, max_depth: 9.390, min_child_weight: 1.392, n_estimators: 458.533, subsample: 0.899
XGBoost with BERT | colsample_bytree: 0.546, gamma: 1.974, learning_rate: 0.149, max_depth: 7.704, min_child_weight: 6.886, n_estimators: 457.855, subsample: 0.919

Model training and validation

The final four models (LSTM, GRU, CNN, and XGBoost) were examined in detail for their training and validation performance.
While LSTM, GRU, and CNN demonstrated a good fit with consistent alignment between training and validation losses, XGBoost showed signs of overfitting, with validation loss diverging from training loss as training progressed. To train the XGBoost model, we used stratified data to maintain class balance and prevent bias during training. For the CNN model, both training and validation accuracies increased steadily, eventually plateauing at 0.94 and 0.92, respectively, after 30 epochs (Fig. 2A). The corresponding training and validation losses decreased consistently, stabilizing around 0.25 and 0.26 (Fig. 3A). The GRU model showed rapid improvements in both training and validation accuracies, which reached plateaus at 0.94 and 0.92 after approximately 20 epochs (Fig. 2B). Similarly, training and validation losses initially declined before leveling off at 0.24 and 0.25, respectively (Fig. 3B). For the LSTM model, training and validation accuracies exhibited a steady increase, plateauing at 0.93 and 0.92 after about 30 epochs (Fig. 2C). Training and validation losses decreased gradually and stabilized at 0.23 and 0.24 (Fig. 3C). In the case of XGBoost, training accuracy rose rapidly, peaking at 0.995 after 10 iterations, while validation accuracy fluctuated slightly before stabilizing around 0.93 (Fig. 2D). Training loss declined sharply, reaching a plateau at 0.01 after 10 iterations, whereas validation loss showed some variability but eventually settled around 0.2 (Fig. 3D). XGBoost achieved the lowest training loss and highest training accuracy but exhibited a larger gap between training and validation performance, indicating a risk of overfitting. In contrast, CNN, GRU, and LSTM maintained more balanced performance between training and validation, indicating better generalization.

Model testing

The performance metrics on the test dataset for the 12 models initially trained on 10% of the training data are provided in Table 7.
The results show that KNN, while simple and interpretable, struggled to capture complex patterns, as reflected in its lower performance compared to the deep learning models. Random Forest showed balanced precision and recall but lacked the higher accuracy seen in the top performers. Logistic Regression, despite being one of the simpler models, still outperformed some other traditional models, although its linearity assumption limited its ability to handle complex patterns effectively. The Decision Tree model was notably the weakest, with poor generalization and high sensitivity to noise, as seen in its high false positive and false negative rates. Logistic Regression with BERT and XGBoost with BERT embeddings achieved accuracies of 82% and 81.94%, respectively. While BERT embeddings improved feature representation, performance did not significantly surpass that of traditional machine learning models, suggesting that BERT’s contextual embeddings, while useful, did not offer substantial benefits for gender classification based on names. Combining BERT embeddings with XGBoost likewise produced no significant improvement in accuracy. CatBoost, which performs well with categorical data, provided robust results similar to XGBoost without overfitting, although it did not surpass XGBoost in precision or recall.

Table 7 Performance metrics for the 12 ML models (10% training data) on the test dataset
Model | Precision | Recall | F1-Score | Accuracy
CatBoost | 0.87 | 0.87 | 0.87 | 0.87
CNN | 0.93 | 0.93 | 0.93 | 0.93
Decision Tree | 0.72 | 0.71 | 0.71 | 0.71
GRU | 0.93 | 0.93 | 0.93 | 0.93
KNN | 0.86 | 0.86 | 0.86 | 0.86
Logistic Regression | 0.85 | 0.85 | 0.85 | 0.85
LogReg-BERT | 0.82 | 0.82 | 0.82 | 0.82
LSTM | 0.93 | 0.93 | 0.93 | 0.93
MLP | 0.86 | 0.86 | 0.86 | 0.86
Random Forest | 0.85 | 0.85 | 0.85 | 0.85
XGBoost | 0.87 | 0.87 | 0.87 | 0.87
XGBoost with BERT | 0.82 | 0.82 | 0.82 | 0.82

The four best-performing models were then trained on 100% of the data and tested, underscoring their ability to assign gender based on names.
The model performance data from the testing experiments, shown in Table 8, reveal significant insights into the effectiveness of various machine learning approaches for gender classification. The accuracies ranged from 81.97% to 93.37% across all models. The LSTM model achieved the highest accuracy of 93.37%, with balanced precision, recall, and F1 scores for both male and female names. This model required nearly 50 seconds for predictions. Its strength lies in capturing temporal dependencies and complex sequential patterns, contributing to consistent predictions with minimal bias, as shown in the confusion matrix. The CNN model closely followed with an accuracy of 92.99% and the fastest prediction time of just 5.61 seconds. It effectively extracted features from name sequences, capturing spatial hierarchies and complex patterns. However, the confusion matrix indicated slightly higher misclassification rates compared to LSTM, particularly for names with ambiguous gender associations. The GRU model achieved an accuracy of 92.65% while taking 10.44 seconds for predictions, slightly behind LSTM and CNN. It performed well across both genders, with a precision of 0.94 and recall of 0.90 for male names and a precision of 0.92 and recall of 0.93 for female names, showing a balanced distribution of false positives and false negatives across genders. The GRU’s ability to capture sequential dependencies is similar to that of the LSTM. In contrast, XGBoost, a traditional machine learning model, achieved an accuracy of 87.24%. While it did not surpass the deep learning models, it showed strong performance, especially for male names, with a precision and recall of 0.87. For female names, the precision was 0.86 and recall was 0.87, yielding an F1 score of 0.86 for both genders. While XGBoost demonstrated reasonable predictive performance, it took 8.48 seconds to make predictions. The final evaluation metrics are presented in Table 8.
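The per-class precision, recall, and F1 values discussed above all derive from the counts in a confusion matrix. The following minimal Python sketch shows that derivation for a two-class male/female problem. The function name `per_class_metrics`, the dictionary layout keyed by (true label, predicted label), and the counts in the example are hypothetical; they are not taken from the study's confusion matrices.

```python
def per_class_metrics(confusion: dict, labels: list) -> dict:
    """Compute accuracy plus per-class precision, recall, and F1.

    `confusion` maps (true_label, predicted_label) pairs to counts.
    """
    total = sum(confusion.values())
    correct = sum(confusion.get((label, label), 0) for label in labels)
    report = {"accuracy": correct / total if total else 0.0}
    for label in labels:
        tp = confusion.get((label, label), 0)
        # False positives: other classes predicted as `label`.
        fp = sum(confusion.get((other, label), 0) for other in labels if other != label)
        # False negatives: `label` predicted as some other class.
        fn = sum(confusion.get((label, other), 0) for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    counts = {
        ("male", "male"): 90,
        ("male", "female"): 10,
        ("female", "male"): 5,
        ("female", "female"): 95,
    }
    for key, value in per_class_metrics(counts, ["male", "female"]).items():
        print(key, value)
```

Because a male name misclassified as female is simultaneously a false negative for the male class and a false positive for the female class, the two classes' precision and recall values are coupled in a binary problem.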
Table 8 Performance metrics for the final models trained on full data
Model | Accuracy | Precision (Male) | Precision (Female) | F1 score (Male) | F1 score (Female) | Recall (Male) | Recall (Female) | Time taken for test results
XGBoost | 0.86 | 0.87 | 0.86 | 0.95 | 0.88 | 0.86 | 0.87 | 8.46s
GRU | 0.93 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 10.44s
LSTM | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 49.99s
CNN | 0.96 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 5.61s

Confusion matrix

The confusion matrix evaluates classification performance by identifying true positives, false positives, true negatives, and false negatives. This analysis plays a crucial role in measuring model accuracy and reliability. For model evaluation, the performance metrics accuracy, precision, recall, and F1-score were analyzed following the methodology of Ghate et al., 2024. Figure S3 presents the confusion matrices for the initial 12 models on 10% data, highlighting their performance on male and female name classification. While models including Log-BERT and XGBoost showed balanced classification, the Decision Tree model demonstrated the poorest overall performance, with the highest misclassification rate for female names. Based on these results, the top four performing models were selected for further testing, and their comparison is provided as confusion matrices (Fig. 4). Each model's accuracy, precision, recall, and confusion matrix provide insights into its effectiveness in predicting class labels. The XGBoost model achieved an accuracy of 0.8647, that is, correct predictions in 86.47% of cases (Fig. 4D). However, it had the lowest overall performance among the top models. In contrast, the LSTM model demonstrated robust performance with an accuracy of 94.0%, effectively identifying names across both classes, as reflected in its confusion matrix (Fig. 4C). The GRU model also performed well, achieving an accuracy of 0.95. Based on its confusion matrix, it classified male and female names better than the other models (Fig.
4 B). This performance highlights the GRU's capability to capture most actual positive instances while still exhibiting notable error rates. Finally, the CNN model outperformed all others with an accuracy of 96.0%, achieving precision and recall scores of 0.96 for both male and female classes. Its confusion matrix (Fig. 4A) highlights its ability to consistently and accurately classify names across categories, making it the most effective model overall.

Discussion

The results of this study demonstrate the effectiveness of CNN, LSTM, GRU, and XGBoost models in gender classification tasks, with each model showcasing distinct strengths. Among the models, CNN achieved the highest overall accuracy of 96%, followed by LSTM and GRU at 95% and 93%, respectively. XGBoost, while effective as a traditional machine learning model, trailed behind with an accuracy of 86%. These findings highlight the superiority of deep learning models in leveraging hierarchical and sequential patterns in names for gender classification. CNN’s performance can be attributed to its ability to learn hierarchical representations of character sequences, capturing both local and global patterns in names (Choudhury & Sarma, 2021; Gao et al., 2023). Its high accuracy (96%) reflects its robustness in extracting substrings, phonetic structures, and cultural naming conventions that correlate with gender (Pramanik & Bag, 2021; Sharma et al., 2021). Additionally, CNN achieved the fastest test time of 5.61 seconds, further supporting its efficiency in large-scale applications. These results align with prior research, such as Rego et al. (2021), which demonstrated the utility of CNNs in text classification. CNN’s computational efficiency, due to features like weight sharing and pooling layers, allowed it to generalize effectively without overfitting. LSTM and GRU models, with their sequential processing capabilities, also delivered strong performance.
Both models demonstrated high precision, recall, and F1 scores of 0.95 for male and female classifications, indicating their ability to capture long-term dependencies in character sequences. LSTM's slightly higher accuracy (95%) compared to GRU (93%) may be attributed to its more elaborate gating mechanisms, which are more effective at retaining relevant information over longer sequences. However, the LSTM's significantly higher computation time (49.99 seconds) compared to GRU (10.44 seconds) and CNN (5.61 seconds) suggests it is less suitable for real-time applications. These results are consistent with findings from Greff et al. (2016) and Lu & Salem (2017), where LSTMs were noted for their precision but also for their higher computational cost. XGBoost, despite its lower accuracy of 86%, demonstrated competitive performance in handling imbalanced datasets and produced reasonably balanced precision and recall values for male and female classifications. However, its F1 score of 0.87 and test time of 8.46 seconds (slower than CNN's 5.61 seconds) highlight its limitations compared to deep learning models. XGBoost struggled particularly with female name classification, indicating potential biases in feature representation. These results align with findings from Zhang et al. (2022), where XGBoost was noted for its effectiveness in high-dimensional feature spaces but limited performance in complex, sequential tasks. The precision, recall, and F1 scores for all models reveal the strengths and limitations of their architectures. CNN, LSTM, and GRU all achieved balanced precision and recall values for both male and female classifications, demonstrating their ability to handle gender-ambiguous names. CNN, in particular, exhibited the lowest misclassification rates, reinforcing its ability to generalize across diverse naming conventions.
XGBoost, while strong in male classifications, exhibited higher false-positive rates for female names, suggesting the need for further feature engineering or additional linguistic input. The confusion matrices of our top-performing models further highlight the strengths and limitations of each approach. CNN exhibited the lowest misclassification rates, particularly for ambiguous names, reinforcing its superior ability to generalize across diverse naming conventions. LSTM and GRU, while highly accurate, showed slightly higher false-positive rates compared to CNN, suggesting potential sensitivity to certain naming patterns. XGBoost, despite being robust for male name classification, struggled more with female names, indicating a potential bias in feature representation that future studies could address. Compared to existing studies, this work provides a comprehensive evaluation of model performance using a large and balanced dataset. While traditional models like Random Forest and Logistic Regression have shown F1 scores around 87% for similar tasks (Pham & Nguyen, 2023), our study highlights the clear advantage of deep learning models. Furthermore, the limited impact of advanced embeddings like BERT in this study underscores the simplicity of name-based datasets, where character-level patterns suffice for achieving high accuracy. Each model's architecture is suited to handling sequential and complex data patterns, positioning these models as powerful tools for gender prediction and related applications in natural language processing.

Limitations

This study highlights the potential of machine learning models for automating gender classification of names, but it also has certain limitations. The dataset used in this study underrepresents North-East Indian names, which could result in biased models with limited generalizability across India's diverse population.
Additionally, the models struggled to classify unisex names accurately, leading to a higher likelihood of misclassification in such cases. While several models achieved high accuracy, significant rates of false positives and false negatives were observed, which could have critical implications in real-world applications where precision is essential. Furthermore, prediction times varied considerably among the models; for instance, the LSTM model, despite its strong performance, had significantly slower prediction times (50 seconds) compared to CNN (5.61 seconds), potentially limiting its use in time-sensitive scenarios.

Conclusion

This study evaluated several machine learning and deep learning models for gender classification based on names, focusing on the accuracy, precision, and recall of each model. These findings have significant implications for gender classification, particularly in analyzing gender bias in sectors such as technology, media, academia, and business. The effectiveness of deep learning models in predicting gender from names offers a strong foundation for developing automated gender analysis tools. Among the models, CNN emerged as the most efficient and accurate, achieving 96% accuracy with the fastest prediction time of 5.61 seconds. LSTM followed closely with an accuracy of 95%, albeit with a much slower prediction time of 50 seconds. Both models demonstrated consistently high precision and recall across gender classes, effectively identifying patterns in name data. These results highlight the potential of machine learning and deep learning models in gender classification of Indian names, offering a foundation for the development of automated tools in computational social science and other practical applications. Future research should focus on refining these models to enhance computational efficiency, reduce error rates, and address biases.
Expanding the dataset to include more diverse linguistic and cultural representation, particularly from underrepresented regions like North-East India, will further improve model generalizability and robustness.

CRediT authorship contribution statement

SDG: Conceptualization, Resources, Software, Validation, Methodology, Supervision, Writing - original draft, Writing - review & editing. SH: Data curation, Writing - original draft, Visualization. DDG: Data curation, Methodology, Software, Validation, Writing - original draft. AM: Data curation, Methodology, Software, Visualization. AA: Data curation, Writing - original draft. ND: Writing - review & editing. PP: Writing - review & editing.

Declarations

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used QuillBot to improve clarity, engagement, and grammar. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Data availability

The synthetic name generator code used in this study is publicly available on GitHub at github.com/ghatesudi/synthetic_names and can be freely accessed under the MIT License.
Researchers interested in accessing the name predictor ML model may contact the corresponding author for further information or access instructions.

ORCID ID

Dhanush Ghate D https://orcid.org/0009-0004-9143-1267
Saishma H https://orcid.org/0009-0003-4454-3584
Adithya M https://orcid.org/0009-0004-2073-8210
Sudeep D. Ghate https://orcid.org/0000-0001-9996-3605
Neevan D’Souza https://orcid.org/0000-0002-4043-9638
Anjusha Alex https://orcid.org/0009-0004-2899-9554
Prakash Patil https://orcid.org/0000-0002-1263-8517

References

Tripathi A, Faruqui M (2011) Gender prediction of Indian names. In IEEE Technology Students' Symposium (pp. 137–141). IEEE
DeTienne DR, Chandler GN (2007) The role of gender in opportunity identification. Entrepreneurship theory Pract 31(3):365–386
Radhakrishnan S (2011) Appropriately Indian: Gender and culture in a new transnational class. Duke University Press
Hu Y, Hu C, Tran T, Kasturi T, Joseph E, Gillingham M (2021) What’s in a name? – gender classification of names with character-based machine learning models. Data Min Knowl Disc 35(4):1537–1563
Amarappa S, Sathyanarayana S (2015) Kannada named entity recognition and classification (NERC) based on multinomial naïve Bayes classifier. Int J Nat Lang Comput, 4
Jia Y, Zhao Y (2019) Gender identification in Chinese names. Lingua 234:102759
Manik LP, Syafiandini AF, Mustika HF, Akbar Z, Rianto Y (2019) Gender inference based on Indonesian name and profile photo. In 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA) (pp. 25–29). IEEE
Rego RC, Silva VM (2021) Predicting gender of Brazilian names using deep learning. arXiv:2106.10156
Sun Z, Li X, Sun X, Meng Y, Ao X, He Q, … Li J (2021) ChineseBERT: Chinese pretraining enhanced by glyph and pinyin information. arXiv preprint arXiv:2106.16038
Singh KS (1996) Communities, segments, synonyms, surnames and titles, vol 8. Oxford University Press, USA
Sharma DD (2005) Panorama of Indian Anthroponomy: An Historical, Socio-cultural & Linguistic Analysis of Indian Personal Names. Mittal
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, … Zheng X (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283)
Sammut C, Webb GI (eds) (2011) Encyclopedia of machine learning. Springer Science & Business Media
Cavnar WB, Vayda AJ (1992) Using superimposed coding of N-gram lists for Efficient Inexact Matching, Proceedings of the Fifth USPS Advanced Technology Conference, Washington D.C
Zhang Z (2015) The singular value decomposition, applications and beyond. arXiv preprint arXiv:1510.08532
Lefèvre S, Piantanida P (2016) Pytesseract: A Python wrapper for Google Tesseract-OCR. J Open-Source Softw 1(8):72
Fenniak M, Stamy M, Thoma M, Peveler M (2024) The pypdf library. https://pypi.org/project/pypdf/
Richardson L (2007) Beautiful soup documentation. https://beautiful-soup-4.readthedocs.io/en/latest/
D’Silva J, Sharma U (2022) Automatic text summarization of konkani texts using pre trained word embeddings and deep learning. Int J Electr Comput Eng 12(2):1990–2000. https://doi.org/10.11591/ijece.v12i2.pp1990-2000
Ghosh S (2021) Identifying click baits using various machine learning and deep learning techniques. Int J Inform Technol (Singapore) 13(3):1235–1242. https://doi.org/10.1007/s41870-020-00473-1
Qureshi KA, Sabih M (2021) Un-Compromised Credibility: Social Media Based Multi-Class Hate Speech Classification for Text. IEEE Access 9:109465–109477. https://doi.org/10.1109/ACCESS.2021.3101977
Rehman TU, Mahmud MS, Chang YK, Jin J, Shin J (2019) Current and future applications of statistical machine learning algorithms for agricultural machine vision systems. Comput Electron Agric 156:585–605. https://doi.org/10.1016/j.compag.2018.12.006
Syriopoulos PK, Kalampalikis NG, Kotsiantis SB, Vrahatis MN (2023) kNN Classification: a review. Ann Math Artif Intell 1–33. https://doi.org/10.1007/S10472-023-09882-X/METRICS
Tüfekci P, Bektaş M (2022) Author and genre identification of Turkish news texts using deep learning algorithms. Sadhana - Academy Proceedings in Engineering Sciences, 47(4), 194. https://doi.org/10.1007/s12046-022-01975-3
Zou J, Yuan C, Zhang X, Zou G, Wan ATK (2023) Model averaging for support vector classifier by cross-validation. Stat Comput 33(5):1–22. https://doi.org/10.1007/S11222-023-10284-6/
Rego RC, Silva VM, Fernandes VM (2021) Predicting gender by first name using character-level machine learning. arXiv preprint arXiv:2106.10156
Pham D, Nguyen L (2023, December) Gendec: A Machine Learning-Based Framework for Gender Detection from Japanese Names. In International Conference on Intelligent Systems Design and Applications (pp. 235–244). Cham: Springer Nature Switzerland
Ghosh S, Tyagi U, Suri M, Kumar S, Ramaneswaran S, Manocha D (2023) ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER. arXiv preprint arXiv:2306.00928. https://doi.org/10.48550/arXiv.2306.00928
Gao G, Yu Y, Yang J, Qi GJ, Yang M (2020) Hierarchical deep CNN feature set-based representation learning for robust cross-resolution face recognition. IEEE Trans Circuits Syst Video Technol 32(5):2550–2560
Choudhury A, Sarma KK (2021) A CNN-LSTM based ensemble framework for in-air handwritten Assamese character recognition. Multimedia Tools Appl 80(28):35649–35684
Sharma A, Lysenko A, Boroevich KA, Vans E, Tsunoda T (2021) DeepFeature: feature selection in nonimage data using convolutional neural network. Brief Bioinform 22(6):bbab297
Pramanik R, Bag S (2021) Handwritten Bangla city name word recognition using CNN-based transfer learning and FCN. Neural Comput Appl 33:9329–9341
Lu Y, Salem FM (2017, August) Simplified gating in long short-term memory (LSTM) recurrent neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS) (pp. 1601–1604). IEEE
Zhang P, Jia Y, Shang Y (2022) Research and application of XGBoost in imbalanced data. Int J Distrib Sens Netw 18(6):15501329221106935
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, 31
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324
Devlin J, Chang MW, Lee K, Toutanova K (2019, June) BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (Vol. 1, p. 2)
Trendowicz A, Jeffery R (2014) Classification and regression trees. Software Project Effort Estimation: Foundations and Best Practice Guidelines for Success, 295–304
Chen T, Guestrin C (2016, August) XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 785–794)
Hochreiter S, Schmidhuber J (1997) Long Short-Term Memory. Neural Computation 9(8):1735–1780. 10.1162/neco.1997.9.8.1735
Breiman L (2001) Random forests. Mach Learn 45:5–32
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Ghate D, Saishma H, Adithya M, Ghate SD (2025) Advancing Arecanut Quality Grading: A Comparative Analysis of YOLO Models with Hyperparameter Optimization. PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-5755373/v1]
Hunter JD (2007) Matplotlib: A 2D graphics environment.
Comput Sci Eng 9(3):90–95 Waskom M et al (2020) Seaborn: statistical data visualization. J Open Source Softw 5(51):241 Virtanen P et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17(3):261–272 Panchenko A, Teterin A (2014), April Detecting gender by full name: experiments with the Russian language. In International Conference on Analysis of Images, Social Networks and Texts (pp. 169–182). Cham: Springer International Publishing Santosh TYSS, Sanyal DK, Das PP (2020) Person Name Segmentation with Deep Neural Networks. In Mining Intelligence and Knowledge Exploration: 7th International Conference, MIKE 2019, Goa, India, December 19–22, 2019, Proceedings 7 (pp. 32–41). Springer International Publishing Kabir MH, Ahmad F, Hasan MAM, Shin J (2022) Gender Recognition of Bangla Names Using Deep Learning Approaches. Appl Sci 13(1):522 Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey. IEEE Trans neural networks Learn Syst 28(10):2222–2232 Additional Declarations The authors declare no competing interests. Supplementary Files SupplementaryPreprintsMLnamegender.docx Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. 
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5897194","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":406779906,"identity":"c24ec619-6227-4d00-b593-60bbf5fe918a","order_by":0,"name":"Sudeep D. Ghate","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5UlEQVRIiWNgGAWjYHACAyCW4OFnbz4AYsgQq8VCRrLnWAJYL7FaKmwMbuSAGAyEtcjPSN744OMOoOE3cj6/ulFjwcPAfvjoBrxW3EgrNpx5RoKHseftNuucY0C9PGlpN/Bqkcgxk+Ztk+BhZs/dZpzDBtQiwWOGV4v8jBzz33+BWtgYcp4Z5/wjQgvQC2bMjEAtPBw5zI9z24jQYnDmWbFkL9AvEjzHzJhz+4DWEfKLfHvyxg8/d9TZ2x9vfvw551udHD/74WP4HQYCjA1gik0CTBJUjqSF+QNRqkfBKBgFo2DEAQDor0VP/Ng/dAAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0001-9996-3605","institution":"Center for Bioinformatics, NITTE deemed to be University, Mangaluru -575018, India","correspondingAuthor":true,"prefix":"","firstName":"Sudeep","middleName":"D.","lastName":"Ghate","suffix":""},{"id":406779907,"identity":"9df0090a-fdef-4c04-936c-e5e0f3e1c646","order_by":1,"name":"Saishma H","email":"","orcid":"https://orcid.org/0009-0003-4454-3584","institution":"Center for Bioinformatics, NITTE deemed to be University, Mangaluru -575018, India","correspondingAuthor":false,"prefix":"","firstName":"Saishma","middleName":"","lastName":"H","suffix":""},{"id":406779908,"identity":"e25c0b9f-12f6-4a46-aa84-6257d587c71c","order_by":2,"name":"Dhanush Ghate D","email":"","orcid":"https://orcid.org/0009-0004-9143-1267","institution":"Department of Computer Science and Engineering, NMAM Institute of Technology, NITTE deemed to be University, 
Nitte - 574110, India","correspondingAuthor":false,"prefix":"","firstName":"Dhanush","middleName":"Ghate","lastName":"D","suffix":""},{"id":406779909,"identity":"122229c3-8312-4e5d-bf16-cb9a0d4edd85","order_by":3,"name":"Adithya M","email":"","orcid":"https://orcid.org/0009-0004-2073-8210","institution":"Department of Computer Science and Engineering, NMAM Institute of Technology, NITTE deemed to be University, Nitte - 574110, India","correspondingAuthor":false,"prefix":"","firstName":"Adithya","middleName":"","lastName":"M","suffix":""},{"id":406779910,"identity":"5d65820d-88cb-4671-b717-765c2ec98e89","order_by":4,"name":"Anjusha Alex","email":"","orcid":"https://orcid.org/0009-0004-2899-9554","institution":"Department of Biostatistics, KS Hegde Medical Academy, NITTE deemed to be University, Mangaluru -575018, India","correspondingAuthor":false,"prefix":"","firstName":"Anjusha","middleName":"","lastName":"Alex","suffix":""},{"id":406779911,"identity":"08d7abc9-c287-4f71-b777-977424a7c03b","order_by":5,"name":"Neevan D’Souza","email":"","orcid":"https://orcid.org/0000-0002-4043-9638","institution":"Department of Biostatistics, KS Hegde Medical Academy, NITTE deemed to be University, Mangaluru -575018, India","correspondingAuthor":false,"prefix":"","firstName":"Neevan","middleName":"","lastName":"D’Souza","suffix":""},{"id":406779912,"identity":"f920beea-ba72-44bc-b438-114e4d25247c","order_by":6,"name":"Prakash Patil","email":"","orcid":"https://orcid.org/0000-0002-1263-8517","institution":"Center Research Laboratory, KS Hegde Medical Academy, NITTE deemed to be University, Mangaluru -575018, India","correspondingAuthor":false,"prefix":"","firstName":"Prakash","middleName":"","lastName":"Patil","suffix":""}],"badges":[],"createdAt":"2025-01-24 
16:28:22","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-5897194/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5897194/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":74993228,"identity":"a8336c80-0bfa-48a2-955b-c6a732c355ad","added_by":"auto","created_at":"2025-01-29 08:05:48","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":164115,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eWorkflow illustrating the process of gender classification using machine learning models. \u003c/strong\u003eThe diagram details steps including data collection from electoral rolls and websites, data cleaning, and preprocessing. It highlights feature engineering techniques such as character-level tokenization and embedding generation followed by model selection, hyperparameter tuning, validation and performance analysis.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/50619aecdb0390d006730355.png"},{"id":74993230,"identity":"6a69ff1d-78c7-4323-8269-3ff9e3f81e2c","added_by":"auto","created_at":"2025-01-29 08:05:48","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":739616,"visible":true,"origin":"","legend":"\u003cp\u003eTraining vs. validation accuracy of final ML models for name-gender classification. Panels show: (A) CNN, (B) GRU, (C) LSTM, and (D) XGBoost. Panels A, B, and C illustrate accuracy trends over 150 epochs, while panel D shows accuracy across iterations for XGBoost. 
Early stopping was applied resulting in varying endpoint epochs for the models.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/b9a9c35e6b1204caee8dcd29.png"},{"id":74994062,"identity":"3af98f0d-ed32-4061-8502-20d9a85e6dc0","added_by":"auto","created_at":"2025-01-29 08:13:48","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":701081,"visible":true,"origin":"","legend":"\u003cp\u003eTraining vs. validation loss of final ML models for name-gender classification. Panels include: (A) CNN, (B) GRU, (C) LSTM and (D) XGBoost. Panels A, B and C display loss trends over 150 epochs reflecting training progress whereas panel D shows loss across iterations for XGBoost. Early stopping resulted in the models terminating at different epochs.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/07d652debcc7946978873e95.png"},{"id":74993235,"identity":"686a1b49-39c7-4fd1-b218-e7cd23904232","added_by":"auto","created_at":"2025-01-29 08:05:48","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":156351,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrices for the four best-performing models (CNN, GRU, LSTM, and XGBoost) on the testing dataset for name-gender classification. Diagonal values indicate correctly classified instances, while off-diagonal values represent misclassifications. 
Predicted labels are shown on the x-axis and actual labels on the y-axis.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/e3826bc8563cd1bf8535d185.png"},{"id":74994326,"identity":"7c4747f1-c886-43f5-9ab9-e75bf51007e0","added_by":"auto","created_at":"2025-01-29 08:21:50","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2965758,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/f9ba8c17-00c9-4ddf-b098-9cef54e86190.pdf"},{"id":74993234,"identity":"1490e47e-aa1e-4579-91a4-082e0e871c65","added_by":"auto","created_at":"2025-01-29 08:05:48","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":2803274,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryPreprintsMLnamegender.docx","url":"https://assets-eu.researchsquare.com/files/rs-5897194/v1/b81efcca043966939f612695.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eDecoding Gender: A Machine Learning Approach for Classifying Indian Names with Advanced Feature Extraction\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eIndia's cultural diversity creates significant challenges in gender classification based on names due to the country's diverse languages, religions, and naming conventions. Names in India are influenced by regional, religious, and familial factors reflecting distinct linguistic and cultural traditions. For instance, the same name may have different gender associations across regions or religions, adding complexity to automated classification systems (Tripathi et al., 2011; Sharma, 2005). 
Accurately determining gender from names is essential across various fields such as demography, healthcare, and marketing. In demography, gender classification supports accurate data analysis and policy development. In these fields, effective gender classification supports tasks like data segmentation, gender-specific research, and personalized services. Given India\u0026rsquo;s cultural and linguistic diversity, developing robust methods for gender classification becomes imperative (DeTienne et al., 2007).\u003c/p\u003e \u003cp\u003eMachine learning (ML) and natural language processing (NLP) offer promising solutions by learning patterns from labeled datasets, capturing complex cultural and linguistic nuances. Unlike traditional statistical methods that often fail to account for India's naming intricacies, ML models leverage large datasets to better adapt to cultural diversity. They also help reduce biases inherent in models trained primarily on Western datasets, creating more inclusive systems (Hu et al., 2021; Ghosh et al., 2023). For instance, ML models can distinguish subtle patterns in suffixes, prefixes, or phonetics that align with gender-specific naming conventions.\u003c/p\u003e \u003cp\u003eDespite progress, existing models frequently fall short in accuracy and inclusivity due to limited representation of Indian names in training datasets. Most prior research has either relied on Western-centric datasets or applied statistical methods, which overlook the cultural and linguistic diversity of Indian names, resulting in lower accuracy. This study addresses these gaps by developing predictive models using advanced ML techniques to improve gender classification from names. A key component of our approach is the creation of a comprehensive labeled dataset, ensuring representation across India's varied cultural and linguistic contexts. 
By exclusively focusing on Indian names, this study enhances gender classification for a culturally complex demographic and sets a stage for integrating diverse cultural perspectives into ML applications. While acknowledging that gender identity is not strictly binary, this study approaches gender inference as a binary classification task for practical applications such as policy development and data analytics (Radhakrishnan, 2011). Our research aims to advance automated gender classification systems by improving accuracy and cultural sensitivity. This contributes to more effective data analysis and informed decision-making across domains like healthcare, marketing, and social sciences, particularly in contexts requiring gender-segmented analysis.\u003c/p\u003e \u003cp\u003eDespite advancements in machine learning, several open questions remain regarding the classification of gender from Indian names. For instance, how effectively can machine learning models address the linguistic and cultural complexities inherent in Indian names? Can training on a culturally representative dataset mitigate biases and enhance accuracy across India\u0026rsquo;s diverse naming conventions? And, to what extent do such models generalize across regions with distinct linguistic influences? We hypothesized that a comprehensive, labeled dataset of Indian names, combined with advanced machine learning techniques, could significantly improve the accuracy and inclusivity of gender classification. To explore this, we developed a culturally diverse dataset of Indian names and trained machine learning models capable of capturing intricate patterns unique to Indian naming conventions. 
Our findings offer valuable insights into the design of culturally sensitive machine learning systems, with practical implications in areas such as human-computer interaction, marketing intelligence, and social analytics.\u003c/p\u003e"},{"header":"Related works","content":"\u003cp\u003eMany studies have employed ML strategies for named entity recognition and gender classification highlighting their effectiveness across various languages and datasets (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Traditional methods, such as Naive Bayes, have been widely used. For example, a Naive Bayes system achieved 77.2% accuracy in classifying Kannada names (Amarappa \u0026amp; Sathyanarayana, 2015). Logistic regression has also been explored for name-based gender classification, achieving 83.7% accuracy on Indonesian names, which improved to 98.6% when combined with Convolutional Neural Network (CNN) analyzing profile photos (Manik et al., 2019). Deep learning approaches have shown significant promise; \u003cem\u003eviz\u003c/em\u003e embedding Chinese names with the Pinyin approach using the BERT model achieved 95% accuracy, outperforming Naive Bayes and Gradient Boosting methods (Sun et al., 2021). Similarly, Dual- Long Short-Term Memory (LSTM) models effectively classified genders in a dataset of 21\u0026nbsp;million first names (Hu et al., 2021), while neural networks like MLPs and Gated Recurrent Units (GRU) consistently achieved over 90% accuracy on Brazilian names (Rego \u0026amp; Silva, 2021). 
A study on Bangladeshi names achieved a peak accuracy of 92.16% using Conv1D models, addressing challenges posed by unisex names and proposing future directions for improvement (Kabir et al., 2022).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComprehensive summary of related works on name-gender classification\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eArticle\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOrigin of Names\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLanguage\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eProcessing Methods\u003c/p\u003e \u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eDataset and Results\u003c/p\u003e \u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAmarappa and Sathyanarayana (2015)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNA\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003eKannada\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNaive Bayes classifier\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e10-fold cross-validation accuracy of 77.2% for classifying Kannada names.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eJia and Zhao (2019)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChinese\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eChinese\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLogistic regression on written and spoken names, BERT with Pinyin embedding\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAccuracy of 93%\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eManik et al. (2019)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIndonesian\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIndonesian\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLogistic regression on names, CNNs on profile photos\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAccuracy of 83.7% for names, 98.6% for names and photos\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHu et al. 
(2021)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUSA\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCharacter-based\u003c/p\u003e \u003cp\u003emachine learning\u003c/p\u003e \u003cp\u003emodels such as\u003c/p\u003e \u003cp\u003eLSTM and BERT.\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eA dataset of 21\u0026nbsp;million unique first names from SSA and YAHOO data. LSTM and BERT-based different models obtained approximately 87% and 88% accuracy, respectively.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRego and Silva (2021)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBrazilian\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDNN-based\u003c/p\u003e \u003cp\u003emodels, including\u003c/p\u003e \u003cp\u003eMLP, RNN, GRU,\u003c/p\u003e \u003cp\u003eCNN, and\u003c/p\u003e \u003cp\u003eBi-LSTM.\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eThe dataset consists of 100,787 Brazilian names, of which 54.82% are female names and 45.18% are male, based on 2010 CENSO data.\u003c/p\u003e \u003cp\u003eCNN, Bi-LSTM, RNN, MLP, and GRU-based models achieved 92%, 95%, 93%, 86%, and 94%\u003c/p\u003e \u003cp\u003eaccuracy, respectively.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePanchenko et al. 
(2014)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRussian\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eStatistical models\u003c/p\u003e \u003cp\u003ebased on one type\u003c/p\u003e \u003cp\u003eof features:\u003c/p\u003e \u003cp\u003eendings, character\u003c/p\u003e \u003cp\u003etrigrams, and\u003c/p\u003e \u003cp\u003edictionary.\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eA dataset of 100,000 Russian\u003c/p\u003e \u003cp\u003efull names from Facebook.\u003c/p\u003e \u003cp\u003eAccuracy of upto 96% is\u003c/p\u003e \u003cp\u003eobtained from a combined\u003c/p\u003e \u003cp\u003emodel:\u003c/p\u003e \u003cp\u003eendings + 3-g + dictionary.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTripathi and Faruqui (2011)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIndian\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnglish\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMorphological features and n-gram suffixes with SVM classification\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eA dataset of 2000 Indian names in Gujrati, Tamil, Telugu, Hindi, Urdu, Bengali and Tamil names. The training dataset had 890 female and 1110 male names. This model achieved the maximum F1 score at 94.9%.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKabir et al. 
(2022)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBangladesh\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBangla\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eConv1D, LSTM, BiLSTM, and Stacked LSTM models were used for gender recognition, with tokenized characters converted into numerical embeddings and optimized through hyperparameter tuning.\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eA dataset of 2,030 unique Bangladeshi names. The Conv1D model achieved the highest accuracy at 91.18%.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e \u003cp\u003e\u003c/p\u003e \u003cp\u003eSeveral studies have explored Indian names specifically to classify them according to gender. Morphological traits and n-gram suffixes improved SVM-based classification (Tripathi \u0026amp; Faruqui, 2011). A character-level BiLSTM model with a conditional random field achieved 94% accuracy in segmenting Indian author names, demonstrating the effectiveness of deep learning for handling diverse naming conventions. (Santosh et al., 2020). Simpler models, such as the SimpleText model, have also performed well, achieving 94.67% accuracy on a dataset of 84,899 names, comparable to more complex architectures like LSTMs (Ghosh, 2021). These studies underscore the potential of machine learning for name-based gender classification while highlighting challenges such as linguistic diversity, unisex names, and cultural nuances. 
Building on these insights, our study focuses on leveraging advanced machine learning techniques with a comprehensive dataset of Indian names to address these complexities effectively.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003cdiv id=\"Sec4\" class=\"Section3\"\u003e \u003c/div\u003e \u003c/div\u003e\n\n\n\n \n\n \n\n\n\n "},{"header":"Methodology","content":"\u003ch2\u003eData Collection and Preparation\u003c/h2\u003e\u003cp\u003eThe electoral roll data was manually downloaded from the Election Commission of India (ECI) website (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://voters.eci.gov.in/download-eroll\u003c/span\u003e\u003cspan address=\"https://voters.eci.gov.in/download-eroll\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e, last accessed on Nov 5, 2024). Each state’s data was available either in English or a regional language, with preference given to English files. The PDF files contained structured information for each individual, including name, relative’s name, address, and gender. However, as the name data was often in image format, we utilized optical character recognition (OCR) with Pytesseract (Lefèvre et al., 2016) and PyPDF2 (Fenniak et al., 2022) for text extraction.\u003c/p\u003e\u003cp\u003eThe text extraction process involved converting each page of the PDF to an image, enhancing image quality (contrast adjustments, noise reduction), and applying OCR. Irrelevant pages were skipped, and regular expressions were employed to identify name fields and gender-related information. Names extracted from regional language PDFs were translated to English using Google Translate to ensure uniformity. A total of 150–200 files from each state were processed based on the population share of the state. Names with missing or ambiguous gender information were excluded. 
The results were stored in dictionaries, and the final dataset was saved as CSV files for subsequent analysis.\u003c/p\u003e\u003cp\u003eThe details of names extracted from each state are given in supplementary Table S1. Challenges included regional language barriers and inconsistencies in electoral roll maintenance across states. The regional language entries pose a significant barrier to uniformity in electoral rolls. For instance, while names in South Indian states (except Kerala), Delhi, Jammu and Kashmir, and the northeastern states were often available in English, those in Assam, Tripura, Gujarat, Odisha, Punjab, and Kerala were recorded exclusively in regional languages. These discrepancies complicated the data extraction process. Additionally, gender entries in some records used non-standard terms like \"boy\" or \"girl,\" requiring manual resolution to maintain consistency.\u003c/p\u003e\u003ch3\u003eDataset Augmentation\u003c/h3\u003e\u003cp\u003eTo enrich the dataset, additional male and female names were extracted from various websites using the Beautiful Soup Python library (Richardson, 2007) for web scraping. This process emphasized culturally and religiously diverse names to reflect the demographic variety of the population. The sources and counts of names from different websites are detailed in Supplementary Table S2. Further, insights from the literature (Singh KS, 1996; Sharma, 2005) guided the generation of a comprehensive list of first and surnames across states, ensuring representation of religions, cultures, tribes, and communities. Custom Python scripts generated random combinations of first and last names, including suffixes and tribal names where applicable. Details of synthetically generated names are provided in Supplementary Table S3. To facilitate broader usability, the scripts developed for text extraction, name synthesis, and data processing were consolidated into a standalone Python package. 
The package’s current version and documentation are available at [\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://pypi.org/project/indigen/\u003c/span\u003e\u003cspan address=\"https://pypi.org/project/indigen/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e].\u003c/p\u003e\u003ch3\u003eData Cleaning and Preprocessing\u003c/h3\u003e\u003cp\u003eThe raw data from electoral name files underwent a comprehensive cleaning process to ensure uniformity and relevance for further analysis. Punctuation, honorifics, and titles (e.g., \"Mr.,\" \"Mrs.,\" \"Miss,\" \"Ms.\") were removed, and initials or words shorter than three characters were eliminated. Names were standardized by removing special characters, converting them to lowercase, and trimming extraneous spaces. These steps ensured that only relevant and valid names remained. An iterative filtering process was then applied to exclude names with fewer than four characters, more than three words, or overly complex structures. Excluded entries (e.g., short or overly complex names) were stored separately for further review or reprocessing. Rows with missing values were dropped, and gender labels were standardized by converting them to lowercase and replacing variations like 'woman', 'women', 'man', and 'men' with 'female' and 'male'. Names containing specific substrings were analyzed to check for contextual inconsistencies in gender labeling. For names where a female-related substring was present but not as the first word, gender labels were updated to 'Female' based on the cultural conventions of the state after manual verification. Names classified as artifacts (e.g., containing repeated letters such as 'aaa' or special characters) were identified and removed. Periods, single-letter entries, and unwanted characters such as ?, *, or # were also cleaned using regex-based transformations. 
Finally, all names were converted to title case for uniformity.\u003c/p\u003e\u003cp\u003eA custom script was developed to clean names from states with Hindi-language electoral data that had been incorrectly translated into English by Google Translate. To manage this, we utilized the NLTK library (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.nltk.org/\u003c/span\u003e\u003cspan address=\"https://www.nltk.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) (NLTK Project, 2001–2024). A set of valid English words was created with NLTK and cross-checked against the translated names. A custom exclusion list of valid Indian names was also applied to identify and remove inconsistencies while avoiding the removal of legitimate Indian names. Names incorrectly flagged as English words but unrelated to valid translations were identified and exported for manual verification and cleaning. After merging the website and synthetic names, duplicates were removed by grouping on name and gender to obtain the final non-electoral dataset, which was then merged with the cleaned electoral data to form the final dataset for machine learning.\u003c/p\u003e\u003ch3\u003eFeature engineering\u003c/h3\u003e\u003cp\u003eA comprehensive feature engineering strategy was employed to transform the raw text data (names) into meaningful representations for machine learning models. For traditional machine learning algorithms such as KNN, CatBoost, Decision Tree, GRU, Logistic Regression, MLP, Random Forest, and XGBoost, vectorization techniques like Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) (Sammut and Webb, 2011) and character bigrams (Cavnar et al., 1992) were utilized to capture both frequency-based and sequential patterns in the data. 
To address the issue of high dimensionality, Singular Value Decomposition (SVD) (Zhang, 2015) was applied to these vectorized features, reducing them to 100 components each.\u003c/p\u003e\u003cp\u003eAdditionally, manual features such as name length, vowel counts, and phonetic representation lengths (derived using the Double Metaphone algorithm) were incorporated to highlight linguistic and cultural trends. These features were normalized and combined into a unified feature matrix. Finally, Principal Component Analysis (PCA) was applied to the combined matrix, retaining 95% of the variance for efficiency. For models leveraging BERT-based feature engineering, tokenized names were processed through a pre-trained DistilBERT transformer model to generate embeddings that captured semantic and contextual information. These embeddings were reduced to 128 components using PCA and subsequently integrated with the traditional and manual features to provide a robust input for machine learning algorithms.\u003c/p\u003e\u003cp\u003eFor deep learning models such as LSTM, CNN, and GRU, a distinct feature engineering approach was adopted to capture sequential character-level patterns. Names were tokenized at the character level, and the resulting sequences were padded to a uniform maximum length to ensure consistency across inputs. In addition to the sequential data, auxiliary features like name length were extracted and processed separately. The padded sequences, along with the auxiliary features, were fed into the models, enabling them to capture both sequential patterns and supplementary linguistic information. The target variable, gender, was label-encoded to ensure compatibility with neural network architectures. The processed features were then split into training and testing datasets to facilitate model evaluation and performance benchmarking. 
The final feature matrix, comprising dimensionality-reduced embeddings, manually engineered attributes, and interaction terms, provided a balanced and enriched representation of the dataset. This matrix served as a unified input for machine learning models, allowing them to effectively leverage both linguistic and statistical nuances for enhanced classification performance.\u003c/p\u003e\u003ch2\u003eModel selection\u003c/h2\u003e\u003cp\u003eTwelve models were adapted to classify gender based on names: KNN, CatBoost (Prokhorenkova et al., 2018), CNN (LeCun et al., 1998), Decision Tree (Ryzin, 1986), GRU (Cho, 2014), Logistic Regression (Trendowicz et al., 2014), Logistic Regression with BERT, LSTM (Hochreiter \u0026amp; Schmidhuber, 1997), MLP, Random Forest (Breiman, 2001), XGBoost (Chen, 2016), and XGBoost with BERT. These models were selected for their proven effectiveness across various classification tasks, ensuring suitability for our specific application. K-Nearest Neighbors (KNN) was chosen for its simplicity and interpretability, providing a foundational comparison point. CatBoost, known for handling categorical data efficiently, was selected for its strength in preventing overfitting. CNNs excel at capturing spatial hierarchies within name sequences, while GRU and LSTM models were chosen for their ability to handle sequential dependencies in names. Logistic Regression served as a basic model for comparison, and combining it with BERT (Kenton, 2019) provided enhanced contextual understanding. Random Forest and XGBoost were included for their ensemble learning capabilities and speed, with XGBoost with BERT combining the strengths of both to capture deeper patterns in names. 
Each model was selected to address the unique complexities of name-based gender classification, with the aim of achieving high accuracy and cultural sensitivity.\u003c/p\u003e\u003ch3\u003eHyperparameter tuning\u003c/h3\u003e\u003cp\u003eHyperparameter tuning for all models was conducted using Bayesian Optimization via Optuna, with a custom implementation employing a 5-fold cross-validation approach over 30 trials. For GRU, LSTM, and CNN models, the tuning process included 20 epochs of training with early stopping, configured with a patience of 3 to halt training if no improvement in performance was observed. Ten percent of the original data was used for this initial screening. The dataset was split into training and evaluation sets using a stratified split, with 70% of the data allocated for training and 30% for evaluation. Validation accuracy from each fold served as the optimization metric.\u003c/p\u003e\u003ch3\u003eModel architecture\u003c/h3\u003e\u003cp\u003eThe architectures for the LSTM, GRU, and CNN models were designed to process sequential data alongside additional features. Each model utilized an embedding layer to process input sequences of characters, followed by either a recurrent or convolutional layer. The GRU model employed a Bidirectional GRU layer, where the number of units and dropout rates were tuned during the hyperparameter optimization process. The output from the GRU layer was then concatenated with an additional input feature and passed through a dense layer with ReLU activation and L2 regularization, with the final output produced by a sigmoid activation function for binary classification. The CNN model followed a similar input structure, using an embedding layer followed by a 1D convolutional layer with tunable filters and kernel sizes. 
After pooling and flattening the feature maps, these were concatenated with the additional input feature, followed by dense layers with ReLU activation, dropout for regularization, and L2 regularization, before outputting the final binary classification result via a sigmoid activation. The LSTM model mirrored the GRU model, using a Bidirectional LSTM layer instead, with similarly tuned dropout rates and recurrent dropout. The output from the LSTM layer was concatenated with the additional feature and passed through a dense layer with ReLU activation and L2 regularization, concluding with a sigmoid activation. These architectures were optimized to handle sequential data and additional features, allowing the models to learn effectively from complex patterns.\u003c/p\u003e\u003ch2\u003eFinal training, validation and testing\u003c/h2\u003e\u003cp\u003eOnce the optimal hyperparameters were determined, each model was initially trained on a 10% subset of the training dataset using these settings. After tuning, the models were evaluated on the test dataset to assess their generalization performance. Metrics such as accuracy, classification reports, and confusion matrices were calculated to quantify the reliability and effectiveness of the models. From the list of 12 models, the best 4 were selected based on accuracy and other performance metrics, and these models were retrained using the full training dataset. This final training phase used 100 epochs with an early-stopping patience of 8. For XGBoost, which is not trained in epochs, hyperparameter tuning was performed again with the full dataset across 50 trials, and the best parameters were used for the final model. The final models were saved as .h5 files, and testing was performed on the full test dataset. For the final evaluation, the time taken for prediction, confusion matrix, precision, recall, F1-score, and support were calculated for the selected models. Additionally, training vs. validation loss and training vs. 
validation accuracy were plotted to visualize the convergence and performance trends.\u003c/p\u003e\u003ch2\u003eHardware specifications\u003c/h2\u003e\u003cp\u003eThe experiments were performed on a system with 32GB of RAM and an NVIDIA RTX 3060 GPU with 12GB of memory, providing sufficient computational resources for the training and evaluation of the deep learning models. The models were trained using TensorFlow (Abadi et al., 2016), and the experiments were executed within Jupyter Notebook to facilitate iterative development and evaluation. Scripts are available in the accompanying GitHub repository for reproducibility and further exploration. The following software was used in the analysis: Python 3.10, TensorFlow, pandas, matplotlib (Hunter et al., 2007), seaborn (Waskom M et al., 2020), SciPy (Virtanen P et al., 2020), and NumPy.\u003c/p\u003e\u003ch2\u003eData Access and Ethical Considerations\u003c/h2\u003e\u003cp\u003eWe adhered to all applicable guidelines and permissions governing the use of electronic records, ensuring compliance with data protection policies. Data extraction was limited to non-identifiable information (names and general demographic indicators) and conducted responsibly following approval from the Institutional Scientific and Ethical Board (NU/CEC/2024/723 study number CEC/275 dt. 15.11.2024). To safeguard privacy, all data was anonymized upon extraction, retaining only names for model training. Sensitive information such as addresses or personal identifiers was not collected. Following model training, the raw electoral roll data was permanently deleted and only aggregated results and statistical summaries were retained. 
Access to the data was restricted to authorized study personnel, and all procedures complied with Institutional Ethical Committee guidelines and the Information Technology Act of India, emphasizing our commitment to data protection and privacy.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eA flowchart illustrating the process of data collection, cleaning, and classification of gender based on names is provided (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The names were extracted from electoral rolls, additional names were gathered by web scraping from various name websites, and synthetically generated names were also added to enrich the dataset. Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e summarizes the dataset used for name-gender classification. It shows the number of male and female names in each stage of data collection and processing, starting with raw electoral data and ending with the final test set. The dataset includes both real-world data from electoral rolls and websites, as well as synthetically generated names to balance the dataset. The combined dataset for name-gender classification includes 15,74,573 male names and 15,56,468 female names, totaling 31,31,042 names, after incorporating cleaned electoral data, names from websites, and synthetically generated names. Data cleaning was conducted to eliminate irrelevant characters and maintain the integrity of the names retained for analysis. Advanced feature engineering techniques mentioned in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, such as Bag-of-Words (BoW), Word2Vec, Character n-grams, and Term Frequency-Inverse Document Frequency (TF-IDF), transformed raw text data into significant representations for machine learning applications. Dimensionality reduction methods such as SVD and PCA (the latter also applied to the BERT embeddings) were used to reduce the feature space and enhance model performance. 
Utilizing a balanced dataset of names classified by male and female gender, we systematically investigated various deep learning architectures and machine learning algorithms mentioned in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, including CNN, GRU, LSTM, and XGBoost, to determine the most effective model for this classification task. Ten percent of the dataset was allocated for training and testing purposes. Graphs showing training and validation accuracy and loss over iterations were generated during the hyperparameter tuning process for all models. These figures (Fig. S1 and Fig. S2) illustrate the convergence behavior and performance trends. The top-performing models were selected based on their validation performance and retrained on the entire dataset (training\u0026thinsp;+\u0026thinsp;validation) to enhance generalization. The final performance was then evaluated on a separate test dataset, and the model with the best test set performance was chosen for name-gender classification.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDistribution of male and female names in the name-gender classification dataset\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale names\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFemale names\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTotal\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eElectoral data before cleaning\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e12,57,740\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e21,22,284\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e33,80,024\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eElectoral data after cleaning\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10,87,130\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e9,45,976\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e20,33,075\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWebsites\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e59,766\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e53,781\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1,13,547\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSynthetic (before duplicate removal)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e 
\u003cp\u003e10,57,074\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e12,13,053\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e22,70,127\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSynthetic (after duplicate removal)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e4,87,443\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6,10,492\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e10,97,967\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFinal combined and cleaned training data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e15,74,573\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e15,56,468\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e31,31,042\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFinal test dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1,23,569\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1,17,300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2,40,859\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 
3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFeature engineering techniques used for name-gender classification: Advantages and Trade-offs\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTechnique\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAdvantages for Name Classification\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLimitations\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRole in This Work\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBag-of-Words (BoW)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConverts names into vectors of word counts or frequencies.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSimple, fast, and effective for capturing frequent name patterns.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIgnores order, context, and phonetics, limiting nuanced recognition.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCaptured basic frequency-based 
name patterns.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTF-IDF\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWeighs names by term importance in the dataset.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHighlights distinctive name components and rare patterns.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eIgnores sequence, semantic relationships, and name structure.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEnhanced feature representation for name uniqueness.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCharacter Bigrams\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSplits names into overlapping two-character sequences.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCaptures character-level patterns in prefixes, suffixes, and roots.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHigh dimensionality; sparse representations can lead to overfitting.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eIdentified morphological trends like prefixes/suffixes.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSingular Value Decomposition (SVD)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eReduces dimensionality of BoW, TF-IDF, and n-gram features.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eImproves efficiency while retaining key patterns in names.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLoss of 
interpretability and fine-grained details.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eReduced vectorized features to 100 components.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAdditional Features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLength of names, vowel counts, and phonetic lengths.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEncodes intuitive patterns, e.g., short male names or vowel-rich female names.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLimited predictive power if used in isolation.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eHighlighted linguistic trends in names.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDouble Metaphone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEncodes names into phonetic representations.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGroups similar-sounding names.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCan oversimplify phonetic variations in complex names.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCaptured phonetic similarity trends in gendered names.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrincipal Component Analysis (PCA)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eReduces dimensions of combined feature matrices.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eImproves computational efficiency; focuses on 
variance.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMay lose semantic and contextual nuances.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRetained 95% variance for efficiency.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDistilBERT Embeddings\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eContextual semantic embeddings from pre-trained transformer.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCaptures cultural, regional, and contextual name patterns.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eComputationally intensive; may need fine-tuning for specific datasets.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eProvided robust, contextual name representations.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTokenized Character Sequences\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConverts names into uniform-length character sequences.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePreserves order, enabling models to learn sequential patterns.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSequence padding/truncation can affect input consistency.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eInput for neural models like CNN, GRU, and LSTM.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLabel Encoding\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEncodes gender (target) numerically for 
models.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSimple and compatible with machine learning models.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDoesn't capture complex target relationships (e.g., non-binary genders).\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEncoded binary gender labels.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStrengths and weaknesses of models used in this study\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eStrengths\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eWeaknesses\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eExcellent for learning local patterns in character sequences.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRequires large amounts of labeled data and may struggle with 
long-range dependencies.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHighly interpretable and easy to visualize.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProne to overfitting, especially with small datasets.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGRU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficient at capturing sequential dependencies.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eStruggles with learning very long-term dependencies.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimple and interpretable, effective for small datasets.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eComputationally expensive and less suitable for large datasets.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimple, interpretable, and fast for binary classification.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAssumes linear relationships, struggles with complex or nonlinear patterns.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression with BERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCombines simplicity with contextualized feature 
extraction.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eComputationally expensive due to embedding generation.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCaptures long-range dependencies in character sequences.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eComplex, slow to train, and resource-intensive.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eVersatile and effective for structured inputs like feature matrices.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProne to overfitting without regularization.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRobust and reduces overfitting by averaging multiple decision trees.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLess interpretable than single decision trees.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHighly efficient, scalable, and fast with boosting techniques.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensitive to hyperparameter tuning, which can affect performance.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost with BERT\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCombines contextual feature extraction with powerful boosting.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eResource-intensive and requires careful optimization.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eJustification for choosing the model\u003c/h2\u003e \u003cp\u003eTwelve algorithms were selected to represent a diverse range of techniques. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e summarizes the key strengths and weaknesses of these machine learning models, providing a comparative overview. The selected algorithms included parametric and non-parametric approaches, linear and non-linear models, and ensemble methods. This diversity makes them suitable for handling various types of data and classification problems. Random Forest is an ensemble method that excels at handling large datasets with many features, reducing overfitting and improving generalization for both classification and regression tasks (Iranzad \u0026amp; Liu, 2024). For high-dimensional classification tasks, Support Vector Classification (SVC) is effective at modeling complex nonlinear relationships and finding clear decision boundaries (Zou et al., 2023). K-Nearest Neighbors (KNN) classifies a sample by the majority class of its nearest neighbors, which suits it to tasks like image classification and recommendation systems (Syriopoulos et al., 2023). Gaussian Naive Bayes (GaussianNB) assumes feature independence and a Gaussian distribution, making it efficient for high-dimensional data and tasks like spam filtering and sentiment analysis (Rehman et al., 2019; T\u0026uuml;fekci \u0026amp; Bektaş, 2022). 
The Decision Tree Classifier partitions the feature space to create human-readable decision rules but may overfit with noisy data (Rehman et al., 2019). The MLP Classifier, a neural network model, can learn complex nonlinear patterns, making it suitable for tasks like image classification and natural language processing (D\u0026rsquo;Silva \u0026amp; Sharma, 2022). XGBoost, an ensemble method combining decision trees via gradient boosting, is known for high performance with large datasets (Ghosh, 2021). CatBoost, another gradient-boosting method, handles categorical features effectively by using its own encoding algorithm (Qureshi \u0026amp; Sabih, 2021). GRUs and LSTM networks are RNNs designed for sequence prediction, with GRUs excelling in time-series and NLP tasks, and LSTMs capturing long-range dependencies for applications like speech recognition and text generation. Lastly, 1D Convolutional Neural Networks (1D CNNs) are specialized for processing sequential data, particularly effective in time-series analysis and signal processing by learning spatial hierarchies of features.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eHyperparameter tuning\u003c/h2\u003e \u003cp\u003eHyperparameters were tuned over 30 epochs to bring each model to its best performance. The hyperparameters selected for tuning and the ranges searched are detailed in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. The final configuration for each machine learning model, that is, the parameters that produced the most accurate results, was determined through Bayesian optimization (Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Key hyperparameters such as neighbour count, regularization strength, learning rate, batch size, dropout rate, and L2 regularization were tuned. 
CatBoost reached a score of 0.7997 with L2 leaf regularization (\u0026lsquo;l2_leaf_reg\u0026rsquo;) set to 5.84 and a learning rate of 0.0959. MLP achieved 0.8054 by configuring hidden layers and regularization parameters to enhance generalization. Log-BERT and XGBoost both achieved the leading score of 0.8912; for XGBoost, this corresponded to a regularization gamma of 1.3797 and a learning rate of 0.2290. LSTM and GRU models achieved 88.5% and 88.1% validation accuracy, respectively, with carefully tuned batch sizes, units, regularization and learning rates 0.0033 and. The CNN achieved a validation accuracy of 84.6% with a batch size of 32, a dropout rate of 20%, an L2 regularization value of 7.45e-6, and a learning rate of 0.00162. The GRU model\u0026rsquo;s 88.1% validation accuracy was obtained with a batch size of 32, 128 GRU units, a 20% dropout rate, and 58 dense units, together with an L2 regularization value of 1.53e-6 and a learning rate of 0.00093. 
These results demonstrate the importance of selecting appropriate hyperparameters for achieving optimal model performance across various algorithms.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eHyperparameters used for model optimization with range and description\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHyperparameter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRange/Values\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC, solver\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[0.001, 0.01, 0.1], ['saga', 'lbfgs']\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRegularization strength; optimization algorithm.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_iter, penalty\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[1000, 2000, 3000], ['l2', 'l1']\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMax iterations; regularization type.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators, max_depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[100, 200], [None, 10, 20]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNumber of trees; tree depth.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_neighbors\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[3, 5, 7]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNumber of nearest neighbors.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth, min_samples_split\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[None, 10, 20], [2, 5]\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTree depth; min samples for splitting.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eMLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ehidden_layer_sizes, alpha\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[(50,), (100,), (50, 50)], [0.0001, 0.001]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHidden layers; regularization strength.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_iter, learning_rate_init\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[1000], [0.0001, 0.001]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMax iterations; initial learning rate.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eactivation, early_stopping, n_iter_no_change\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e['relu'], [True], [10]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eActivation function; early stopping settings.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emax_depth, learning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e(3, 10), (0.01, 0.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTree depth; step size shrinkage.\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003en_estimators, gamma\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e(50, 500), (0, 5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBoosting rounds; min loss for splits.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003emin_child_weight, subsample, colsample_bytree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e(1, 10), (0.5, 1.0), (0.5, 1.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMin child weight; sample fractions.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eCatBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eiterations, depth\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[100, 200], [3, 5, 7]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBoosting iterations; tree depth.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[0.01, 0.1]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eStep size shrinkage.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eGRU\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003egru_units, dense_units\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[32, 128] (step 16), [10, 64] (step 16)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGRU units; dense layer size.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003edropout_rate, recurrent_dropout_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[0.1, 0.5] (step 0.05), [0.0, 0.4] (step 0.05)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRegularization dropouts.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate, batch_size, l2_reg\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOptimization settings.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003efilters, kernel_size\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[64, 256] (step 32), [3, 7] (step 2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eConv1D filters; kernel size.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003edense_units, dropout_rate\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[16, 128] (step 16), [0.1, 0.5] (step 0.05)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eDense layer size; regularization dropout.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate, batch_size, l2_reg\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOptimization settings.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elstm_units, dense_units\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[32, 128] (step 16), [10, 64] (step 16)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLSTM units; dense layer size.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003edropout_rate, recurrent_dropout_rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[0.1, 0.5] (step 0.05), [0.0, 0.4] (step 0.05)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRegularization dropouts.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003elearning_rate, batch_size, l2_reg\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[1e-4, 1e-2] (loguniform), [32, 64, 128], [1e-6, 1e-3] (loguniform)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eOptimization 
settings.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBest Hyperparameters for each model after hyperparameter tuning by bayesian optimization\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBest Hyperparameters\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCatBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eborder_count: 113.572, depth: 10.763, iterations: 377.451, l2_leaf_reg: 5.840, learning_rate: 0.096\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e'batch_size': 128, 'filters': 256, 'kernel_size': 7, 'dense_units': 112, 'dropout_rate': 0.2, 'l2_reg': 4.995035980196417e-06, 'learning_rate': 0.0004955676611313104\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eccp_alpha: 0.005, criterion: 0.742, max_depth: 18.752, min_samples_leaf: 
46.240, min_samples_split: 11.054, splitter: 0.198\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGRU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e'batch_size': 32, 'gru_units': 64, 'dropout_rate': 0.2, 'recurrent_dropout_rate': 0.0, 'dense_units': 16, 'l2_reg': 1e-4, 'learning_rate': 0.0003\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e(n_neighbors\u0026thinsp;=\u0026thinsp;19, p\u0026thinsp;=\u0026thinsp;1, weights='distance')\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC: 2.124, penalty: 0.020, solver: 0.970\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression with BERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eC: 1.777, penalty: 0.720, solver: 0.128\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e'batch_size': 128, 'lstm_units': 112, 'dropout_rate': 0.25, 'recurrent_dropout_rate': 0.1, 'dense_units': 26, 'l2_reg': 0.0004702818029667774, 'learning_rate': 0.0015\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eactivation: 0.769, alpha: 0.078, batch_size: 159.662, hidden_layer_sizes: 30.245, learning_rate: 0.886, learning_rate_init: 0.064, 
solver: 0.042\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecriterion: 0.034, max_depth: 14.093, max_features: 0.455, min_samples_leaf: 6.962, min_samples_split: 4.493, n_estimators: 102.007\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecolsample_bytree: 0.846, gamma: 1.380, learning_rate: 0.229, max_depth: 9.390, min_child_weight: 1.392, n_estimators: 458.533, subsample: 0.899\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost with BERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecolsample_bytree: 0.546, gamma: 1.974, learning_rate: 0.149, max_depth: 7.704, min_child_weight: 6.886, n_estimators: 457.855, subsample: 0.919\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eModel training and validation\u003c/h2\u003e \u003cp\u003eThe final four models (LSTM, GRU, CNN, and XGBoost) were examined in detail for their training and validation performance. While LSTM, GRU, and CNN demonstrated a good fit with consistent alignment between training and validation losses, XGBoost showed signs of overfitting, with validation loss diverging from training loss as training progressed. To train the XGBoost model, we used stratified data to maintain class balance and prevent bias during training. 
For the CNN model, both training and validation accuracies increased steadily, eventually plateauing at 0.94 and 0.92, respectively, after 30 epochs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). The corresponding training and validation losses decreased consistently, stabilizing around 0.25 and 0.26 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). The GRU model showed rapid improvements in both training and validation accuracies, which plateaued at 0.94 and 0.92 after approximately 20 epochs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). Similarly, training and validation losses initially declined before leveling off at 0.24 and 0.25, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). For the LSTM model, training and validation accuracies exhibited a steady increase, plateauing at 0.93 and 0.92 after about 30 epochs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC). Training and validation losses decreased gradually and stabilized at 0.23 and 0.24 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). In the case of XGBoost, training accuracy rose rapidly, peaking at 0.995 after 10 iterations, while validation accuracy fluctuated slightly before stabilizing around 0.93 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eD). Training loss declined sharply, reaching a plateau at 0.01 after 10 iterations, whereas validation loss showed some variability but eventually settled around 0.2 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD). XGBoost achieved the lowest training loss and highest training accuracy but exhibited a larger gap between training and validation performance, indicating a risk of overfitting. 
In contrast, CNN, GRU, and LSTM maintained more balanced performance between training and validation, indicating superior generalization.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eModel testing\u003c/h2\u003e \u003cp\u003eThe test-set performance metrics for the 12 models, initially trained on 10% training data, are provided in Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e. The results show that KNN, while simple and interpretable, struggled to capture complex patterns, as reflected in its lower performance compared to the deep learning models. Random Forest showed balanced precision and recall but lacked the higher accuracy seen in the top performers. Logistic Regression, despite being one of the simpler models, still outperformed some other traditional models, although its linearity assumption limited its ability to handle complex patterns. The Decision Tree model was notably the weakest, with poor generalization and high sensitivity to noise, as seen in its high false positive and false negative rates. Logistic Regression with BERT and XGBoost with BERT embeddings achieved accuracies of 82% and 81.94%, respectively. While BERT embeddings improved feature representation, these models did not surpass the traditional machine learning models, suggesting that BERT\u0026rsquo;s contextual embeddings, while useful, did not offer substantial benefits for name-based gender classification. When BERT embeddings were combined with XGBoost, there were no significant improvements in accuracy. 
Performing well with categorical data, CatBoost, similar to XGBoost, provided robust results without overfitting, although it did not surpass XGBoost in precision or recall.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance metrics for the 12 ML models (10% training data) on the test dataset.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel/Metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eF1-Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCatBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.72\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.71\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGRU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogReg-BERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMLP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost with BERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.82\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe four best-performing models were then trained on 100% of the data and tested to assess how well they assign gender from names. Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e summarizes the model performance from these testing experiments. The accuracies ranged from 81.97% to 93.37% across the models. The LSTM model achieved the highest accuracy of 93.37%, with balanced precision, recall, and F1 scores for both male and female names, but required nearly 50 seconds for predictions. Its strength lies in capturing temporal dependencies and complex sequential patterns, contributing to consistent predictions with minimal bias, as shown in the confusion matrix. The CNN model followed closely with an accuracy of 92.99% and the fastest prediction time, just 5.61 seconds. It effectively extracted features from name sequences, capturing spatial hierarchies and complex patterns. However, its confusion matrix indicated slightly higher misclassification rates than that of LSTM, particularly for names with ambiguous gender associations. The GRU model achieved an accuracy of 92.65%, slightly behind LSTM and CNN, and took 10.44 seconds for predictions. It performed well across both genders, with a precision of 0.94 and recall of 0.90 for male names and a precision of 0.92 and recall of 0.93 for female names, showing a balanced distribution of false positives and false negatives across genders. 
GRU\u0026rsquo;s ability to capture sequential dependencies is similar to that of LSTM. XGBoost, a traditional machine learning model, achieved an accuracy of 87.24%. While it did not surpass the deep learning models, it showed strong performance, especially for male names, with a precision and recall of 0.87. For female names, the precision was 0.86 and the recall was 0.87, yielding an F1 score of 0.86 for both genders. XGBoost took 8.48 seconds to make predictions. Final evaluation metrics for the four models are presented in Table\u0026nbsp;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance metrics for the final models trained on full data.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" 
colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eModels/metrics\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003eF1 score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c8\" namest=\"c7\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eTime taken for test results\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e8.46s\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGRU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e10.44s\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e 
\u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e49.99s\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.96\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e5.61s\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eConfusion matrix\u003c/h2\u003e \u003cp\u003eThe confusion matrix evaluates classification performance by identifying true positives, false positives, true negatives, and false negatives. This analysis plays a crucial role in measuring model accuracy and reliability. 
For model evaluation, the performance metrics (accuracy, precision, recall, and F1-score) were analyzed following the methodology of Ghate et al. (2024).\u003c/p\u003e \u003cp\u003eFigure S3 presents the confusion matrices for the initial 12 models trained on 10% of the data, highlighting their performance on male and female name classification. Models including LogReg-BERT and XGBoost showed balanced classification, while the Decision Tree model demonstrated the poorest overall performance, with the highest misclassification rate for female names. Based on these results, the top four performing models were selected for further testing, and their confusion matrices are compared in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. Each model's accuracy, precision, recall, and confusion matrix provide insight into its effectiveness in predicting class labels. The XGBoost model achieved an accuracy of 0.8647, that is, correct predictions in 86.47% of cases (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eD), the lowest overall performance among the top models. In contrast, the LSTM model demonstrated robust performance with an accuracy of 94.0%, effectively identifying names across both classes, as reflected in its confusion matrix (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC). The GRU model also performed well, achieving an accuracy of 0.95, and its confusion matrix indicates that it classified male and female names better than the other models (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB). This performance highlights the GRU's capability to capture most actual positive instances while still exhibiting notable error rates. 
Finally, the CNN model outperformed all others with an accuracy of 96.0%, achieving precision and recall scores of 0.96 for both male and female classes. Its confusion matrix (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA) highlights its ability to consistently and accurately classify names across categories, making it the most effective model overall.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe results of this study demonstrate the effectiveness of CNN, LSTM, GRU, and XGBoost models in gender classification tasks, with each model showcasing distinct strengths. Among the models, CNN achieved the highest overall accuracy of 96%, followed by LSTM and GRU at 95% and 93%, respectively. XGBoost, while effective as a traditional machine learning model, trailed behind with an accuracy of 86%. These findings highlight the superiority of deep learning models in leveraging hierarchical and sequential patterns in names for gender classification.\u003c/p\u003e \u003cp\u003eCNN’s performance can be attributed to its ability to learn hierarchical representations of character sequences, capturing both local and global patterns in names (Choudhary et al., 2021; Gao et al., 2023). Its high accuracy (96%) reflects its robustness in extracting substrings, phonetic structures, and cultural naming conventions that correlate with gender (Pramanik \u0026amp; Bag, 2021; Sharma et al., 2021). Additionally, CNN achieved the fastest test time of 5.61 seconds, further supporting its efficiency in large-scale applications. These results align with prior research, such as Rego et al. (2021), which demonstrated the utility of CNNs in text classification. 
CNN’s computational efficiency, due to features like weight sharing and pooling layers, allowed it to generalize effectively without overfitting.\u003c/p\u003e \u003cp\u003eLSTM and GRU models, with their sequential processing capabilities, also delivered strong performance. Both models demonstrated high precision, recall, and F1 scores of 0.95 for male and female classifications, indicating their ability to capture long-term dependencies in character sequences. LSTM's slightly higher accuracy (95%) compared to GRU (93%) may be attributed to its more elaborate gating mechanisms, which are more effective at retaining relevant information over longer sequences. However, LSTM's significantly longer computation time (49.99 seconds) compared with GRU (10.44 seconds) and CNN (5.61 seconds) suggests it is less suitable for real-time applications. These results are consistent with findings from Greff et al. (2016) and Lu \u0026amp; Salem (2017), where LSTMs were noted for their precision but also for their higher computational cost.\u003c/p\u003e \u003cp\u003eXGBoost, despite its lower accuracy of 86%, demonstrated competitive performance in handling imbalanced datasets and produced reasonably balanced precision and recall values for male and female classifications. However, its F1 score of 0.87 and its test time of 8.46 seconds, slower than CNN's 5.61 seconds, highlight its limitations compared to deep learning models. XGBoost struggled particularly with female name classification, indicating potential biases in feature representation. These results align with findings from Zhang et al. (2022), where XGBoost was noted for its effectiveness in high-dimensional feature spaces but limited performance in complex, sequential tasks.\u003c/p\u003e \u003cp\u003eThe precision, recall, and F1 scores for all models reveal the strengths and limitations of their architectures. 
CNN, LSTM, and GRU all achieved balanced precision and recall values for both male and female classifications, demonstrating their ability to handle gender-ambiguous names. The confusion matrices of our top-performing models further highlight the strengths and limitations of each approach. CNN exhibited the lowest misclassification rates, particularly for ambiguous names, reinforcing its ability to generalize across diverse naming conventions. LSTM and GRU, while highly accurate, showed slightly higher false-positive rates than CNN, suggesting potential sensitivity to certain naming patterns. XGBoost, while robust for male name classification, exhibited higher false-positive rates for female names, indicating a potential bias in feature representation that future studies could address through further feature engineering or additional linguistic input.\u003c/p\u003e \u003cp\u003eCompared to existing studies, this work provides a comprehensive evaluation of model performance using a large and balanced dataset. While traditional models like Random Forest and Logistic Regression have shown F1 scores around 87% for similar tasks (Pham \u0026amp; Nguyen, 2023), our study highlights the clear advantage of deep learning models. Furthermore, the limited impact of advanced embeddings like BERT in this study underscores the simplicity of name-based datasets, where character-level patterns suffice for achieving high accuracy. 
Each model's architecture is suited to handling sequential and complex data patterns, positioning these models as powerful tools for gender prediction and related applications in natural language processing.\u003c/p\u003e "},{"header":"Limitations","content":"\u003cp\u003eThis study highlights the potential of machine learning models for automating gender classification of names, but it also underscores certain limitations. The dataset used in this study underrepresents North-East Indian names, which could result in biased models with limited generalizability across India's diverse population. Additionally, the models struggled to classify unisex names accurately, leading to a higher likelihood of misclassification in such cases. While several models achieved high accuracy, significant rates of false positives and false negatives were observed, which could have critical implications in real-world applications where precision is essential. Furthermore, prediction times varied considerably among the models; for instance, the LSTM model, despite its strong performance, had significantly slower prediction times (about 50 seconds) compared to CNN (5.61 seconds), potentially limiting its use in time-sensitive scenarios.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study evaluated several machine learning and deep learning models for gender classification based on names, focusing on the accuracy, precision, and recall of each model. These findings have significant implications for gender classification, particularly in analyzing gender bias in sectors such as technology, media, academia, and business. The effectiveness of deep learning models in predicting gender from names offers a strong foundation for developing automated gender analysis tools. Among the models, CNN emerged as the most efficient and accurate, achieving 96% accuracy with the fastest prediction time of 5.61 seconds. 
LSTM followed closely with an accuracy of 95%, albeit with a much slower prediction time of about 50 seconds. Both models demonstrated consistently high precision and recall across gender classes, effectively identifying patterns in name data. These results highlight the potential of machine learning and deep learning models in gender classification of Indian names, offering a foundation for the development of automated tools in computational social science and other practical applications. Future research should focus on refining these models to enhance computational efficiency, reduce error rates, and address biases. Expanding the dataset to include more diverse linguistic and cultural representation, particularly from underrepresented regions like North-East India, would further improve model generalizability and robustness.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section2\"\u003e \u003ch2\u003eCRediT authorship contribution statement\u003c/h2\u003e \u003cp\u003e \u003cb\u003eSDG\u003c/b\u003e: Conceptualization, Resources, Software, Validation, Methodology, Supervision, Writing - original draft, Writing - review \u0026amp; editing. \u003cb\u003eSH\u003c/b\u003e: Data curation, Writing - original draft, Visualization. \u003cb\u003eDDG\u003c/b\u003e: Data curation, Methodology, Software, Validation, Writing - original draft. \u003cb\u003eAM\u003c/b\u003e: Data curation, Methodology, Software, Visualization. \u003cb\u003eAA\u003c/b\u003e: Data curation, Writing - original draft. \u003cb\u003eND\u003c/b\u003e: Writing - review \u0026amp; editing. 
\u003cb\u003ePP\u003c/b\u003e: Writing - review \u0026amp; editing.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eDeclaration of Competing Interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of generative AI and AI-assisted technologies in the writing process\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDuring the preparation of this work, the author(s) used [QuillBot] to improve clarity, engagement, and grammar. 
After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe synthetic name generator code used in this study is publicly available on GitHub at github.com/ghatesudi/synthetic_names and can be freely accessed under the MIT License. Researchers interested in accessing the name predictor ML model may contact the corresponding author for further information or access instructions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eORCID ID\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDhanush Ghate D: https://orcid.org/0009-0004-9143-1267\u003c/p\u003e\n\u003cp\u003eSaishma H: https://orcid.org/0009-0003-4454-3584\u003c/p\u003e\n\u003cp\u003eAdithya M: https://orcid.org/0009-0004-2073-8210\u003c/p\u003e\n\u003cp\u003eSudeep D. Ghate: https://orcid.org/0000-0001-9996-3605\u003c/p\u003e\n
\u003cp\u003eNeevan D\u0026rsquo;Souza: https://orcid.org/0000-0002-4043-9638\u003c/p\u003e\n\u003cp\u003eAnjusha Alex: https://orcid.org/0009-0004-2899-9554\u003c/p\u003e\n\u003cp\u003ePrakash Patil: https://orcid.org/0000-0002-1263-8517\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eTripathi A, Faruqui M (2011) Gender prediction of Indian names. In IEEE Technology Students' Symposium (pp. 137\u0026ndash;141). IEEE\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeTienne DR, Chandler GN (2007) The role of gender in opportunity identification. Entrepreneurship Theory Pract 31(3):365\u0026ndash;386\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRadhakrishnan S (2011) Appropriately Indian: Gender and culture in a new transnational class. Duke University Press\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu Y, Hu C, Tran T, Kasturi T, Joseph E, Gillingham M (2021) What\u0026rsquo;s in a name? \u0026ndash; gender classification of names with character-based machine learning models. 
Data Min Knowl Disc 35(4):1537\u0026ndash;1563\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmarappa S, Sathyanarayana S (2015) Kannada named entity recognition and classification (NERC) based on multinomial na\u0026iuml;ve Bayes classifier. Int J Nat Lang Comput, 4\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJia Y, Zhao Y (2019) Gender identification in Chinese names. Lingua 234:102759\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eManik LP, Syafiandini AF, Mustika HF, Akbar Z, Rianto Y (2019) Gender inference based on Indonesian name and profile photo. In 2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA) (pp. 25\u0026ndash;29). IEEE\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRego RC, Silva VM (2021) Predicting gender of Brazilian names using deep learning. arXiv:2106.10156\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSun Z, Li X, Sun X, Meng Y, Ao X, He Q, \u0026hellip; Li J (2021) ChineseBERT: Chinese pretraining enhanced by glyph and pinyin information. arXiv preprint arXiv:2106.16038\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSingh KS (1996) Communities, segments, synonyms, surnames and titles, vol 8. Oxford University Press, USA\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSharma DD (2005) Panorama of Indian Anthroponomy: An Historical, Socio-cultural \u0026amp; Linguistic Analysis of Indian Personal Names. Mittal\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, \u0026hellip; Zheng X (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265\u0026ndash;283)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSammut C, Webb GI (eds) (2011) Encyclopedia of machine learning. 
Springer Science \u0026amp; Business Media\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCavnar WB, Vayda AJ (1992) Using superimposed coding of N-gram lists for Efficient Inexact Matching, Proceedings of the Fifth USPS Advanced Technology Conference, Washington D.C\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z (2015) The singular value decomposition, applications and beyond. \u003cem\u003earXiv preprint arXiv:1510.08532\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLef\u0026egrave;vre S, Piantanida P (2016) Pytesseract: A Python wrapper for Google Tesseract-OCR. J Open-Source Softw 1(8):72\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFenniak M, Stamy M, Thoma M, Peveler M (2024) The pypdf library. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://pypi.org/project/pypdf/\u003c/span\u003e\u003cspan address=\"https://pypi.org/project/pypdf/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRichardson L (2007) Beautiful soup documentation. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://beautiful-soup-4.readthedocs.io/en/latest/\u003c/span\u003e\u003cspan address=\"https://beautiful-soup-4.readthedocs.io/en/latest/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eD\u0026rsquo;Silva J, Sharma U (2022) Automatic text summarization of konkani texts using pre trained word embeddings and deep learning. Int J Electr Comput Eng 12(2):1990\u0026ndash;2000. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.11591/ijece.v12i2.pp1990-2000\u003c/span\u003e\u003cspan address=\"10.11591/ijece.v12i2.pp1990-2000\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhosh S (2021) Identifying click baits using various machine learning and deep learning techniques. Int J Inform Technol (Singapore) 13(3):1235\u0026ndash;1242. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s41870-020-00473-1\u003c/span\u003e\u003cspan address=\"10.1007/s41870-020-00473-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQureshi KA, Sabih M (2021) Un-Compromised Credibility: Social Media Based Multi-Class Hate Speech Classification for Text. IEEE Access 9:109465\u0026ndash;109477. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2021.3101977\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2021.3101977\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRehman TU, Mahmud MS, Chang YK, Jin J, Shin J (2019) Current and future applications of statistical machine learning algorithms for agricultural machine vision systems. Comput Electron Agric 156:585\u0026ndash;605. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.compag.2018.12.006\u003c/span\u003e\u003cspan address=\"10.1016/j.compag.2018.12.006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSyriopoulos PK, Kalampalikis NG, Kotsiantis SB, Vrahatis MN (2023) kNN Classification: a review. Ann Math Artif Intell 1\u0026ndash;33. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/S10472-023-09882-X/METRICS\u003c/span\u003e\u003cspan address=\"10.1007/S10472-023-09882-X/METRICS\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eT\u0026uuml;fekci P, Bektaş M (2022) Author and genre identification of Turkish news texts using deep learning algorithms. \u003cem\u003eSadhana - Academy Proceedings in Engineering Sciences\u003c/em\u003e, \u003cem\u003e47\u003c/em\u003e(4), 194. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s12046-022-01975-3\u003c/span\u003e\u003cspan address=\"10.1007/s12046-022-01975-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZou J, Yuan C, Zhang X, Zou G, Wan ATK (2023) Model averaging for support vector classifier by cross-validation. Stat Comput 33(5):1\u0026ndash;22. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/S11222-023-10284-6/\u003c/span\u003e\u003cspan address=\"10.1007/S11222-023-10284-6/\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRego RC, Silva VM, Fernandes VM (2021) Predicting gender by first name using character-level machine learning. arXiv preprint arXiv :210610156\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePham D, Nguyen L (2023), December Gendec: A Machine Learning-Based Framework for Gender Detection from Japanese Names. In International Conference on Intelligent Systems Design and Applications (pp. 235\u0026ndash;244). 
Cham: Springer Nature Switzerland\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhosh S, Tyagi U, Suri M, Kumar S, Ramaneswaran S, Manocha D (2023) ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER. arXiv preprint arXiv:2306.00928. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2306.00928\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2306.00928\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGao G, Yu Y, Yang J, Qi GJ, Yang M (2020) Hierarchical deep CNN feature set-based representation learning for robust cross-resolution face recognition. IEEE Trans Circuits Syst Video Technol 32(5):2550\u0026ndash;2560\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChoudhury A, Sarma KK (2021) A CNN-LSTM based ensemble framework for in-air handwritten Assamese character recognition. Multimedia Tools Appl 80(28):35649\u0026ndash;35684\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSharma A, Lysenko A, Boroevich KA, Vans E, Tsunoda T (2021) DeepFeature: feature selection in nonimage data using convolutional neural network. Brief Bioinform 22(6):bbab297\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePramanik R, Bag S (2021) Handwritten Bangla city name word recognition using CNN-based transfer learning and FCN. Neural Comput Appl 33:9329\u0026ndash;9341\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu Y, Salem FM (2017), August Simplified gating in long short-term memory (lstm) recurrent neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS) (pp. 1601\u0026ndash;1604). IEEE\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang P, Jia Y, Shang Y (2022) Research and application of XGBoost in imbalanced data. 
Int J Distrib Sens Netw 18(6):15501329221106935\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eProkhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) CatBoost: unbiased boosting with categorical features. \u003cem\u003eAdvances in neural information processing systems\u003c/em\u003e, \u003cem\u003e31\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278\u0026ndash;2324\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKenton JDMWC, Toutanova LK (2019), June Bert: Pre-training of deep bidirectional transformers for language understanding. In \u003cem\u003eProceedings of naacL-HLT\u003c/em\u003e (Vol. 1, p. 2)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTrendowicz A, Jeffery R, Trendowicz A, Jeffery R (2014) Classification and regression trees. \u003cem\u003eSoftware Project Effort Estimation: Foundations and Best Practice Guidelines for Success\u003c/em\u003e, 295\u0026ndash;304\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen T, Guestrin C (2016), August Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785\u0026ndash;794)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHochreiter S, Schmidhuber J (1997) Long Short-Term Memory, in Neural Computation, vol. 9, no. 8, pp. 1735\u0026ndash;1780, 15 Nov. 1997. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1162/neco.1997.9.8.1735\u003c/span\u003e\u003cspan address=\"10.1162/neco.1997.9.8.1735\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBreiman L (2001) Random forests. 
Mach Learn 45:5\u0026ndash;32\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCho K, Van Merri\u0026euml;nboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. \u003cem\u003earXiv preprint arXiv:1406.1078\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhate D, Saishma H, Adithya M, Ghate SD (2025) Advancing Arecanut Quality Grading: A Comparative Analysis of YOLO Models with Hyperparameter Optimization. PREPRINT (Version 1) available at Research Square [\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.21203/rs.3.rs-5755373/v1]\u003c/span\u003e\u003cspan address=\"10.21203/rs.3.rs-5755373/v1]\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHunter JD (2007) Matplotlib: A 2D graphics environment. Comput Sci Eng 9(3):90\u0026ndash;95\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaskom M et al (2020) Seaborn: statistical data visualization. J Open Source Softw 5(51):241\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVirtanen P et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17(3):261\u0026ndash;272\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePanchenko A, Teterin A (2014), April Detecting gender by full name: experiments with the Russian language. In \u003cem\u003eInternational Conference on Analysis of Images, Social Networks and Texts\u003c/em\u003e (pp. 169\u0026ndash;182). Cham: Springer International Publishing\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSantosh TYSS, Sanyal DK, Das PP (2020) Person Name Segmentation with Deep Neural Networks. 
In \u003cem\u003eMining Intelligence and Knowledge Exploration: 7th International Conference, MIKE 2019, Goa, India, December 19\u0026ndash;22, 2019, Proceedings 7\u003c/em\u003e (pp. 32\u0026ndash;41). Springer International Publishing\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKabir MH, Ahmad F, Hasan MAM, Shin J (2022) Gender Recognition of Bangla Names Using Deep Learning Approaches. Appl Sci 13(1):522\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGreff K, Srivastava RK, Koutn\u0026iacute;k J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey. IEEE Trans neural networks Learn Syst 28(10):2222\u0026ndash;2232\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Nitte University","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Gender classification, Indian names, Machine learning, Feature extraction, Deep learning, Dataset diversity","lastPublishedDoi":"10.21203/rs.3.rs-5897194/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5897194/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eClassifying gender based on Indian names poses 
a unique challenge due to the nation's immense cultural, linguistic, and regional diversity. Existing methods often struggle to address the complexities of naming conventions shaped by religious, familial, and linguistic influences, resulting in inconsistent and inaccurate classifications. To address these challenges, this study developed a culturally diverse dataset of 31.3 lakh male and female names and leveraged advanced machine learning (ML) and deep learning (DL) techniques for gender classification. These names were sourced from Indian electoral data, synthetic names generated using custom scripts, and publicly available names from websites to ensure diversity. Twelve ML models were evaluated, with the top four - Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and XGBoost\u0026mdash;prioritized for detailed analysis.\u003c/p\u003e \u003cp\u003eCNN emerged as the best-performing model, achieving the highest accuracy (96%) and the fastest prediction time (5.61 seconds), highlighting its efficiency and ability to generalize across diverse naming conventions. LSTM and GRU also demonstrated strong performance, achieving accuracies of 95% and 93% respectively, with LSTM offering higher precision but significantly longer prediction times (50 seconds). XGBoost, a traditional ML model, achieved an accuracy of 86% but struggled with female name classification, indicating potential biases in feature representation. All models effectively captured complex naming patterns, though challenges such as the misclassification of unisex names and the underrepresentation of North-East Indian names in the dataset highlighted areas for improvement.\u003c/p\u003e \u003cp\u003eThis study underscores the advantages of deep learning models, particularly CNN, in leveraging hierarchical and sequential patterns in names for robust gender classification. 
However, limitations in dataset diversity and model generalizability indicate the need for further refinement. These findings contribute to advancing automated gender classification systems, offering practical applications in healthcare, marketing, and social sciences. Future work should focus on enhancing computational efficiency, expanding datasets to improve cultural inclusivity, and addressing biases to ensure equitable ML innovations.\u003c/p\u003e","manuscriptTitle":"Decoding Gender: A Machine Learning Approach for Classifying Indian Names with Advanced Feature Extraction","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-01-29 08:05:44","doi":"10.21203/rs.3.rs-5897194/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"29cf4949-0f02-49cf-a5ef-29e5228b426b","owner":[],"postedDate":"January 29th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":43382294,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-01-29T08:05:44+00:00","versionOfRecord":[],"versionCreatedAt":"2025-01-29 
08:05:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5897194","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5897194","identity":"rs-5897194","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


