Decoding Gender: A Machine Learning Approach for Classifying Indian Names with Advanced Feature Extraction
Abstract
Classifying gender from Indian names is a distinct challenge because of the nation's immense cultural, linguistic, and regional diversity. Existing methods often struggle with naming conventions shaped by religious, familial, and linguistic influences, which leads to inconsistent and inaccurate classifications. To address these challenges, this study developed a culturally diverse dataset of 31.3 lakh (3.13 million) male and female names and applied machine learning (ML) and deep learning (DL) techniques to gender classification. To ensure diversity, the names were drawn from Indian electoral data, synthetic names generated using custom scripts, and publicly available names from websites. Twelve ML models were evaluated, and the top four (Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and XGBoost) were selected for detailed analysis. CNN was the best-performing model, achieving the highest accuracy (96%) and the fastest prediction time (5.61 seconds), which indicates both efficiency and an ability to generalize across diverse naming conventions. LSTM and GRU also performed strongly, with accuracies of 95% and 93% respectively; LSTM offered higher precision but a much longer prediction time (50 seconds). XGBoost, a traditional ML model, achieved an accuracy of 86% but struggled with female name classification, indicating potential biases in feature representation. All models captured complex naming patterns effectively, though the misclassification of unisex names and the underrepresentation of North-East Indian names in the dataset point to areas for improvement. This study underscores the advantages of deep learning models, particularly CNN, in exploiting hierarchical and sequential patterns in names for robust gender classification.
However, limitations in dataset diversity and model generalizability indicate the need for further refinement. These findings contribute to advancing automated gender classification systems, offering practical applications in healthcare, marketing, and social sciences. Future work should focus on enhancing computational efficiency, expanding datasets to improve cultural inclusivity, and addressing biases to ensure equitable ML innovations.
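To make the idea of "hierarchical and sequential patterns in names" concrete, the sketch below shows one common way a name can be turned into model input. The abstract does not describe the study's actual feature extraction, so everything here is an assumption for illustration: the vocabulary, the padding length, and the helper names `encode_name` and `suffix_ngrams` are hypothetical. Fixed-length character indices are the kind of input a CNN, LSTM, or GRU typically consumes, while suffix n-grams are the kind of hand-built feature a tree model such as XGBoost might use.

```python
import string

# Assumed vocabulary (not from the study): lowercase ASCII letters plus space.
# Index 0 is reserved for padding, 1 for any character outside the vocabulary.
PAD, UNK = 0, 1
VOCAB = {ch: i + 2 for i, ch in enumerate(string.ascii_lowercase + " ")}


def encode_name(name: str, max_len: int = 20) -> list[int]:
    """Map a name to a fixed-length list of character indices.

    Names longer than max_len are truncated; shorter ones are right-padded.
    """
    chars = name.strip().lower()[:max_len]
    ids = [VOCAB.get(ch, UNK) for ch in chars]
    return ids + [PAD] * (max_len - len(ids))


def suffix_ngrams(name: str, n_max: int = 3) -> list[str]:
    """Return the last 1..n_max characters of a name as separate features."""
    cleaned = name.strip().lower()
    return [cleaned[-n:] for n in range(1, n_max + 1) if len(cleaned) >= n]


if __name__ == "__main__":
    print(encode_name("Asha", max_len=6))
    print(suffix_ngrams("Priya"))
```

A sequence model would usually pass the integer indices through an embedding layer before convolution or recurrence. The ASCII-only vocabulary is a simplification: names written in Indic scripts would need either transliteration or a larger character set.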