Decoding Gender: A Machine Learning Approach for Classifying Indian Names with Advanced Feature Extraction

Sudeep D. Ghate
Saishma H
Dhanush Ghate D
Adithya M
Anjusha Alex
Neevan D’Souza
Prakash Patil

Abstract

Classifying gender from Indian names is uniquely challenging because of the nation's immense cultural, linguistic, and regional diversity. Existing methods often struggle with naming conventions shaped by religious, familial, and linguistic influences, resulting in inconsistent and inaccurate classifications. To address these challenges, this study developed a culturally diverse dataset of 31.3 lakh (about 3.13 million) male and female names and applied machine learning (ML) and deep learning (DL) techniques to gender classification. To ensure diversity, the names were drawn from Indian electoral data, synthetic names generated with custom scripts, and publicly available names from websites. Twelve ML models were evaluated, and the top four, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and XGBoost, were prioritized for detailed analysis. CNN emerged as the best-performing model, achieving the highest accuracy (96%) and the fastest prediction time (5.61 seconds), which highlights its efficiency and its ability to generalize across diverse naming conventions. LSTM and GRU also performed strongly, with accuracies of 95% and 93% respectively; LSTM offered higher precision but a significantly longer prediction time (50 seconds). XGBoost, a traditional ML model, achieved an accuracy of 86% but struggled with female name classification, indicating potential biases in feature representation. All models captured complex naming patterns, though the misclassification of unisex names and the underrepresentation of North-East Indian names in the dataset highlighted areas for improvement. This study underscores the advantages of deep learning models, particularly CNN, in leveraging hierarchical and sequential patterns in names for robust gender classification. However, limitations in dataset diversity and model generalizability indicate the need for further refinement. These findings contribute to advancing automated gender classification systems, with practical applications in healthcare, marketing, and the social sciences. Future work should focus on enhancing computational efficiency, expanding datasets to improve cultural inclusivity, and addressing biases to ensure equitable ML innovations.
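
The abstract does not describe how names are turned into model inputs, so the following is only a minimal sketch of one common approach, not the authors' pipeline. It assumes names are lowercased and encoded character by character into fixed-length integer sequences (the usual input to a CNN, LSTM, or GRU), and that trailing-character suffixes serve as hand-crafted features for a tree model such as XGBoost. The function names, the `max_len` value, and the example names are all hypothetical and chosen for illustration.

```python
# Hypothetical sketch: character-level encoding of names.
# Nothing here is taken from the paper; ids, lengths, and names are assumptions.

PAD, UNK = 0, 1  # reserved ids for padding and unseen characters


def build_vocab(names):
    """Map each character seen in the names to an integer id (2 and up)."""
    chars = sorted({ch for name in names for ch in name.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name, vocab, max_len=12):
    """Convert a name to a fixed-length id list, truncating or right-padding."""
    ids = [vocab.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def suffix_features(name, n=3):
    """Return the trailing 1..n character suffixes of a name."""
    lowered = name.lower()
    return [lowered[-k:] for k in range(1, min(n, len(lowered)) + 1)]


if __name__ == "__main__":
    vocab = build_vocab(["Asha", "Ravi"])  # illustrative names only
    print(encode("Asha", vocab, max_len=6))
    print(suffix_features("Asha"))
```

Sequences produced by `encode` could feed an embedding layer, while the suffix strings would typically be one-hot or hashed before reaching a tree-based model; either choice would need to be checked against the full article.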

Version published to 10.21203/rs.3.rs-5897194/v1 on Research Square
Jan 29, 2025

Related articles

A Dual-Architecture Deep Learning Pipeline for Real-Time High-Accuracy Arabic Sign Language Recognition

This article has 3 authors:
1. Asmaa Youssef
2. Amira Gaber
3. Shereen M. El-Metwally
This article has no evaluations. Latest version: Feb 4, 2026

Deep Learning Based Bi-Directional LSTM for Sentiment Analysis of Health App Reviews

This article has 4 authors:
1. Linda Varghese
2. Rajesh R Pai
3. Shavantrevva Bilakeri
4. Naganna Chetty
This article has no evaluations. Latest version: Feb 16, 2026

A Machine Learning Approach for Nominative Record Linkage in Chinese Historical Databases

This article has 4 authors:
1. Bruce Yu
2. Yueran Hou
3. Yibei Wu
4. Cameron Campbell
This article has no evaluations. Latest version: Mar 27, 2026
