This repo contains the raw data and code to build a gender classifier based on first names. It also includes the link, files, and code for a Shiny app where you can use the model for inference.
In the clf folder, you will find the data used to train and test the model, and the R code for four classifiers, each with hyperparameter tuning using three-fold cross-validation.
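As a rough illustration of what three-fold cross-validation means, here is a minimal base R sketch. It is not the code in the clf folder: the row count, seed, and variable names are assumptions made up for this example, and the model fitting step is left as a comment.

```r
# Hypothetical example: assign 10 rows to 3 cross-validation folds.
set.seed(42)                       # fixed seed so the split is reproducible
n <- 10                            # number of training rows (made up)
k <- 3                             # number of folds, as described above
fold_id <- sample(rep(seq_len(k), length.out = n))  # shuffled fold labels

for (fold in seq_len(k)) {
  valid_idx <- which(fold_id == fold)   # rows held out in this round
  train_idx <- which(fold_id != fold)   # rows used to fit the candidate model
  # Fit a model with one hyperparameter setting on train_idx,
  # then score it on valid_idx (omitted here).
  cat("fold", fold, "- train:", length(train_idx),
      "valid:", length(valid_idx), "\n")
}
```

Each hyperparameter setting would be scored on all three held-out folds, and the setting with the best average score kept.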
The preprocessing only lowercases the text and removes punctuation. I used GloVe embeddings from the textdata package. Based on the metrics, I picked an SVM as the best classifier. My model's AUC is 0.84, with an accuracy of 0.8. Below you can take a look at the ROC curve.
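The preprocessing step described above can be sketched in base R as follows. The function name `clean_name` and the sample names are hypothetical, and the actual code in the repo may do this differently.

```r
# Lowercase a character vector of first names and strip punctuation.
clean_name <- function(x) {
  x <- tolower(x)                 # "Mary-Ann" -> "mary-ann"
  gsub("[[:punct:]]", "", x)      # drop punctuation characters
}

cleaned <- clean_name(c("Mary-Ann", "O'Neil", "JOHN."))
print(cleaned)
# "maryann" "oneil" "john"
```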
It is worth mentioning that the XGB classifier has similar metrics (an AUC of 0.838 and an accuracy of 0.793), followed by Random Forest with an AUC of 0.793 and an accuracy of 0.762. Last is the naive Bayes classifier, with an AUC of 0.733 and an accuracy of 0.561.
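For readers unfamiliar with the two metrics reported above, here is a small base R sketch of how accuracy and AUC can be computed. The labels and scores are made up for illustration, the 0.5 threshold is an assumption, and the repo itself may rely on a package for these numbers.

```r
# Hypothetical labels (1 = positive class) and predicted scores.
truth <- c(1, 0, 1, 1, 0, 0)
score <- c(0.9, 0.4, 0.35, 0.8, 0.2, 0.6)

# Accuracy: share of correct predictions at an assumed 0.5 threshold.
pred <- as.integer(score >= 0.5)
accuracy <- mean(pred == truth)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive gets a higher score than a randomly chosen negative.
auc_rank <- function(truth, score) {
  r <- rank(score)                # average ranks handle tied scores
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(truth, score)

print(c(accuracy = accuracy, auc = auc))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two can disagree on how far apart the classifiers are.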
This folder includes the UI and server code to deploy a Shiny app with the model. Remember that you have to save the model as an .rds file in this folder for the app to work on your machine.
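Saving and reloading the model as an .rds file can be sketched like this. The file name `model.rds`, the temporary directory, and the placeholder object are assumptions for illustration; use the file name and folder that the server code actually expects.

```r
# Stand-in for the trained classifier; a plain list is used only for illustration.
model <- list(type = "svm", note = "placeholder, not a real fitted model")

# "model.rds" is a hypothetical file name; a temp directory keeps the example safe to run.
model_path <- file.path(tempdir(), "model.rds")
saveRDS(model, model_path)        # write the object to disk

# The Shiny server code would then read the file back, typically once at startup.
loaded <- readRDS(model_path)
print(identical(model, loaded))
```

`saveRDS()` stores a single R object, so the same `readRDS()` call works regardless of which classifier was trained.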
Future work could include trying embeddings of different lengths and other classifiers.