ataki/deep-learning-gender


blog-gender-dataset

Maintains the dataset generation procedure for our deep-learning project.

Authors: Jim Zheng, Aric Bartle

Reduced Vocab

  • download word-frequency data
  • prune the data to keep the top N% of words
  • output a word-vector => word mapping
  • wordvector.txt, vocab.txt, vocab.pdb
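
The pruning step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `top_fraction_vocab` is hypothetical, and it assumes "top N%" means the N% most frequent distinct words by rank (rather than the words covering N% of total occurrences).

```python
import math


def top_fraction_vocab(freqs, top_percent):
    """Return the most frequent words, keeping the top `top_percent` percent.

    Assumption: the percentage is taken over distinct words ranked by
    frequency, not over total token mass.
    """
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(freqs, key=lambda w: (-freqs[w], w))
    keep = math.ceil(len(ranked) * top_percent / 100)
    return ranked[:keep]


# Toy frequency table (made-up numbers, for illustration only).
toy_freqs = {"the": 100, "a": 90, "cat": 10, "zebra": 1}
reduced = top_fraction_vocab(toy_freqs, 50)
```

The resulting list could then be written out one word per line, for example to a vocabulary file; how it maps onto the word vectors is left to the actual pipeline.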

Blog Cleanup

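  • go through each blog
    • remove Unicode characters
    • extract words, dropping punctuation
    • convert everything to lowercase
    • replace numbers with the token DG
    • replace out-of-vocabulary words with the token UUNNGG
    • take a parameter k that specifies the maximum number of sentences per example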
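
The cleanup steps above might look something like the sketch below. It is only an illustration under several assumptions: the function names `clean_sentence` and `blog_to_examples` are hypothetical, "remove unicode" is taken to mean dropping non-ASCII characters, sentences are split naively on `.`, `!`, and `?`, and only all-digit tokens count as numbers.

```python
import re

NUM_TOKEN = "DG"      # placeholder for numbers
UNK_TOKEN = "UUNNGG"  # placeholder for out-of-vocabulary words


def clean_sentence(sentence, vocab):
    """Turn one raw sentence into a list of normalized tokens."""
    # Drop non-ASCII characters (assumed meaning of "remove unicode").
    ascii_only = sentence.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then keep alphanumeric runs; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", ascii_only.lower())
    cleaned = []
    for tok in tokens:
        if tok.isdigit():
            cleaned.append(NUM_TOKEN)
        elif tok in vocab:
            cleaned.append(tok)
        else:
            cleaned.append(UNK_TOKEN)
    return cleaned


def blog_to_examples(text, vocab, k):
    """Split a blog into examples of at most k cleaned sentences each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Naive sentence split; a real pipeline may use a proper tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    cleaned = [clean_sentence(s, vocab) for s in sentences]
    cleaned = [c for c in cleaned if c]  # skip sentences with no tokens
    return [cleaned[i:i + k] for i in range(0, len(cleaned), k)]


toy_vocab = {"i", "love", "cats", "the"}
examples = blog_to_examples("I love cats! I have 3 cafés. The end.", toy_vocab, k=2)
```

Note that the naive split treats a decimal such as `3.5` as a sentence boundary and breaks contractions like `don't` into two tokens; whether that matters depends on the downstream model.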