https://www.kaggle.com/c/quora-question-pairs
The submission scored a log loss of 0.13499 on the private leaderboard and 0.13264 on the public leaderboard
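For reference, log loss is the mean binary cross-entropy between the predicted duplicate probabilities and the true labels, so lower is better. The sketch below is only an illustration of the metric, not the scoring code used by Kaggle or by this project; the function name and the clipping value eps are assumptions.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy over all question pairs.

    y_true: iterable of 0/1 labels (1 = duplicate pair).
    y_pred: iterable of predicted probabilities of the pair being a duplicate.
    eps:    clipping bound (an assumed value) that keeps log() away from 0.
    """
    total = 0.0
    count = 0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count
```

A model that predicts 0.5 for every pair gets a log loss of ln 2 (about 0.693), which gives a rough baseline for reading the scores above.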
The following packages, listed in the requirements.txt file, need to be installed: Distance, fuzzywuzzy, keras, networkx, nltk, numpy, pandas and scikit-learn
Create folders named data and predictions in the project directory
Download the competition data files from the competition's data page on Kaggle and place them in the data folder
Download the pre-trained GloVe word vectors into the project directory
Run the __main__.py file. Running the code in pieces is advised: for reference, a full run took me about 2 days on my HP Spectre (i5)
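The folder setup in the steps above can be sketched as a small helper that creates the data and predictions folders and reports which expected files are still missing. This is a convenience sketch, not part of the project: prepare_project is a hypothetical name, and the file names in the usage note are assumptions that should be replaced with the names of the files you actually downloaded.

```python
from pathlib import Path


def prepare_project(root, expected_files=()):
    """Create the data and predictions folders under root.

    expected_files holds paths relative to root (assumed names, supplied
    by the caller). Returns the ones that do not exist yet.
    """
    root = Path(root)
    for name in ("data", "predictions"):
        # exist_ok=True makes the call safe to repeat
        (root / name).mkdir(parents=True, exist_ok=True)
    return [f for f in expected_files if not (root / f).is_file()]
```

For example, prepare_project(".", ["data/train.csv", "data/test.csv"]) would return whichever of those two assumed file names has not been placed in the data folder yet.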
- The model uses 25 NLP and non-NLP features in total
- A 10-fold validation strategy was used
- GloVe word embeddings and an LSTM model were used to generate predictions
- Rare words in questions were replaced by an invalid word indicator "memento"
- Average ensembling was used to derive the final model predictions
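Two of the points above, rare-word replacement and average ensembling, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in __main__.py: the min_count threshold, the lowercasing, the whitespace tokenization and both function names are assumptions, and averaging over one prediction list per validation fold is a guess at how the ensemble was formed.

```python
from collections import Counter

RARE_TOKEN = "memento"  # the invalid word indicator named in the list above


def replace_rare_words(questions, min_count=2, rare_token=RARE_TOKEN):
    """Replace words seen fewer than min_count times with rare_token.

    Assumptions: lowercased, whitespace-tokenized text, and counts taken
    over all questions passed in.
    """
    tokenized = [q.lower().split() for q in questions]
    counts = Counter(word for words in tokenized for word in words)
    return [
        " ".join(w if counts[w] >= min_count else rare_token for w in words)
        for words in tokenized
    ]


def average_ensemble(model_predictions):
    """Average predictions element-wise across models.

    model_predictions: one equal-length list of probabilities per model
    (for example, one per validation fold).
    """
    n_models = len(model_predictions)
    return [sum(column) / n_models for column in zip(*model_predictions)]
```

With the two questions "how to learn python" and "how to learn cobol", the words python and cobol each appear once and are both replaced by memento, so the model sees a single shared token instead of two words it has almost no evidence about.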