hdx-python-model

Jupyter Notebook of the Model

The notebook contains the ML model used by the API. Running it produces a pickle file (for the API) and the top 3 predicted tags for a given header, each with its confidence level.
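As an illustration of the "top 3 tags with confidence levels" output, here is a minimal sketch. It assumes the classifier yields one probability per candidate tag (as scikit-learn classifiers do through `predict_proba`). The function name `top_k_tags`, the tags, and the scores are hypothetical and are not taken from the notebook.

```python
from typing import List, Sequence, Tuple


def top_k_tags(
    labels: Sequence[str], probabilities: Sequence[float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable (tag, confidence) pairs, highest first."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    # Pair each tag with its probability and sort by probability, descending.
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical tags and scores for a single header, for illustration only.
example_labels = ["#adm1", "#date", "#population", "#org"]
example_probabilities = [0.62, 0.05, 0.25, 0.08]

print(top_k_tags(example_labels, example_probabilities))
# prints [('#adm1', 0.62), ('#population', 0.25), ('#org', 0.08)]
```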

To run the notebook, follow the steps below:

Optional: Create a virtual environment and install an ipython kernel
- Plain python:
  - python -m venv .venv
  - Windows: .venv\Scripts\activate[.ps1 or other appropriate suffix]; macOS/Linux: source .venv/bin/activate
  - python -m ipykernel install --user --name hdx-model
- Conda:
  - conda create -n hdx-model
  - conda activate hdx-model (or source activate hdx-model on older conda versions)
  - python -m ipykernel install --name hdx-model
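To confirm that an environment is active before installing anything, a small check like the following can help. It is a sketch that uses only the standard library. The function name `active_environment` is hypothetical. The conda check relies on the `CONDA_DEFAULT_ENV` variable, which conda also sets for its base environment, so treat that result as a hint.

```python
import os
import sys
from typing import Mapping, Optional


def active_environment(
    prefix: str = sys.prefix,
    base_prefix: str = sys.base_prefix,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return "venv", "conda", or None depending on what looks active."""
    # Inside a venv, sys.prefix points at the environment, not the base install.
    if prefix != base_prefix:
        return "venv"
    # conda exports CONDA_DEFAULT_ENV after activation (including for "base").
    if environ.get("CONDA_DEFAULT_ENV"):
        return "conda"
    return None


print(active_environment())
```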
Install requirements
- fasttext requires a C++ compiler. On Windows platforms this will require an additional installation. Mac and Linux platforms should have a compiler pre-installed.
- pip install -r requirements.txt
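Before running pip, it may be worth checking whether a C++ compiler is on the PATH. The sketch below is an assumption-laden helper, not part of the repository: the function name `find_cxx_compiler` and the list of candidate executables are illustrative. On Windows, `cl` is usually only on the PATH inside a Visual Studio developer prompt.

```python
import shutil
from typing import Iterable, Optional

# Common compiler executable names; this list is illustrative, not exhaustive.
DEFAULT_COMPILERS = ("c++", "g++", "clang++", "cl")


def find_cxx_compiler(candidates: Iterable[str] = DEFAULT_COMPILERS) -> Optional[str]:
    """Return the path of the first compiler found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


compiler = find_cxx_compiler()
if compiler is None:
    print("No C++ compiler found on PATH; the fasttext install may fail.")
else:
    print(f"Found C++ compiler: {compiler}")
```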
Make sure that your working directory includes the training dataset 'headertag_fulldataset.xlsx', which contains the pruned data of all the tagged datasets in HDX as of 3/15/2019.
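A quick way to verify that the dataset is where the notebook expects it is sketched below. The file name comes from this README. The helper `find_dataset` is hypothetical, and it assumes the notebook reads the file from the directory it is started in.

```python
from pathlib import Path
from typing import Optional

DATASET_NAME = "headertag_fulldataset.xlsx"


def find_dataset(directory: str = ".") -> Optional[Path]:
    """Return the dataset path if the file exists in `directory`, else None."""
    path = Path(directory) / DATASET_NAME
    return path if path.is_file() else None


if find_dataset() is None:
    print(f"{DATASET_NAME} not found in the current directory.")
else:
    print(f"Found {DATASET_NAME}.")
```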
Open the notebook in Jupyter. Select the hdx-model kernel created earlier.
Make sure the model parameters in the first cell of the notebook are set correctly:
- create_dataset (default: False) - boolean that determines whether to download the training data from the HDX database (True) or to use the pre-loaded Excel file (False)
- SAMPLE_NUMBER_OF_DATASETS (default: 150) - number of training datasets to scrape from the HDX database
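The two parameters could interact roughly as sketched below. Only the parameter names and their defaults come from this README. The helper `describe_data_source` and its messages are hypothetical and stand in for the notebook's actual loading code, which is not shown here.

```python
# Parameter names and defaults as documented above.
create_dataset = False
SAMPLE_NUMBER_OF_DATASETS = 150


def describe_data_source(create_dataset: bool, sample_size: int) -> str:
    """Describe which data source the given settings would select."""
    if create_dataset:
        if sample_size <= 0:
            raise ValueError("SAMPLE_NUMBER_OF_DATASETS must be positive")
        return f"download {sample_size} datasets from HDX"
    # With create_dataset=False, the sample size is not used.
    return "load headertag_fulldataset.xlsx"


print(describe_data_source(create_dataset, SAMPLE_NUMBER_OF_DATASETS))
# prints load headertag_fulldataset.xlsx
```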
Run the notebook. On the first run it performs at least one major setup action:
- Downloading the pre-trained word vectors if they are not already present locally. This is an ~8GB download.
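Because the download is large, it can be useful to check up front whether it will be triggered and whether there is room for it. The following is a sketch of a "download only if missing" check. The function name `should_download` is hypothetical, the vector file name is whatever path you pass in, and the 8 GiB threshold is an approximation of the figure above. It performs no download itself.

```python
import shutil
from pathlib import Path

# Approximate download size stated in this README (~8GB).
REQUIRED_BYTES = 8 * 1024**3


def should_download(vector_path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return False if the file already exists, True if a download is needed.

    Raises OSError when a download is needed but free disk space is too low.
    """
    path = Path(vector_path)
    if path.is_file():
        return False
    # Check free space where the file would be written (fall back to cwd).
    target_dir = path.parent if path.parent.exists() else Path(".")
    free_bytes = shutil.disk_usage(str(target_dir)).free
    if free_bytes < required_bytes:
        raise OSError(
            f"Need about {required_bytes} bytes free, only {free_bytes} available"
        )
    return True
```

Calling `should_download("some_vectors_file")` before running the notebook indicates whether the first run is likely to trigger the download.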

