
IR-SMART (TEAM-002)
Information Retrieval - Semantic Answer Type Prediction

About The Project

IR-SMART contains the generated code for a university project located here.

Given a query formatted in natural language, the code should be able to predict the expected answer type from a set of candidate entities in the collected target ontology. In this project, the target ontology is from the DBpedia 2016 dump.
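One way to rank candidate answer types against a query, mirroring the embedding-based approach used later in this project, is to average the word vectors of the query and score each type by cosine similarity. This is a toy sketch with hand-made 2-d vectors standing in for GloVe embeddings; the notebooks' actual scoring may differ:

```python
import math

def avg_vector(tokens, embeddings):
    """Average the embedding vectors of the tokens found in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-d embeddings; the real project loads GloVe vectors instead
emb = {"capital": [1.0, 0.0], "norway": [0.8, 0.2],
       "city": [0.9, 0.1], "person": [0.0, 1.0]}

query = avg_vector(["capital", "norway"], emb)
ranked = sorted(["city", "person"],
                key=lambda t: cosine(query, emb[t]), reverse=True)
# ranked[0] → "city"
```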


Built With

The project has utilized the following tools and libraries extensively:


Getting Started

To get a local copy up and running, follow these simple steps. It is assumed that the user has Jupyter Notebook available, and it is recommended to use a Conda distribution (Anaconda/Miniconda).


Prerequisites

Install the necessary Python libraries (if Conda is not used):

pip install --upgrade elasticsearch gensim numpy scipy scikit-learn

Other dependencies might exist, but they were installed through the Conda distribution.


Dataset

Due to the overall size of the dataset, the following files have to be downloaded separately:

  1. DBpedia long_abstract_en.ttl
  2. DBpedia instance_types_en.ttl
  3. SeMantic AnsweR Type dataset
  4. GloVe Wikipedia 2014 + Gigaword 5 pretrained embeddings
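The two DBpedia .ttl dumps are line-oriented, N-Triples-style files. A minimal sketch of parsing one line into a (subject, predicate, object) triple — the regex and the behavior for literal objects are assumptions about the dump format, not code from the notebooks:

```python
import re

# One triple per line: <subject> <predicate> <object> .
# (Object may also be a quoted literal, e.g. abstracts; this sketch
# only strips angle brackets and leaves literals untouched.)
TRIPLE_RE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')

def parse_triple(line):
    """Return (subject, predicate, object) or None for comments/garbage."""
    if line.startswith('#'):
        return None
    m = TRIPLE_RE.match(line)
    if not m:
        return None
    subj, pred, obj = m.groups()
    return subj, pred, obj.strip('<>')

line = ('<http://dbpedia.org/resource/Oslo> '
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://dbpedia.org/ontology/City> .')
triple = parse_triple(line)
```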

File structure

Once all the files have been downloaded, extract them and place them so that the directory structure is as follows (the files marked with ## are the files you need to download and place yourself):

📦IR-SMART
 ┣ 📂datasets
 ┃ ┣ 📂DBpedia
 ┃ ┃ ┣ 📜instance_types_en.ttl ##
 ┃ ┃ ┣ 📜long_abstracts_en.ttl ##
 ┃ ┃ ┣ 📜smarttask_dbpedia_test_questions.json ##
 ┃ ┃ ┗ 📜smarttask_dbpedia_train.json ##
 ┃ ┣ 📂gensim
 ┃ ┃ ┗ 📜...
 ┃ ┗ 📂glove
 ┃   ┣ 📜glove.6B.100d.txt ##
 ┃   ┣ 📜glove.6B.200d.txt ##
 ┃   ┣ 📜glove.6B.300d.txt ##
 ┃   ┗ 📜glove.6B.50d.txt  ##
 ┣ 📂results
 ┃ ┣ 📜advanced.csv
 ┃ ┣ 📜advanced_word2vec.csv
 ┃ ┣ 📜baseline.csv
 ┃ ┗ 📜test_type_predictions.csv
 ┣ 📜.gitignore
 ┣ 📜baseline_variable_test.ipynb
 ┣ 📜evaluation.ipynb
 ┣ 📜indexer.ipynb
 ┣ 📜indexer_compact.ipynb
 ┣ 📜LICENSE
 ┣ 📜README.md
 ┗ 📜trial_and_error.ipynb

The necessary code to execute is located in indexer_compact.ipynb and evaluation.ipynb.

The other notebooks contain an alternative, larger index (indexer.ipynb) and tests of how varying parameter values affected the score (baseline_variable_test.ipynb). trial_and_error.ipynb contains a failed early attempt to make the ES indexing more efficient by first loading all data files into memory and only then initializing the ES indexing (not recommended to run).
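For context on what the indexing notebooks produce: the Elasticsearch bulk API consumes action dicts shaped like the ones below. The index and field names here are illustrative assumptions, not the notebooks' actual schema:

```python
def to_bulk_actions(docs, index_name="dbpedia"):
    """Yield Elasticsearch bulk-helper actions from (entity, abstract) pairs.

    Index and field names are illustrative; the notebooks define their own.
    """
    for entity, abstract in docs:
        yield {
            "_index": index_name,
            "_id": entity,
            "_source": {"abstract": abstract},
        }

# With the elasticsearch client these would be passed to helpers.bulk(es, ...);
# here we only materialize them to show the shape.
actions = list(to_bulk_actions([("Oslo", "Oslo is the capital of Norway.")]))
```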


Final Steps

  • Execute all cells within indexer_compact.ipynb; this will generate the Elasticsearch index necessary for all subsequent steps.

    • PS: Ensure that Elasticsearch is running, either as a systemd process (Linux) or via the bat file (Windows).
    • PS: You will have to uncomment the function call createTheIndex() in cell 5 to generate the index, and indexData(10000) near the bottom of the file.
  • Execute all cells within evaluation.ipynb; this will perform the evaluation using both the baseline and advanced implementations.

    • PS: Uncomment the convertGlovetoGensim() function call in cell 5; this is necessary to allow Gensim to parse the GloVe embedding file.
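The GloVe text files lack the `<vocab_size> <dimensions>` header line that the word2vec text format requires, which is why a conversion step is needed before Gensim can load them. A stdlib-only sketch of that conversion — the notebook's convertGlovetoGensim() presumably does something similar, e.g. via Gensim's glove2word2vec helper:

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Prepend the '<vocab_size> <dimensions>' header so the GloVe file
    becomes loadable as word2vec text format. Fine for the ~400k-line
    glove.6B files; a streaming two-pass version would scale further."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    dims = len(lines[0].split()) - 1  # first token on each line is the word
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {dims}\n")
        f.writelines(lines)

# Tiny demo file standing in for e.g. glove.6B.50d.txt
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "glove.txt")
dst = os.path.join(tmp, "glove_w2v.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nof 0.4 0.5 0.6\n")
add_word2vec_header(src, dst)
with open(dst, encoding="utf-8") as f:
    header = f.readline().strip()
# header → "2 3" (two words, three dimensions)
```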

Result

The achieved scores are summarized in the table below:

Method                   Accuracy  NDCG@5  NDCG@10
Strict Baseline          0.492     0.237   0.323
Lenient Baseline         0.492     0.312   0.414
Strict Word2Vec          0.522     0.280   0.367
Lenient Word2Vec         0.522     0.364   0.455
Strict LTR (pointwise)   0.776     0.731   0.754
Lenient LTR (pointwise)  0.776     0.753   0.780
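The NDCG@k columns reward placing correct types near the top of the ranked type list. A minimal sketch of the metric on binary relevance grades — the evaluation notebook may assign graded gains differently (e.g. for the lenient variant):

```python
import math

def dcg(gains):
    # Discounted cumulative gain: rel_i / log2(i + 1) at 1-indexed rank i
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(relevances, k):
    """NDCG@k for relevance grades listed in ranked order."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# A correct type at rank 3 instead of rank 1 halves the score here:
score = ndcg_at_k([0, 0, 1], k=3)  # → 0.5
```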

Contributors


License

Distributed under the GPL-3.0 License. See LICENSE for more information.


Contact

Bernt Andreas Eide

Stian Seglem Bjåland