DeepScite takes in papers (titles and abstracts) and emits recommendations on whether or not they should be scited by the particular user whose data we've used for training (in the case of this repo, that's me).
As output, it also gives a "goodness" score for each word: when this score is high, the word has contributed strongly to the paper being recommended for sciting; when it is negative, the word has contributed strongly to the paper not being recommended.
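As a sketch of the idea only (this is not the repo's actual model code; the function name, the dictionary representation, and the zero threshold are all assumptions), a recommendation could be derived by summing per-word scores:

```python
# Illustrative sketch: DeepScite's real decision comes from a learned model;
# here we only show how per-word "goodness" scores might aggregate into a
# yes/no recommendation.

def recommend(word_scores, threshold=0.0):
    """Recommend sciting when the summed word scores exceed the (assumed) threshold."""
    return sum(word_scores.values()) > threshold

paper = {"quantum": 1.2, "entanglement": 0.8, "survey": -0.5}
print(recommend(paper))  # 1.2 + 0.8 - 0.5 = 1.5 > 0, so True
```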
Below are some example outputs of the system:
Blue text marks words which are "good", and red text marks words which are "bad".
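The blue/red highlighting could be produced along these lines (a hypothetical rendering helper for illustration, not the code that actually generates the reports):

```python
# Hypothetical helper: wrap each word in a coloured HTML span based on the
# sign of its "goodness" score, mimicking the blue/red report output.

def colour_word(word, score):
    colour = "blue" if score > 0 else "red"
    return '<span style="color: {}">{}</span>'.format(colour, word)

words = [("deep", 0.9), ("learning", 0.7), ("survey", -0.4)]
print(" ".join(colour_word(w, s) for w, s in words))
```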
- Clone this repository:

  ```shell
  git clone https://github.com/silky/deep-scite.git
  ```
- Use conda (or virtualenv) to create an environment with Python 3.5:

  ```shell
  conda create -n deep-scite python=3.5
  ```
- Activate the environment:

  ```shell
  source activate deep-scite
  ```
- Install the requirements:

  ```shell
  pip install -r requirements.txt
  ```
- Install the `nltk` language packs. In order to tokenise strings, we use the `nltk` package. It requires us to download some data before use. To do so, run:

  ```shell
  python -c 'import nltk; nltk.download("punkt")'
  ```
- Install this library in `develop` mode:

  ```shell
  python setup.py develop
  ```
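For a rough sense of what the tokenisation step above produces, here is a naive pure-Python approximation (nltk's punkt tokenizer, which the project actually uses, is considerably more sophisticated; this is illustrative only):

```python
import re

def naive_tokenise(text):
    # Split into words and standalone punctuation marks; a crude stand-in
    # for nltk.word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenise("Deep learning, applied to papers."))
# ['Deep', 'learning', ',', 'applied', 'to', 'papers', '.']
```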
From the root directory of this project:
- Activate the `deep-scite` environment:

  ```shell
  source activate deep-scite
  ```
- Train the model on the `noon` data set, and emit recommendations:

  ```shell
  ./bin/run_model.py
  ```

  This will run through the steps defined in `model.yaml`.
- Open up `./data/noon/report.html` in your browser and observe the recommendations.
You can play around with the embedding by looking at it in TensorBoard. Run TensorBoard with:
```shell
tensorboard --logdir /tmp/tf-checkpoints/deepscite-noon
```
Then click on the "Embedding" tab.
![](images/embedding.gif)