
On Long-Tailed Phenomena in NMT. Findings of EMNLP 2020.

Warning: The crux of the code, the Focal loss and Anti-Focal loss implementations, is available in the fairseq/criterions directory and can be used directly with fairseq. However, the end-to-end code is currently more of a code dump than a code release. Please wait for a (much better) cleaned-up version.

We use fairseq to train the models. Our code is tested on Ubuntu 18.04, with a Conda installation of Python 3.6.

git clone https://github.com/vyraun/long-tailed.git
cd long-tailed
pip install .

Other Repositories Used (thanks!):

fairseq (https://github.com/pytorch/fairseq)
compare-mt (https://github.com/neulab/compare-mt)

Steps to Replicate

Below are the steps to replicate each section of the paper.

Section 1: Train the Cross-Entropy Baseline Transformer

The scripts with the prefix 'run' provide the full pipeline, from data preparation to evaluation. For example:

bash run_iwslt14_de_en.sh

Compute the Spearman's rank correlation between token embedding norms and token frequencies:

python norm.py
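
norm.py computes this correlation; below is a minimal sketch of the computation, with hypothetical input arrays rather than the repo's actual data files:

# Minimal sketch of the norm-frequency correlation, not the repo's norm.py.
import numpy as np
from scipy.stats import spearmanr

norms = np.load("embedding_norms.npy")  # hypothetical: per-token embedding norms
freqs = np.load("token_freqs.npy")      # hypothetical: per-token training counts

rho, p_value = spearmanr(norms, freqs)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.2e})")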

Section 2: Characterizing the Long Tail

cd analysis
bash evaluate_splits.sh [model_dir]
bash evaluate_model_on_splits.sh [model_dir]
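
These scripts evaluate the model on test-set splits bucketed by token frequency. A rough sketch of the bucketing idea follows; the thresholds and file names are illustrative assumptions, not the scripts' actual values:

# Rough sketch of frequency bucketing for long-tail analysis; thresholds
# and file names are illustrative, not those used by the analysis scripts.
from collections import Counter

train_counts = Counter(tok for line in open("train.en") for tok in line.split())

def bucket(token):
    count = train_counts[token]
    if count >= 1000:
        return "head"
    if count >= 100:
        return "body"
    return "tail"  # the long tail: rare and unseen tokens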

The plot can be generated using compare-mt (https://github.com/neulab/compare-mt).

Section 3: Analyze Beam Search

bash evaluate.sh [model_dir] [data_dir]
python probs_new.py beam_search.pkl
python probs_all.py [beam_search_*.pkl]
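
These scripts aggregate the token probabilities recorded during beam search. A minimal sketch of reading such a pickle, assuming it holds a flat list of per-token probabilities (the actual structure in this repo may differ):

# Minimal sketch, assuming beam_search.pkl stores a flat list of per-token
# probabilities; the real pickle layout may differ.
import pickle
import numpy as np

with open("beam_search.pkl", "rb") as f:
    probs = np.asarray(pickle.load(f), dtype=float)

print(f"tokens: {probs.size}, mean prob: {probs.mean():.3f}, "
      f"share below 0.5: {(probs < 0.5).mean():.1%}")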

Section 4: Train Transformer using Focal and Anti-Focal Losses

The loss functions are implemented in the fairseq/criterions directory.
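
For reference, here is a minimal PyTorch sketch of the token-level focal loss (Lin et al., 2017) that these criterions build on; the function below is illustrative, not the repo's fairseq criterion, and the Anti-Focal loss modifies the modulating factor as described in the paper:

# Minimal sketch of token-level focal loss; see fairseq/criterions for the
# actual Focal and Anti-Focal implementations used in the paper.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (num_tokens, vocab_size); targets: (num_tokens,) gold indices
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # -log p_t
    p_t = (-nll).exp()                                              # p_t
    return ((1.0 - p_t) ** gamma * nll).sum()  # down-weights easy (high p_t) tokens

To train with each loss: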

bash run_iwslt14_de_fc.sh
bash run_iwslt14_de_afc.sh

Section 5: Tau Normalization Baseline

cd analysis
bash normalization.sh
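
Tau normalization rescales each vocabulary item's output-projection weight vector by its norm raised to the power tau, flattening the norm imbalance between frequent and rare tokens. A minimal sketch with illustrative names, not the normalization.sh pipeline:

# Minimal sketch of tau-normalization of an output-projection matrix;
# names are illustrative, not the repo's normalization.sh pipeline.
import torch

def tau_normalize(weight, tau=1.0, eps=1e-8):
    # weight: (vocab_size, hidden); each row w_i -> w_i / ||w_i||^tau
    norms = weight.norm(dim=1, keepdim=True).clamp_min(eps)
    return weight / norms.pow(tau)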

Citation

@inproceedings{raunak2020longtailed,
  title = {On Long-Tailed Phenomena in Neural Machine Translation},
  author = {Raunak, Vikas and Dalmia, Siddharth and Gupta, Vivek and Metze, Florian},
  booktitle = {Findings of EMNLP},
  year = 2020,
}
