Portuguese translation of the GLUE benchmark and Scitail dataset

ju-resplande/PLUE

PLUE: Portuguese Language Understanding Evaluation

https://fairytail.fandom.com/wiki/Plue

Portuguese translation of the GLUE benchmark, SNLI, and SciTail,
produced with OPUS-MT models and Google Cloud Translation.

Getting Started

Datasets                                   Translation Tool
CoLA, MRPC, RTE, SST-2, STS-B, and WNLI    Google Cloud Translation
SNLI, MNLI, QNLI, QQP, and SciTail         OPUS-MT

Usage

Datasets 🤗

from datasets import load_dataset

# Load one PLUE task from the Hugging Face Hub by its configuration name.
data = load_dataset("dlb/plue", "cola")
# Available configurations:
# ['cola', 'sst2', 'mrpc', 'qqp_v2', 'stsb', 'snli', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'qnli_v2', 'rte', 'wnli', 'scitail']
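
Because the configuration names do not always match the GLUE task names (for example, QQP is published only as qqp_v2), it can help to check a name before loading. The following is a minimal sketch, not part of PLUE: the helper load_plue and the constant PLUE_CONFIGS are hypothetical names, and the list is copied from the comment above. Depending on the installed version of the datasets library, load_dataset may need additional arguments.

```python
# Configuration names copied from the usage comment above.
PLUE_CONFIGS = [
    "cola", "sst2", "mrpc", "qqp_v2", "stsb", "snli", "mnli",
    "mnli_mismatched", "mnli_matched", "qnli", "qnli_v2", "rte",
    "wnli", "scitail",
]


def load_plue(config):
    """Hypothetical helper: validate the name, then defer to load_dataset."""
    if config not in PLUE_CONFIGS:
        raise ValueError(
            f"Unknown PLUE configuration {config!r}; "
            f"expected one of {PLUE_CONFIGS}"
        )
    # Imported lazily so the name check works without the datasets package.
    from datasets import load_dataset

    return load_dataset("dlb/plue", config)
```

With this helper, load_plue("qqp") raises a ValueError before any download is attempted, while load_plue("qqp_v2") proceeds to the regular load_dataset call.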

Manual download (for large files)

Larger files are not hosted in the GitHub repository.

Structure

├── code ____________ # translation code and dependency parsing  
├── datasets
│   ├── CoLA
│   ├── MNLI
│   ├── MRPC
│   ├── QNLI
│   ├── QNLI_v2
│   ├── QQP_v2
│   ├── RTE
│   ├── SciTail
│   │   └── tsv_format
│   ├── SNLI
│   ├── SST-2
│   ├── STS-B
│   └── WNLI
└── pairs ____________ # translation pairs as JSON dictionary

Observations

  • GLUE provides two versions: first and second. We noticed that the versions differ only in the QNLI and QQP datasets, so we made QNLI available in both versions and QQP in the newest version only.
  • LX parser, Binarizer code, and the NLTK word tokenizer were used to create dependency parses for the SNLI and MNLI datasets.
  • The SNLI train split is a ragged matrix, so we made two versions of the data available: train_raw.tsv contains the irregular lines and train.tsv excludes them.
  • Manual translations were made for 12 sentences due to translation errors.
  • Our translation code is outdated. We recommend using other translation code instead.
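
The split between train_raw.tsv and train.tsv described above can be illustrated with a short sketch. It is not the code used to build PLUE. It assumes that an "irregular" line is one whose number of tab-separated fields differs from the header's, which the repository does not state explicitly. The function name and the sample rows are invented for illustration.

```python
import csv
import io


def split_ragged_rows(tsv_text):
    """Return (header, regular_rows, irregular_rows).

    A row counts as regular when it has as many fields as the header.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader)
    regular, irregular = [], []
    for row in reader:
        if len(row) == len(header):
            regular.append(row)
        else:
            irregular.append(row)
    return header, regular, irregular


# Invented sample data, not taken from SNLI.
sample = (
    "premise\thypothesis\tlabel\n"
    "Um homem corre.\tUma pessoa se move.\tentailment\n"
    "Linha quebrada\tneutral\n"
    "Uma mulher lê.\tNinguém lê.\tcontradiction\n"
)
header, regular, irregular = split_ragged_rows(sample)
print(len(regular), len(irregular))  # prints "2 1"
```

Writing the regular rows to one file and keeping every row in another would mirror the train.tsv and train_raw.tsv pair under this assumption.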

Citing

@misc{Gomes2020,
  author = {GOMES, J. R. S.},
  title = {PLUE: Portuguese Language Understanding Evaluation},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ju-resplande/PLUE}},
  commit = {e7d01cb17173fe54deddd421dd735920964eb26f}
}

Acknowledgments

  • Deep Learning Brasil/CEIA
  • Cyberlabs