
IndoELECTRA

IndoELECTRA: Pre-Trained Language Model for Indonesian Language Understanding

Overview

ELECTRA is a new method for self-supervised language representation learning. This repository contains a pre-trained ELECTRA Base model (TensorFlow 1.15.0) trained on a large Indonesian corpus (~16 GB of raw text, ~2B Indonesian words).

According to the authors' description:

Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, in the below figure, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same.

[Figure: the ELECTRA idea — a generator replaces some input tokens and the discriminator predicts which tokens were replaced]
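
To make the objective concrete, here is a minimal sketch (toy data, not the actual training code) of the per-token labels the discriminator learns to predict, using the "cooked"/"ate" example from the quote above:

```python
# A minimal sketch of replaced token detection, using the "cooked" -> "ate"
# example from the quote above. The token lists are toy data.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # plausible but fake replacement

# The discriminator is trained to predict, for every position, whether the
# token was kept (0) or replaced by the generator (1).
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0]
```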

Requirements

  • Python 3
  • TensorFlow 1.15 (although we hope to support TensorFlow 2.0 at a future date)
  • NumPy
  • scikit-learn and SciPy (for computing some evaluation metrics).

All models are trained using the same tokenizer as BERT (the BERT WordPiece tokenizer). The vocabulary file was built using the WordPiece library.
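
For example, the vocabulary can be loaded with the standard BERT tokenizer from the Transformers library. This is a minimal sketch; the `vocab.txt` file name and the lower-casing setting are assumptions about the released checkpoint:

```python
from transformers import BertTokenizer

# Minimal sketch, assuming the checkpoint ships a WordPiece vocabulary file
# named "vocab.txt"; the file name and lower-casing are assumptions.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=True)

print(tokenizer.tokenize("saya sedang belajar pemrosesan bahasa alami"))
```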

IndoELECTRA Pre-Trained Models

  • The TensorFlow model can be downloaded here
  • The PyTorch model can be downloaded here, or used directly through the Transformers library provided by Hugging Face (see the sketch below)
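
A minimal usage sketch with the Transformers library; the model identifier below is a placeholder, so substitute the actual IndoELECTRA ID published on the Hugging Face hub (or a local path to the converted PyTorch weights):

```python
import torch
from transformers import ElectraModel, ElectraTokenizer

# Placeholder model ID — replace with the published IndoELECTRA identifier
# on the Hugging Face hub, or a local path to the PyTorch checkpoint.
model_id = "ChristopherA08/IndoELECTRA"

tokenizer = ElectraTokenizer.from_pretrained(model_id)
model = ElectraModel.from_pretrained(model_id)

inputs = tokenizer("saya suka membaca buku", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```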

Training

Please follow the original (upstream) ELECTRA repository for instructions on training the model.