prats0599/hindi-nlp
This repo builds state-of-the-art language modelling and text classification architectures for Hindi. We use ULMFiT, which applies transfer learning to AWD-LSTMs to build strong classifiers in any language. The idea is to first train a language model that predicts the next word given a sequence of words in the target language (Hindi here): we extract Wikipedia articles in that language and train the language model on them, fine-tune that language model on the text of the classification dataset, then save the encoder and reuse it (with fully connected layers added on top) as our classifier for sentiment analysis.
To improve results, we build a second language model that predicts the previous word, i.e. we feed the data in reverse order and ask the model to predict word 1 given words n to 2, where n > 2. We follow the same steps as above and, after training both classification models, simply ensemble the two and voila!
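The ensembling step above can be sketched as averaging the class probabilities predicted by the forward and backward classifiers and taking the argmax. This is a minimal illustration with hypothetical probability lists, not the repo's exact notebook code:

```python
def ensemble(probs_fwd, probs_bwd):
    """Combine forward- and backward-model predictions for one document.

    probs_fwd, probs_bwd: per-class probabilities from the two classifiers.
    Returns the index of the class with the highest averaged probability.
    """
    avg = [(f + b) / 2 for f, b in zip(probs_fwd, probs_bwd)]
    return max(range(len(avg)), key=avg.__getitem__)
```

For example, if the forward model slightly prefers class 0 but the backward model strongly prefers class 1, the average can flip the final label, which is exactly how the ensemble recovers from one model's mistakes.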

Datasets

  1. BBC News Articles : Sentiment analysis corpus for Hindi documents extracted from BBC news website.

  2. IITP Product Reviews : Sentiment analysis corpus for product reviews posted in Hindi.

  3. IITP Movie Reviews : Sentiment analysis corpus for movie reviews posted in Hindi.

Notebooks

  • nn-hindi: Code for Model 1 and the ensembling.
  • nn-hindi-bwd: Code to train the model that predicts text backwards (Model 2).
  • bbc-hindi: Same code as nn-hindi.ipynb, but run on the bbc-articles dataset only.

Results

Language model perplexity (on a randomly split validation set)

| Architecture | Dataset | Perplexity |
| --- | --- | --- |
| ULMFiT | Wikipedia-hi | 30.17 |
| ULMFiT | Wikipedia-hi (backwards) | 29.25 |
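Perplexity is simply the exponential of the language model's cross-entropy loss on the validation set, so it converts directly from the loss a training loop reports (the loss value below is illustrative, not from the repo):

```python
import math

def perplexity(ce_loss):
    # Perplexity = exp(cross-entropy loss); lower is better.
    # A perplexity of ~30 means the model is, on average, as uncertain
    # as if it were choosing uniformly among ~30 next words.
    return math.exp(ce_loss)

# e.g. a validation loss of ~3.407 corresponds to perplexity ~30.17
```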

Classification metrics (on test set)

| Dataset | Accuracy (Model 1) | MCC (Model 1) | Accuracy (Model 2) | MCC (Model 2) | Accuracy (ensemble) | MCC (ensemble) |
| --- | --- | --- | --- | --- | --- | --- |
| BBC Articles (14 classes) | 79.79 | 72.58 | 78.75 | 71.15 | 84.39 | 79.13 |
| IITP Movie Reviews | 58.39 | 38.34 | 61.94 | 43.68 | – | – |
| IITP Product Reviews | 72.08 | 54.19 | 75.90 | 59.83 | – | – |

Just by ensembling, we have outperformed the classification benchmarks mentioned in this repository.

NOTE: The MCC metric in the table refers to the Matthews correlation coefficient.
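For a binary task, MCC is computed from the confusion-matrix counts; it ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 meaning no better than chance, and the table above appears to report MCC × 100. A minimal sketch:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from binary confusion-matrix counts:
    # (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC stays honest on imbalanced classes, which is why it is reported alongside accuracy here. (For the multi-class BBC task, a generalised multi-class MCC such as scikit-learn's `matthews_corrcoef` would be used instead.)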

Download Pretrained models

The pretrained language models (both forward and backward) are available to download here.

Future Work

  • Train second model on IITP movie reviews and product reviews datasets.
  • Ensemble the other two models
  • Make a separate notebook for each dataset.
  • Experiment using transformers instead of LSTMs and compare results.

The full article on how to create your own SOTA model for language modelling & sentiment analysis is available here.

About

State-of-the-art text classification and language modelling in Hindi.
