prats0599/hindi-nlp
This repo builds state-of-the-art language modelling and text classification architectures for Hindi. We use ULMFiT, which applies transfer learning to AWD-LSTMs to build strong classifiers in any language. The idea is to first train a language model that predicts the next word given a sequence of words in the target language (Hindi here): we extract Wikipedia articles in that language and train the language model on them, fine-tune that language model on the text of the classification dataset, then save the encoder and reuse it (with fully connected layers added on top) as our classifier for sentiment analysis.
To improve results, we build a second language model that predicts the previous word, i.e. we feed the data in reverse order and ask the model to predict word 1 given words n to 2, where n > 2. We follow the same steps as above and, after training both classification models, simply ensemble the two and voila!
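The ensembling step above can be sketched as averaging the class probabilities predicted by the forward and backward classifiers and taking the argmax. This is a minimal illustration with hypothetical probability lists, not the repo's exact notebook code:

```python
def ensemble(probs_fwd, probs_bwd):
    """Combine forward- and backward-model predictions for one document.

    probs_fwd, probs_bwd: per-class probabilities from the two classifiers.
    Returns the index of the class with the highest averaged probability.
    """
    avg = [(f + b) / 2 for f, b in zip(probs_fwd, probs_bwd)]
    return max(range(len(avg)), key=avg.__getitem__)
```

For example, if the forward model slightly prefers class 0 but the backward model strongly prefers class 1, the average can flip the final label, which is exactly how the ensemble recovers from one model's mistakes.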

Datasets

  1. BBC News Articles : Sentiment analysis corpus for Hindi documents extracted from BBC news website.

  2. IITP Product Reviews : Sentiment analysis corpus for product reviews posted in Hindi.

  3. IITP Movie Reviews : Sentiment analysis corpus for movie reviews posted in Hindi.

Notebooks

  • nn-hindi: Code for Model 1 and the ensembling.
  • nn-hindi-bwd: Code to train the model that predicts text backwards (Model 2).
  • bbc-hindi: Same code as nn-hindi.ipynb, but run on the bbc-articles dataset only.

Results

Language model perplexity (on a randomly split validation set)

| Architecture | Dataset | Perplexity |
| --- | --- | --- |
| ULMFiT | Wikipedia-hi | 30.17 |
| ULMFiT | Wikipedia-hi (backwards) | 29.25 |
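Perplexity is simply the exponential of the language model's cross-entropy loss on the validation set, so it converts directly from the loss a training loop reports (the loss value below is illustrative, not from the repo):

```python
import math

def perplexity(ce_loss):
    # Perplexity = exp(cross-entropy loss); lower is better.
    # A perplexity of ~30 means the model is, on average, as uncertain
    # as if it were choosing uniformly among ~30 next words.
    return math.exp(ce_loss)

# e.g. a validation loss of ~3.407 corresponds to perplexity ~30.17
```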

Classification metrics (on test set)

| Dataset | Accuracy (Model 1) | MCC (Model 1) | Accuracy (Model 2) | MCC (Model 2) | Accuracy (ensemble) | MCC (ensemble) |
| --- | --- | --- | --- | --- | --- | --- |
| BBC Articles (14 classes) | 79.79 | 72.58 | 78.75 | 71.15 | 84.39 | 79.13 |
| IITP Movie Reviews | 58.39 | 38.34 | 61.94 | 43.68 | – | – |
| IITP Product Reviews | 72.08 | 54.19 | 75.90 | 59.83 | – | – |

Just by ensembling, we have outperformed the classification benchmarks mentioned in this repository.

NOTE: The MCC metric in the table refers to the Matthews correlation coefficient.
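For a binary task, MCC is computed from the confusion-matrix counts; it ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 meaning no better than chance, and the table above appears to report MCC × 100. A minimal sketch:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from binary confusion-matrix counts:
    # (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC stays honest on imbalanced classes, which is why it is reported alongside accuracy here. (For the multi-class BBC task, a generalised multi-class MCC such as scikit-learn's `matthews_corrcoef` would be used instead.)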

Download Pretrained models

The pretrained language models (both forward and backward) are available to download here.

Future Work

  • Train second model on IITP movie reviews and product reviews datasets.
  • Ensemble the other two models
  • Make a separate notebook for each dataset.
  • Experiment using transformers instead of LSTMs and compare results.

The full article on how to create your own SOTA model for language modelling & sentiment analysis is available here.

About

State-of-the-art text classification and language modelling in Hindi.
