Machine Translation and Multilinguality in Text Classification

Team members: Nesma Mahmoud (B87771), Mahmoud Kamel (B87770)

This project was carried out at the University of Tartu, Institute of Computer Science. It consists of two main parts:

Handling multilinguality in text classification
Expanding the available data with Round-trip-translation

Dataset:

multilingual-text-categorization-dataset: this dataset contains blog posts in 32 languages, categorized into 45 categories.

Categories: ['advertising', 'agriculture', 'animation', 'arts_and_crafts', 'entertainment', 'astrology', 'vehicles', 'games', 'books_and_literature', 'business', 'gambling', 'jobs', 'clothing', 'comic_books', 'dating', 'education', 'adult', 'food', 'health', 'hobbies_and_interests', 'humor', 'illegal_content', 'investing', 'jewelry', 'logistics', 'marketing', 'movies', 'music', 'hacking', 'media', 'finance', 'pets', 'politics', 'religion', 'sci_fi_and_fantasy', 'science', 'shopping', 'society', 'sports', 'tech', 'teens', 'television', 'travel', 'under_construction', 'weather']
Languages: ['english', 'albanian', 'arabic', 'bulgarian', 'chinese', 'croatian', 'czech', 'danish', 'dutch', 'estonian', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hungarian', 'icelandic', 'italian', 'japanese', 'korean', 'lithuanian', 'norwegian', 'polish', 'portuguese', 'romanian', 'russian', 'serbian', 'slovenian', 'spanish', 'swedish', 'turkish', 'ukrainian']

Project Scope:

1. Handling multilinguality in text classification:

In this part we try three different approaches to multilingual text classification and compare them. The three methods are:

Joint multilingual approach: all languages are classified together by a single classification system (which can also be an ensemble of multilingual models).
Joint translated monolingual approach: all languages are translated into one common language, most likely English, and then classified together.
Multiple monolingual approach: each language has a separate classification system trained on it.
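
The three set-ups above differ mainly in how the training data is arranged before any classifier is trained. The following is a minimal sketch of that arrangement, not the code used in this repository. It assumes each example is a `(language, text, label)` tuple; the function names and the `translate_to_english` callable are hypothetical placeholders for whatever translation service is used.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str]  # (language, text, label); assumed layout
Pair = Tuple[str, str]          # (text, label)


def joint_multilingual(examples: List[Example]) -> List[Pair]:
    """One training set with every language mixed together."""
    return [(text, label) for _, text, label in examples]


def joint_translated(examples: List[Example],
                     translate_to_english: Callable[[str, str], str]) -> List[Pair]:
    """One training set after translating every non-English text to English."""
    pairs = []
    for lang, text, label in examples:
        if lang != "english":
            # translate_to_english(text, source_language) is a hypothetical hook
            text = translate_to_english(text, lang)
        pairs.append((text, label))
    return pairs


def multiple_monolingual(examples: List[Example]) -> Dict[str, List[Pair]]:
    """One training set per language; each would get its own classifier."""
    per_language: Dict[str, List[Pair]] = defaultdict(list)
    for lang, text, label in examples:
        per_language[lang].append((text, label))
    return dict(per_language)
```

With this split, the first two functions each feed a single classifier, while the dictionary returned by `multiple_monolingual` would be iterated to train one classifier per key.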

2. Expanding the available data with Round-trip-translation:

This part involves testing how best to leverage the increased diversity that round-trip translation brings to the data.
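
As an illustration of the idea, round-trip translation sends a text to a pivot language and back, and the resulting paraphrase is added to the training data under the original label. The sketch below assumes a `translate(text, source_lang, target_lang)` callable, which is a hypothetical stand-in for a real translation service; the function names are also illustrative and not taken from this repository.

```python
from typing import Callable, List, Tuple

# (text, source_lang, target_lang) -> translated text; hypothetical interface
Translate = Callable[[str, str, str], str]


def round_trip(text: str, lang: str, pivot: str, translate: Translate) -> str:
    """Translate text from lang to pivot and back to lang."""
    return translate(translate(text, lang, pivot), pivot, lang)


def augment(dataset: List[Tuple[str, str]], lang: str, pivots: List[str],
            translate: Translate) -> List[Tuple[str, str]]:
    """Return the dataset plus round-trip paraphrases, keeping the labels.

    A paraphrase identical to a text already seen for the same example is
    dropped, since it adds no diversity.
    """
    augmented = list(dataset)
    for text, label in dataset:
        seen = {text}
        for pivot in pivots:
            paraphrase = round_trip(text, lang, pivot, translate)
            if paraphrase not in seen:
                seen.add(paraphrase)
                augmented.append((paraphrase, label))
    return augmented
```

Using several pivot languages is one way to obtain more than one paraphrase per example; whether that helps the classifier is exactly what this part of the project sets out to test.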

Frameworks:

1- Keras
2- AllenNLP
3- FLAIR

Experiments:

1- Keras experiments and results can be found here, except for some experiments that were run on the server:

The translation model results and code can be found here.
** The translation of the datasets to English using IBM Watson can be found here.
** The per-language datasets after translation can be found here.
** We also tried Google Translate, but we hit its usage limit and could not find a way around it. That is why we moved to IBM Watson for the translations, which also has a limit, but we managed to work around it.

The joint multilingual model results and code can be found here.

2- AllenNLP results are included in this notebook. To run AllenNLP from a configuration file, the following files are needed:

DataReader Class can be found here
Predictor Class can be found here
Model Class can be found here
Configuration file can be found here

3- Flair experiment results:

We used BERT through the Flair framework, but it failed because BERT only accepts input sequences of up to 512 tokens, and our dataset has articles that are longer than that. You can see the experiment here.
After that we tried reducing the number of tokens per article, which can be found here, but this also gave bad results, so we decided not to continue with Flair.
We used Flair stacked embeddings for English classification. We found that this consumes a lot of resources, so as a proof of concept (POC) we trained it on a subset of the English dataset, and it worked well for this subset. You can find the experiment here: This is the POC.
After that we trained it on the whole English dataset; the code is here.
However, this gave very strange results, which can be seen here.
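
One common alternative to cutting articles down to fit the 512-token limit is to split each article into overlapping windows and classify each window separately. The sketch below is only an illustration of that idea and is not what was done in these experiments. It assumes the article has already been tokenized into a list of strings; with a real BERT tokenizer the count would be in WordPiece tokens, and the function name and default values are assumptions.

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int = 510,
                       stride: int = 255) -> List[List[str]]:
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len defaults to 510 so that the two special tokens ([CLS] and [SEP])
    still fit within the 512-token limit.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        # Stop once a window has reached the end of the article.
        if start + max_len >= len(tokens):
            break
    return windows
```

The per-window predictions would then need to be combined into one label per article, for example by averaging class probabilities; which aggregation works best here is an open question.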

The visualization code for the graphs used in the blog post can be found here.

Blog post can be found here
