This document is intended to serve as a reference for possible data sources and parametrization for the training of Transformer language models. The focus is on models that have been trained specifically for the German language. Other languages, as well as combinations of several languages ("multilingual"), are not considered.
As with all machine learning and deep learning systems, the selection of suitable data is the most important basis for everything that follows. Datasets that have been used so far are therefore briefly presented.
The creation of Transformer-based Natural Language Processing (NLP) models consists of two stages: the generation of a general language model (pre-training) and the specialization for concrete use cases (domain-specific tasks, also known as fine-tuning). Both types of models are therefore presented. In some cases, it is worthwhile to further train existing models and thus optimize them for individual purposes. The lists available here are meant to provide a basis for making a well-reasoned decision.
The lists represent the data available as of 2021/02/01.
The vast majority of Transformer training approaches are open source and can be performed by any machine learning enthusiast. For this reason, it is almost impossible to know on which servers around the globe publicly available Transformer models are located.
To cut down the effort of searching for models, the official list of Hugging Face is used as a data basis. With its open-source library "Hugging Face Transformers", the company has created a recognized platform for the training and documentation of Transformer models. The company raised $15 million in funding and aims to create the "definitive natural language library" (last visited: 2021/02/01).
Queries for models in Hugging Face's list are limited to German models via two methods: either the model name contains one of the terms german, deutsch, -de, de-, deu, ger, or the official tags that can be defined for each model contain these terms. The resulting list is then checked manually and false positives are sorted out. Despite the restriction to these partial terms, it cannot be ruled out that German-language models stay under the radar because their names and tags contain no indication of their classification.
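The name-based part of this query can be sketched as a simple substring filter. This is a simplified illustration, not the exact query run against the Hugging Face list: the real procedure also inspects the official model tags, and the model names below are made-up examples.

```python
# Partial terms used to flag a model name as (probably) German.
TERMS = ["german", "deutsch", "-de", "de-", "deu", "ger"]

def looks_german(model_name: str) -> bool:
    """Return True if the model name contains one of the partial terms."""
    name = model_name.lower()
    return any(term in name for term in TERMS)

# Illustrative model names (not the actual queried list).
models = [
    "bert-base-german-cased",
    "dbmdz/bert-base-german-uncased",
    "roberta-base",   # contains no German hint: would slip under the radar
    "gpt2-medium",
]

candidates = [m for m in models if looks_german(m)]
print(candidates)  # only the names containing a partial term remain
```

The resulting candidate list still has to be checked manually, since short fragments such as -de or ger can also match non-German names (false positives).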
Models that are optimized for several languages at the same time ("multilingual") are currently not considered, since optimizations specific to the German language are unlikely in them.
Whenever the symbol * appears in one of the tables, the given information was not explicitly mentioned in the description, but could be retrieved by investigation: for example, when an author writes something like "We have used the same configuration as model FOO", or when a quick peek into one of the associated configuration files allows a statement to be made. If ? appears, there is no way to determine this information without the help of the model creator. Anything written in quotation marks ("") is a quote from the documentation. Quotations are mainly used when the author of a model uses an informal spelling.
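The "quick peek" into a configuration file can be sketched as follows. The field names follow the usual layout of a Hugging Face config.json; the concrete values here are invented for the sketch and do not describe any particular model.

```python
import json

# Illustrative config.json content as typically shipped with a
# Hugging Face model (values made up for this sketch).
raw = """
{
  "model_type": "bert",
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "vocab_size": 30000
}
"""

config = json.loads(raw)

# Even when the README stays silent, parameters such as the layer count
# can be read off directly and reported in the tables with the * marker.
print(f'{config["num_hidden_layers"]} layers, hidden size {config["hidden_size"]}')
```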
Primarily the description, which is supplied in the form of a README file, is evaluated. In addition, the model files (e.g. config.json, vocab.txt, ...) are analysed. Papers, blogs and other publications are also used when available.
This project is a collaboration between the Technical University of Applied Sciences Wildau (TH Wildau) and sense.ai.tion GmbH. You can contact us via:
- Philipp Müller (M.Eng.); Author
- Prof. Dr. Janett Mohnke; TH Wildau
- Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH
This project was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege" (ProFIT: natural-language dialogue assistants in care).