This document is intended to serve as a reference for possible data sources and parametrization for the training of Transformer language models. The focus is on models that have been trained specifically for the German language. Other languages, as well as combinations of several languages ("multilingual"), are not considered.
As with all machine learning and deep learning systems, the selection of suitable data is the most important basis for everything that follows. Datasets that have been used so far are therefore briefly presented.
The creation of Transformer-based Natural Language Processing (NLP) models consists of two stages: the generation of a general language model (pre-training) and the specialization for concrete use cases (domain-specific tasks, also known as fine-tuning). Both types of models are therefore presented. In some cases, it is worthwhile to further train existing models and thus optimize them for individual purposes. The lists available here are meant to provide a basis for making a well-reasoned decision.
The lists represent the data available as of 2021/02/01.
The vast majority of Transformer training approaches are open source and can be performed by any machine learning enthusiast. For this reason, it is almost impossible to know on which servers around the globe publicly available Transformer models are located.
To cut down the effort of searching for models, the official list of Hugging Face is used as a data basis. With its open-source library "Hugging Face Transformers", the company has created a recognized platform for the training and documentation of Transformer models. The company raised $15 million in funding and aims to create the "definitive natural language library" (last visited: 2021/02/01).
Queries for models in Hugging Face's list are limited to German models via two methods: either the model name contains one of the terms german, deutsch, -de, de-, deu, ger, or the official tags that can be defined for each model contain these terms. The resulting list is then checked manually and false positives are sorted out. Despite the restriction to these partial terms, it cannot be ruled out that German-language models stay under the radar because their names and tags contain no indication of their classification.
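The name-based part of this query can be sketched as a simple substring filter. This is a simplified illustration, not the exact query run against the Hugging Face list: the real procedure also inspects the official model tags, and the model names below are made-up examples.

```python
# Partial terms used to flag a model name as (probably) German.
TERMS = ["german", "deutsch", "-de", "de-", "deu", "ger"]

def looks_german(model_name: str) -> bool:
    """Return True if the model name contains one of the partial terms."""
    name = model_name.lower()
    return any(term in name for term in TERMS)

# Illustrative model names (not the actual queried list).
models = [
    "bert-base-german-cased",
    "dbmdz/bert-base-german-uncased",
    "roberta-base",   # contains no German hint: would slip under the radar
    "gpt2-medium",
]

candidates = [m for m in models if looks_german(m)]
print(candidates)  # only the names containing a partial term remain
```

The resulting candidate list still has to be checked manually, since short fragments such as -de or ger can also match non-German names (false positives).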
Models that are optimized for several languages at the same time ("multilingual") are currently not considered, since optimizations specific to the German language are unlikely in them.
Whenever the symbol * appears in one of the tables, the given information was not explicitly mentioned in the description, but could be retrieved by investigation: for example, when an author writes something like "We have used the same configuration as model FOO", or when a quick peek into one of the associated configuration files allows a statement to be made. If ? appears, there is no way to determine this information without the help of the model creator. Anything written in quotation marks ("") is a quote from the documentation. Quotations are mainly used when the author of a model uses an informal spelling.
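The "quick peek" into a configuration file can be sketched as follows. The field names follow the usual layout of a Hugging Face config.json; the concrete values here are invented for the sketch and do not describe any particular model.

```python
import json

# Illustrative config.json content as typically shipped with a
# Hugging Face model (values made up for this sketch).
raw = """
{
  "model_type": "bert",
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "num_attention_heads": 12,
  "vocab_size": 30000
}
"""

config = json.loads(raw)

# Even when the README stays silent, parameters such as the layer count
# can be read off directly and reported in the tables with the * marker.
print(f'{config["num_hidden_layers"]} layers, hidden size {config["hidden_size"]}')
```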
Primarily the description, which is supplied in the form of a README file, is evaluated. In addition, the model files (e.g. config.json, vocab.txt, ...) are analysed. Papers, blogs and other publications are also used when available.
This project is a collaboration between the Technical University of Applied Sciences Wildau (TH Wildau) and sense.ai.tion GmbH. You can contact us via:
- Philipp Müller (M.Eng.); Author
- Prof. Dr. Janett Mohnke; TH Wildau
- Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH
This project was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege" (ProFIT: natural-language dialogue assistants in care).