anafisa/Text2Text-Transformer

Large Multi-Language Models for News Translation

  • This repo contains examples of how to fine-tune Large Language Models (LLMs) and apply them to a real task: news translation.
  • The repo also provides a news parser, so you can parse any news web page (for example, CNN or BBC News) and test how a pre-trained LLM translates real news articles.
Screenshot 2023-12-18 at 14 48 37
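
The parsing step can be illustrated with a minimal sketch. This is not the repo's parser: it assumes the article body sits in plain `<p>` tags, and the names `ParagraphExtractor`, `extract_paragraphs`, and `fetch_html` are hypothetical. Real news sites usually need site-specific selectors.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def _flush(self):
        # Join buffered chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.paragraphs


def fetch_html(url, timeout=10.0):
    """Download a page with the standard library (assumes the site allows it)."""
    from urllib.request import Request, urlopen

    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A typical call would be `extract_paragraphs(fetch_html(url))`, with each returned paragraph then passed to a translation model.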

1. Facebook: M2M100

Facebook: M2M100 (1.2B parameters) is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. It covers 100 languages.

All available languages: Afrikaans, Amharic, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Breton, Bosnian, Catalan; Valencian, Cebuano, Czech, Welsh, Danish, German, Greek, English, Spanish, Estonian, Persian, Fulah, Finnish, French, Western Frisian, Irish, Gaelic; Scottish Gaelic, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Haitian; Haitian Creole, Hungarian, Armenian, Indonesian, Igbo, Iloko, Icelandic, Italian, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Luxembourgish; Letzeburgesch, Ganda, Lingala, Lao, Lithuanian, Latvian, Malagasy, Macedonian, Malayalam, Mongolian, Marathi, Malay, Burmese, Nepali, Dutch; Flemish, Norwegian, Northern Sotho, Occitan (post 1500), Oriya, Panjabi; Punjabi, Polish, Pushto; Pashto, Portuguese, Romanian; Moldavian; Moldovan, Russian, Sindhi, Sinhala; Sinhalese, Slovak, Slovenian, Somali, Albanian, Serbian, Swati, Sundanese, Swedish, Swahili, Tamil, Thai, Tagalog, Tswana, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wolof, Xhosa, Yiddish, Yoruba, Chinese, Zulu.
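
Translating with M2M100 could look roughly like the sketch below, which uses the Hugging Face `transformers` classes for this model. It is an illustration under assumptions, not necessarily how this repo calls the model: the checkpoint id `facebook/m2m100_1.2B`, the helper names `lang_code` and `translate`, and the small `LANG_CODES` table (a subset of the languages above) are choices made for this example.

```python
# Subset of M2M100 language codes, keyed by language name (illustrative only).
LANG_CODES = {
    "English": "en",
    "Russian": "ru",
    "French": "fr",
    "German": "de",
    "Spanish": "es",
    "Chinese": "zh",
}


def lang_code(name):
    """Map a language name to its M2M100 code (hypothetical helper)."""
    key = name.strip().title()
    if key not in LANG_CODES:
        raise ValueError(f"No code registered for language: {name!r}")
    return LANG_CODES[key]


def translate(text, src, tgt, model_name="facebook/m2m100_1.2B"):
    """Translate text from src to tgt; downloads the checkpoint on first use."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # The source language is set on the tokenizer; the target language is
    # forced as the first generated token.
    tokenizer.src_lang = lang_code(src)
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(lang_code(tgt)),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

For example, `translate("The weather is nice today.", "English", "Russian")` would return the Russian translation as a string. Loading the model once and reusing it is preferable when translating many articles.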

2. Google: mT5

Google: mT5 (1.2B parameters) is pretrained on the mC4 corpus, which covers 101 languages.

All available languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
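
Because mT5 is only pretrained on mC4, it has to be fine-tuned before it can translate. The sketch below shows one way the training pairs might be prepared. The task prefix format, the checkpoint id `google/mt5-large`, and the helper names `make_example` and `tokenize_example` are assumptions for illustration; the repo's own fine-tuning code may differ.

```python
def make_example(source_text, target_text, src_lang="English", tgt_lang="Russian"):
    """Build one text-to-text training pair (hypothetical helper).

    The prefix is a convention chosen at fine-tuning time; mT5 has no
    built-in translation prefix.
    """
    prefix = f"translate {src_lang} to {tgt_lang}: "
    return {
        "input_text": prefix + source_text.strip(),
        "target_text": target_text.strip(),
    }


def tokenize_example(example, model_name="google/mt5-large", max_length=256):
    """Tokenize a pair into model inputs and labels; downloads the tokenizer."""
    # Imported lazily so make_example works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    features = tokenizer(
        example["input_text"], max_length=max_length, truncation=True
    )
    labels = tokenizer(
        text_target=example["target_text"], max_length=max_length, truncation=True
    )
    features["labels"] = labels["input_ids"]
    return features
```

The resulting features can be fed to a sequence-to-sequence training loop. The same prefix must be used at inference time so the fine-tuned model sees inputs in the format it was trained on.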

Screenshot 2023-12-19 at 11 54 35