Replies: 2 comments
-
@pyf98, can you answer it?
-
Hi, thanks for the question. We used two-letter codes for v1 and v2, then switched to three-letter codes for v3 in order to support more languages. You can find the token vocabulary in each model: see the config file of the uploaded model on Hugging Face.
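As the reply notes, the language tokens can be read from the model's token vocabulary. Below is a minimal sketch of separating two-letter (v1/v2) from three-letter (v3) language tokens; the sample token list and the regex patterns are illustrative assumptions, not OWSM's actual vocabulary.

```python
import re

# Illustrative token list; a real OWSM vocabulary comes from the model's
# config file on Hugging Face and also contains non-language special tokens.
tokens = ["<eng>", "<deu>", "<jpn>", "<en>", "<de>"]

# Language tokens are angle-bracketed lowercase codes:
# two letters in OWSM v1/v2, three letters in v3.
two_letter = [t for t in tokens if re.fullmatch(r"<[a-z]{2}>", t)]
three_letter = [t for t in tokens if re.fullmatch(r"<[a-z]{3}>", t)]

print(two_letter)    # ['<en>', '<de>']
print(three_letter)  # ['<eng>', '<deu>', '<jpn>']
```

Note that other special tokens (e.g. a hypothetical `<sos>`) would also match a three-letter pattern, so in practice you would check candidates against a known list of ISO 639 codes rather than rely on length alone.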
-
I am confused about the language codes in OWSM models. The token list in the OWSM models on Hugging Face has three-letter codes (<eng>, <deu>), but the data preparation scripts under egs2/mixed_v2/ (e.g. this one) use two-letter codes (<en>, etc.). Can you clarify this? Which one should we use when finetuning an OWSM model on custom data?