Replies: 2 comments
-
@pyf98, can you answer it?
-
Hi, thanks for the question. We used two-letter codes for v1 and v2, then switched to three-letter codes for v3 in order to support more languages. You can find the token vocabulary in each model: see the config file of the uploaded model on Hugging Face.
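As the reply notes, the language tokens can be read from the model's token vocabulary. Below is a minimal sketch of separating two-letter (v1/v2) from three-letter (v3) language tokens; the sample token list and the regex patterns are illustrative assumptions, not OWSM's actual vocabulary.

```python
import re

# Illustrative token list; a real OWSM vocabulary comes from the model's
# config file on Hugging Face and also contains non-language special tokens.
tokens = ["<eng>", "<deu>", "<jpn>", "<en>", "<de>"]

# Language tokens are angle-bracketed lowercase codes:
# two letters in OWSM v1/v2, three letters in v3.
two_letter = [t for t in tokens if re.fullmatch(r"<[a-z]{2}>", t)]
three_letter = [t for t in tokens if re.fullmatch(r"<[a-z]{3}>", t)]

print(two_letter)    # ['<en>', '<de>']
print(three_letter)  # ['<eng>', '<deu>', '<jpn>']
```

Note that other special tokens (e.g. a hypothetical `<sos>`) would also match a three-letter pattern, so in practice you would check candidates against a known list of ISO 639 codes rather than rely on length alone.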
-
I am confused about the language codes in OWSM models. The token list in the OWSM models on Hugging Face has three-letter codes (<eng>, <deu>), but the data preparation scripts under egs2/mixed_v2/ (e.g. this one) use two-letter codes (<en>, etc.). Can you clarify this? Which one should we use when finetuning an OWSM model on custom data?