ByteLevelBPETokenizer output seems weird #203

Open
seyyaw opened this issue Mar 24, 2020 · 2 comments

seyyaw commented Mar 24, 2020

I am using ByteLevelBPETokenizer to train a custom tokenizer for Amharic (a low-resource language).

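from tokenizers import ByteLevelBPETokenizer

# paths is the list of plain-text training files (defined elsewhere)
tokenizer = ByteLevelBPETokenizer(lowercase=False)

tokenizer.train(files=paths, vocab_size=32000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])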

The merges.txt and vocab.json files I obtained are not human-readable:

áĪ ħ
áĪ °
ĠáĬ¥ áĬķ
ĠáĬ ¨
Ġáĭ Ń
áĬ Ń
ĠáĪ Ī
áį į

Encoding also produces the same unreadable output:

output = tokenizer.encode("አበበ በሶ በላ። ጫላ ጩቤ ጨበጠ፡፡")
print(output.ids, output.tokens, output.offsets)
>>>[0, 319, 5739, 2883, 4037, 303, 1631, 299, 5173, 506, 748, 11918, 363, 2] ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį¢', 'ĠáĮ«', 'áĪĭ', 'ĠáĮ©', 'áī¤', 'ĠáĮ¨', 'áīłáĮł', 'áį¡áį¡', '</s>'] [(0, 0), (0, 3), (3, 9), (9, 16), (16, 23), (23, 26), (26, 30), (30, 33), (33, 37), (37, 40), (40, 44), (44, 50), (50, 56), (0, 0)]

Is this the expected behavior? I will later use this tokenizer to train a RoBERTa model with the run_language_modeling.py script.

Thanks

taesiri commented Mar 27, 2020

I also have a similar issue with Persian texts.

n1t0 (Member) commented Mar 27, 2020

Hey @seyyaw, @taesiri.

TL;DR: This is how byte-level BPE works. The main advantages are:

  • Smaller vocabularies
  • No unknown token

This is totally expected behavior. Byte-level BPE converts each Unicode code point into one or more byte-level characters:

  1. Each Unicode code point is decomposed into its UTF-8 bytes (1 byte for ASCII characters, and up to 4 bytes for other code points).
  2. Each byte value is then assigned a "visible" character, taken from the beginning of the Unicode table. This matters because many byte values correspond to control characters or whitespace, so a simple ASCII character <-> byte value mapping is not enough. Those bytes get a different representation: for example, the space character U+0020 becomes Ġ.
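
The two steps above can be sketched in plain Python. This is only an illustration, assuming the same byte-to-character table as GPT-2's reference encoder (printable bytes keep their own character, all others are shifted to U+0100 and above); the names bytes_to_unicode, BYTE_TO_CHAR and to_byte_level are made up for this example and are not part of the tokenizers API.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that already correspond to a printable, non-space character keep it.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte (control characters, space, ...) is shifted to U+0100 and up.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Step 1: UTF-8 encode. Step 2: replace each byte by its visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level("አ"))   # áĬł  (3 UTF-8 bytes: E1 8A A0)
print(to_byte_level(" በ"))  # Ġáīł (the leading space becomes Ġ)
```

Under that assumption, each Amharic character occupies three UTF-8 bytes, which is why every entry in merges.txt and vocab.json looks like a run of accented Latin letters.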

The point of doing this is that you end up with an initial alphabet of exactly 256 tokens. These 256 tokens can then be merged to represent any other token in the vocabulary. The result is a smaller vocabulary that never needs an "unknown" token.
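
To check that nothing is lost, the byte-level tokens can be mapped back to text. The sketch below again assumes the GPT-2 style table; char_to_byte_table, CHAR_TO_BYTE and readable are hypothetical helper names for this example, and the token list is copied from the encode() output in the original report (without <s> and </s>).

```python
def char_to_byte_table():
    """Inverse table: visible character -> byte value (GPT-2 style, assumed)."""
    keep = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))
    table, shift = {}, 0
    for b in range(256):
        if b in keep:
            table[chr(b)] = b            # printable bytes map to themselves
        else:
            table[chr(256 + shift)] = b  # the rest were shifted past U+0100
            shift += 1
    return table


CHAR_TO_BYTE = char_to_byte_table()


def readable(tokens):
    """Concatenate byte-level tokens and decode the underlying UTF-8 bytes."""
    raw = bytes(CHAR_TO_BYTE[c] for tok in tokens for c in tok)
    return raw.decode("utf-8", errors="replace")


tokens = ['áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį¢', 'ĠáĮ«', 'áĪĭ',
          'ĠáĮ©', 'áī¤', 'ĠáĮ¨', 'áīłáĮł', 'áį¡áį¡']
print(readable(tokens))  # አበበ በሶ በላ። ጫላ ጩቤ ጨበጠ፡፡
```

In practice there should be no need to do this by hand: calling tokenizer.decode(output.ids) on the trained tokenizer is expected to return the readable text, so the odd-looking vocabulary does not affect later model training.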

n1t0 changed the title from "ByteLevelBPETokenizer encode Amharic tokens wrongly" to "ByteLevelBPETokenizer output seems weird" on Mar 27, 2020
n1t0 closed this as completed on Mar 27, 2020
n1t0 reopened this on May 7, 2020
huggingface locked as resolved and limited conversation to collaborators on May 7, 2020
n1t0 pinned this issue on May 7, 2020