ByteLevelBPETokenizer output seems weird #203

seyyaw · 2020-03-24T06:53:06Z

I used ByteLevelBPETokenizer to train a custom tokenizer for Amharic, a low-resource language.

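from tokenizers import ByteLevelBPETokenizer

# `paths` is the list of training text files (not shown in this post)
tokenizer = ByteLevelBPETokenizer(lowercase=False)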

tokenizer.train(files=paths, vocab_size=32000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

The merges.txt and vocab.json files I obtained are not human-readable:

áĪ ħ
áĪ °
ĠáĬ¥ áĬķ
ĠáĬ ¨
Ġáĭ Ń
áĬ Ń
ĠáĪ Ī
áį į

Encoding also produces the same unreadable output:

output = tokenizer.encode("አበበ በሶ በላ። ጫላ ጩቤ ጨበጠ፡፡")
print(output.ids, output.tokens, output.offsets)
>>>[0, 319, 5739, 2883, 4037, 303, 1631, 299, 5173, 506, 748, 11918, 363, 2] ['<s>', 'áĬł', 'áīłáīł', 'ĠáīłáĪ¶', 'ĠáīłáĪĭ', 'áį¢', 'ĠáĮ«', 'áĪĭ', 'ĠáĮ©', 'áī¤', 'ĠáĮ¨', 'áīłáĮł', 'áį¡áį¡', '</s>'] [(0, 0), (0, 3), (3, 9), (9, 16), (16, 23), (23, 26), (26, 30), (30, 33), (33, 37), (37, 40), (40, 44), (44, 50), (50, 56), (0, 0)]

Is this the expected behavior? I will later use this tokenizer to train a RoBERTa model with the run_language_modeling.py script.

Thanks


taesiri · 2020-03-27T15:22:01Z

I also have a similar issue with Persian texts.

n1t0 · 2020-03-27T16:49:33Z

Hey @seyyaw, @taesiri.

TL;DR: This is how byte-level BPE works. The main advantages are:

Smaller vocabularies
No unknown token

This is totally expected behavior. Byte-level BPE converts every Unicode code point into one or more byte-level characters:

Each Unicode code point is decomposed into its UTF-8 bytes (1 byte for ASCII characters, and up to 4 bytes for other code points).
Each byte value is assigned a "visible" character taken from the beginning of the Unicode table. This matters because many byte values correspond to control characters, so a simple mapping between ASCII table characters and byte values is not possible. Some bytes therefore get other representations; for example, the space character U+0020 becomes Ġ.

The purpose is that you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that never need an "unknown" token.
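To make the two steps above concrete, here is a minimal pure-Python sketch of such a byte-to-character table. It follows the GPT-2 style mapping and is assumed, not guaranteed, to match what the library does internally; the names `byte_to_visible_table`, `to_visible` and `from_visible` are made up for this illustration and are not part of the tokenizers API.

```python
def byte_to_visible_table():
    """Map each of the 256 byte values to one printable Unicode character."""
    # Bytes that already correspond to a printable character keep it.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte (control characters, space, ...) is shifted to
    # U+0100 and above, in increasing byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


TABLE = byte_to_visible_table()
INVERSE = {ch: b for b, ch in TABLE.items()}


def to_visible(text):
    """UTF-8 encode the text, then replace each byte by its visible character."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))


def from_visible(visible):
    """Reverse the mapping: visible characters -> bytes -> original text."""
    return bytes(INVERSE[ch] for ch in visible).decode("utf-8")


print(len(TABLE))                          # 256
print(to_visible(" "))                     # Ġ
print(to_visible("አ"))                     # áĬł
print(from_visible(to_visible("አበበ በሶ")))  # አበበ በሶ
```

Note that 'áĬł' is exactly the second token in the encoding output posted above, so the "unreadable" tokens are just Amharic UTF-8 bytes shown through this table. When reversing by hand, join the tokens first: a single token may end in the middle of a multi-byte character and would not decode on its own.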

n1t0 changed the title ~~ByteLevelBPETokenizer encode Amharic tokens wrongly~~ ByteLevelBPETokenizer output seems weird Mar 27, 2020

n1t0 closed this as completed Mar 27, 2020

julien-c mentioned this issue Apr 8, 2020

ByteLevelBPETokenizer with Greek gives weird symbols. #223

Closed

n1t0 mentioned this issue May 7, 2020

encoding problem when training for Russian #254

Closed

n1t0 reopened this May 7, 2020

huggingface locked as resolved and limited conversation to collaborators May 7, 2020

n1t0 pinned this issue May 7, 2020
