I'm training a new tokenizer on an Indic language, Tamil. I tried two different runs:
Test run with part of the data used for training (~0.3 GB):
from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

# Load a single shard of the verified Tamil split (~0.3 GB)
ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets',
                       data_files='verified/tam/data-0.parquet', split='train')

# Keep only the "text" column
ta_data = ta_data.remove_columns([
    col for col in ta_data.column_names if col != "text"
])

# Stream the corpus to the trainer in batches of rows
def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')
This gives me a vocab file with exactly 2000 tokens, and the merges are computed correctly: ta_vocab_pretok_2000.json
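For reference, the saved file can be sanity-checked after training. A minimal sketch (the path matches the save call above; the expected count is from this small-data run):

from tokenizers import Tokenizer

# Reload the trained tokenizer and check the final vocab size
tok = Tokenizer.from_file('./ta_vocab_pretok_2000.json')
print(tok.get_vocab_size())  # 2000 in this run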
Run with the entire data used for training (~15 GB):
from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

# Load all shards of the verified Tamil split (~15 GB)
ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets',
                       data_files='verified/tam/*', split='train')

# Keep only the "text" column
ta_data = ta_data.remove_columns([
    col for col in ta_data.column_names if col != "text"
])

# Stream the corpus to the trainer in batches of rows
def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000_full_data.json')
This gives me a much larger vocab file with no merges. The vocab count is ~5800, which ignores the vocab_size of 2000 I passed to the trainer: ta_vocab_pretok_2000_full_data.json
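The difference is visible directly in the two saved files. A minimal sketch comparing them (assuming both files above exist on disk; saved tokenizer JSON stores the BPE state under model.vocab and model.merges):

import json

# Compare vocab and merge counts of the two saved tokenizers
for path in ['./ta_vocab_pretok_2000.json', './ta_vocab_pretok_2000_full_data.json']:
    with open(path) as f:
        model = json.load(f)["model"]
    print(path, "vocab:", len(model["vocab"]), "merges:", len(model["merges"]))
# Small run: vocab 2000 with non-empty merges
# Full run: vocab ~5800 with an empty merges list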
Questions:
Why does the tokenizer ignore the vocab_size parameter in the trainer?
Where are the non-Tamil tokens coming from? The emoji, Greek, Arabic, and other-language tokens? (A quick way to probe this is sketched below.)
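One way to probe the second question is to scan a sample of the corpus for characters outside ASCII and the Tamil Unicode block (U+0B80–U+0BFF); any such character ends up in the trained alphabet. A rough sketch reusing ta_data from above:

from collections import Counter

# Count non-ASCII characters that fall outside the Tamil block
foreign = Counter()
for text in ta_data.select(range(1000))["text"]:
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F and not (0x0B80 <= cp <= 0x0BFF):
            foreign[ch] += 1

print(foreign.most_common(20))  # emoji, Greek, Arabic, etc. would show up here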
+1, I also encountered the same problem. I tried trainers.WordPieceTrainer, and BertTokenizerFast.train_new_from_iterator gives the same result; they do not respect the vocab_size parameter either.
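For completeness, a minimal WordPiece variant of the script above, matching the setup this comment describes (a sketch; it reuses ta_data and batch_iterator from the original post, and the reported behavior is the commenter's, not verified here):

from tokenizers import Tokenizer, trainers, models, pre_tokenizers

# Same pipeline with WordPiece instead of BPE
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=2000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
print(tokenizer.get_vocab_size())  # reportedly also ends up above 2000 on the full data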