
BPE Trainer doesn't respect the vocab_size parameter when dataset size is increased #1514

Open
Abhinay1997 opened this issue Apr 25, 2024 · 1 comment


@Abhinay1997

I'm training a new tokenizer on an Indic language, Tamil. I tried two different runs:

Test run with part of the data used for training (~0.3 GB)

from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/data-0.parquet', split='train')
ta_data = ta_data.remove_columns(
    [col for col in ta_data.column_names if col != "text"]
)

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]","[UNK]","[SPACE]","[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')

This gives me a vocab file with exactly 2000 tokens, as in the attached file, and the merges are computed correctly.
ta_vocab_pretok_2000.json
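
A quick way to double-check those numbers from the saved file (a minimal sketch; it only assumes the tokenizer saved above and the standard tokenizers / json APIs):

import json
from tokenizers import Tokenizer

# Reload the serialized tokenizer and ask the library for its vocab size.
tok = Tokenizer.from_file('./ta_vocab_pretok_2000.json')
print(tok.get_vocab_size())  # 2000 in this run

# The saved JSON also exposes the BPE model directly under "model".
with open('./ta_vocab_pretok_2000.json', encoding='utf-8') as f:
    data = json.load(f)
print(len(data['model']['vocab']))   # number of vocab entries
print(len(data['model']['merges']))  # number of learned merge rules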

Run with the entire dataset used for training (~15 GB)

from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/*', split='train')
ta_data = ta_data.remove_columns(
    [col for col in ta_data.column_names if col != "text"]
)

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]","[UNK]","[SPACE]","[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')

This gives me a much larger vocab file with no merges. The vocab count is ~5800, ignoring the vocab_size of 2000 I passed to the trainer.
ta_vocab_pretok_2000_full_data.json
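
To see what the extra entries are, the oversized vocab can be dumped from the saved file (again a sketch, assuming the attachment above; single-character entries would point at the base alphabet rather than learned merges):

import json

with open('./ta_vocab_pretok_2000_full_data.json', encoding='utf-8') as f:
    data = json.load(f)

vocab = data['model']['vocab']
merges = data['model']['merges']
print(len(vocab), len(merges))  # ~5800 entries, no merges in this run

# Count how many entries are single characters (emoji, Greek, Arabic, etc.).
single_chars = [tok for tok in vocab if len(tok) == 1]
print(len(single_chars))
print(sorted(single_chars)[:50])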

Questions:

  1. Why does the tokenizer ignore the vocab_size parameter passed to the trainer?
  2. Where are the non-Tamil tokens coming from? The emoji, Greek, Arabic, and other-language tokens?
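
Regarding question 1, one thing that may be related (an assumption, not verified on the full corpus): the BPE base alphabet is built from every distinct character in the training data, and BpeTrainer exposes a limit_alphabet parameter to cap it. A minimal sketch (the value 1000 is an arbitrary choice, only meant to stay below vocab_size):

from tokenizers import trainers

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]

# limit_alphabet caps how many distinct characters are kept in the base alphabet
# (an assumption about the cause, not a confirmed fix).
trainer = trainers.BpeTrainer(
    vocab_size=2000,
    special_tokens=special_tokens,
    limit_alphabet=1000,
)
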
@indiejoseph

indiejoseph commented May 6, 2024

+1, I also encountered the same problem. I tried trainers.WordPieceTrainer, and BertTokenizerFast.train_new_from_iterator gives the same result; neither respects the vocab_size parameter.
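
A WordPiece setup mirroring the BPE example above (a sketch, not the exact code used; it reuses ta_data and batch_iterator from the issue description):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Same pipeline as the BPE runs, with the WordPiece model and trainer swapped in.
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.WordPieceTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
print(tokenizer.get_vocab_size())  # reportedly also ends up above 2000 on large data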
