
BPE Trainer doesn't respect the vocab_size parameter when dataset size is increased #1514

Open
Abhinay1997 opened this issue Apr 25, 2024 · 1 comment


@Abhinay1997

I'm training a new tokenizer on an Indic language, Tamil. I tried two different runs:

Test run with part of the data used for training (~0.3 GB)

from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/data-0.parquet', split='train')
ta_data = ta_data.remove_columns(
    [col for col in ta_data.column_names if col != "text"]
)

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]","[UNK]","[SPACE]","[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')

This gives me a vocab file with exactly 2000 tokens, as in the attached file, and the merges are computed correctly.
ta_vocab_pretok_2000.json
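
A quick way to double-check those numbers from the saved file (a minimal sketch; it only assumes the tokenizer saved above and the standard tokenizers / json APIs):

import json
from tokenizers import Tokenizer

# Reload the serialized tokenizer and ask the library for its vocab size.
tok = Tokenizer.from_file('./ta_vocab_pretok_2000.json')
print(tok.get_vocab_size())  # 2000 in this run

# The saved JSON also exposes the BPE model directly under "model".
with open('./ta_vocab_pretok_2000.json', encoding='utf-8') as f:
    data = json.load(f)
print(len(data['model']['vocab']))   # number of vocab entries
print(len(data['model']['merges']))  # number of learned merge rules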

Run with the entire dataset used for training (~15 GB)

from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/*', split='train')
ta_data = ta_data.remove_columns(
    [col for col in ta_data.column_names if col != "text"]
)

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]","[UNK]","[SPACE]","[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')

This gives me a much larger vocab file with no merges. The vocab count is ~5800, ignoring the vocab_size of 2000 I passed to the trainer.
ta_vocab_pretok_2000_full_data.json
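
To see what the extra entries are, the oversized vocab can be dumped from the saved file (again a sketch, assuming the attachment above; single-character entries would point at the base alphabet rather than learned merges):

import json

with open('./ta_vocab_pretok_2000_full_data.json', encoding='utf-8') as f:
    data = json.load(f)

vocab = data['model']['vocab']
merges = data['model']['merges']
print(len(vocab), len(merges))  # ~5800 entries, no merges in this run

# Count how many entries are single characters (emoji, Greek, Arabic, etc.).
single_chars = [tok for tok in vocab if len(tok) == 1]
print(len(single_chars))
print(sorted(single_chars)[:50])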

Questions:

  1. Why does the tokenizer ignore the vocab_size parameter passed to the trainer?
  2. Where are the non-Tamil tokens coming from? The emoji, Greek, Arabic, and other-language tokens?
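
Regarding question 1, one thing that may be related (an assumption, not verified on the full corpus): the BPE base alphabet is built from every distinct character in the training data, and BpeTrainer exposes a limit_alphabet parameter to cap it. A minimal sketch (the value 1000 is an arbitrary choice, only meant to stay below vocab_size):

from tokenizers import trainers

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]

# limit_alphabet caps how many distinct characters are kept in the base alphabet
# (an assumption about the cause, not a confirmed fix).
trainer = trainers.BpeTrainer(
    vocab_size=2000,
    special_tokens=special_tokens,
    limit_alphabet=1000,
)
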
@indiejoseph

indiejoseph commented May 6, 2024

+1, I also encountered the same problem. I tried trainers.WordPieceTrainer, and BertTokenizerFast.train_new_from_iterator gives the same result; neither respects the vocab_size parameter.
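
A WordPiece setup mirroring the BPE example above (a sketch, not the exact code used; it reuses ta_data and batch_iterator from the issue description):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Same pipeline as the BPE runs, with the WordPiece model and trainer swapped in.
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.WordPieceTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
print(tokenizer.get_vocab_size())  # reportedly also ends up above 2000 on large data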
