
Faster whisper loads the wrong tokenizer for whisper-large-v3 derivatives #835

Open
AmgadHasan opened this issue May 13, 2024 · 2 comments

Comments

@AmgadHasan

Hi
If tokenizer.json isn't available in the model directory, the faster-whisper loader automatically downloads the tokenizer from Hugging Face, which is a good thing. However, it always downloads the openai/whisper-tiny tokenizer. This causes problems if the model is whisper-large-v3 or derived from it, since that model has a different tokenizer: its special token IDs are offset by 1 because large-v3 introduced a new language token.
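For illustration, here is a minimal check of the mismatch (a sketch; it assumes network access to the Hub, and the IDs in the comments are what the published tokenizers report):

import tokenizers

# Compare where the same special token lands in the two tokenizers.
# whisper-large-v3 added a new language token, so the special tokens
# that follow the language list are shifted by one.
tiny = tokenizers.Tokenizer.from_pretrained("openai/whisper-tiny")
large_v3 = tokenizers.Tokenizer.from_pretrained("openai/whisper-large-v3")

print(tiny.token_to_id("<|transcribe|>"))      # 50359 in pre-v3 multilingual tokenizers
print(large_v3.token_to_id("<|transcribe|>"))  # 50360 in whisper-large-v3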

Can we modify the code below so that it downloads a tokenizer that matches the model?

tokenizer_file = os.path.join(model_path, "tokenizer.json")
if tokenizer_bytes:
    self.hf_tokenizer = tokenizers.Tokenizer.from_buffer(tokenizer_bytes)
elif os.path.isfile(tokenizer_file):
    self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
else:
    self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
        "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
    )
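For example, something along these lines (just a sketch, not necessarily what #834 does; is_large_v3 is a hypothetical flag, which could for instance be derived from the number of mel bins, since large-v3 uses 128 instead of 80):

import os
import tokenizers

def load_hf_tokenizer(model_path, tokenizer_bytes, is_multilingual, is_large_v3):
    # Same precedence as today: in-memory bytes first, then a local tokenizer.json.
    tokenizer_file = os.path.join(model_path, "tokenizer.json")
    if tokenizer_bytes:
        return tokenizers.Tokenizer.from_buffer(tokenizer_bytes)
    if os.path.isfile(tokenizer_file):
        return tokenizers.Tokenizer.from_file(tokenizer_file)
    # Fall back to a Hub tokenizer that matches the model family
    # instead of hard-coding openai/whisper-tiny.
    if is_large_v3:
        return tokenizers.Tokenizer.from_pretrained("openai/whisper-large-v3")
    return tokenizers.Tokenizer.from_pretrained(
        "openai/whisper-tiny" + ("" if is_multilingual else ".en")
    )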
@AmgadHasan
Author

I've created a PR to fix this issue: #834

@AmgadHasan
Author

Can you please check this issue and the related PR?
@trungkienbkhn
