
Unable to reproduce results from the paper #131

Closed
MLMonkATGY opened this issue May 13, 2024 · 6 comments · Fixed by #132


@MLMonkATGY

Hi.
Can the exact code from run_eval.py be used to reproduce the results from Table 16 ? I tried to benchmark distill-whisper-v2 on distil-whisper/common_voice_13_0 dataset and found the WER is a few percent higher than what was reported in the paper?

@bryanyzhu

+1

@sanchit-gandhi
Collaborator

Hey @MLMonkATGY! Could you share the arguments you're passing to run_eval.py so that I can reproduce locally? I believe this is because we are using the BasicTextNormalizer in the PyTorch script run_eval.py:

normalizer = (
    BasicTextNormalizer()
    if data_args.language is not None
    else EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)
)

Whereas in the original Flax scripts, we always used the EnglishTextNormalizer:

normalizer = EnglishTextNormalizer(tokenizer.english_spelling_normalizer)

You should be able to reproduce the results one-to-one if you use the Flax script. I'll also update the PyTorch script to use the EnglishTextNormalizer when the language is English!
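
For reference, the selection logic after the fix could look something along these lines (a minimal sketch, not the exact diff from the PR; data_args and processor are the same objects used in run_eval.py, and the "en"/"english" check is an assumption for illustration):

from transformers.models.whisper.english_normalizer import (
    BasicTextNormalizer,
    EnglishTextNormalizer,
)

# Sketch: prefer the English normalizer whenever the evaluation language is
# English (or unset), and only fall back to the basic, language-agnostic
# normalizer for other languages.
if data_args.language is None or data_args.language.lower() in ("en", "english"):
    normalizer = EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)
else:
    normalizer = BasicTextNormalizer()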

@MLMonkATGY
Author

MLMonkATGY commented May 20, 2024

I used the following arguments for run_eval.py.

python run_eval.py \
    --model_name_or_path "distil-whisper/distil-large-v2" \
    --dataset_name distil-whisper/common_voice_13_0 \
    --dataset_config_name en \
    --dataset_split_name test \
    --text_column_name text \
    --batch_size 128 \
    --dtype "bfloat16" \
    --generation_max_length 256 \
    --language "en" \
    --attn_implementation "flash_attention_2" \
    --streaming True

@sanchit-gandhi
Collaborator

sanchit-gandhi commented May 21, 2024

Hey @MLMonkATGY, after merging #132, I evaluated the model with the following:

#!/bin/bash

python run_eval.py \
    --model_name_or_path "distil-whisper/distil-large-v2" \
    --dataset_name "distil-whisper/common_voice_13_0" \
    --dataset_config_name "en" \
    --dataset_split_name "test" \
    --text_column_name "text" \
    --batch_size 128 \
    --dtype "bfloat16" \
    --generation_max_length 256 \
    --language "en" \
    --streaming True

And got a WER of 13.0%: https://wandb.ai/sanchit-gandhi/distil-whisper-speed-benchmark/runs/7qihyqbx?nw=nwusersanchitgandhi

This is within 0.1% of the 12.9% WER reported in the paper. The 0.1% difference is expected, since the paper's WER results were obtained with Flax on TPU, whereas the run_eval.py script runs in PyTorch on GPU; the two backends implement matrix multiplications slightly differently, which gives subtly different results.

Note that all WER results in the paper are computed with Flax, so the comparison between large-v2 and distil-large-v2 is valid. Note also that all RTF values in the paper were computed in PyTorch on GPU, so they are most applicable to downstream use cases. I hope that helps!
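
As a rough illustration of that last point (an assumed toy example, not an actual TPU-vs-GPU comparison), computing the same matrix product with a different precision already disagrees slightly, and tiny logit differences like this can occasionally flip a decoded token:

import torch

torch.manual_seed(0)
a = torch.randn(256, 256)
b = torch.randn(256, 256)

# Same product computed two ways: full float32 vs. bfloat16 (the dtype used in
# the run_eval.py command above), cast back to float32 for comparison.
ref = a @ b
low = (a.bfloat16() @ b.bfloat16()).float()

print("max abs difference:", (ref - low).abs().max().item())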

@sanchit-gandhi
Collaborator

All in all, PR #132 means that evaluating models in English with the PyTorch script run_eval.py now gives WER results within 0.1% of those quoted in the paper (which were obtained with the Flax script flax/run_eval.py).
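
For anyone who wants to sanity-check the effect of the normalizer outside of run_eval.py, here's a rough sketch (the example sentences are made up, and the use of the evaluate library's "wer" metric is illustrative rather than the exact computation in the script):

import evaluate
from transformers import WhisperProcessor
from transformers.models.whisper.english_normalizer import (
    BasicTextNormalizer,
    EnglishTextNormalizer,
)

processor = WhisperProcessor.from_pretrained("distil-whisper/distil-large-v2")
english_normalizer = EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)
basic_normalizer = BasicTextNormalizer()

wer_metric = evaluate.load("wer")

# Assumed toy pair: the English normalizer expands contractions and
# abbreviations ("Mr." -> "mister", "hasn't" -> "has not"), so purely
# orthographic differences no longer count as errors.
reference = "Mr. Brown hasn't been to the cafe."
prediction = "mister brown has not been to the cafe"

for name, norm in [("english", english_normalizer), ("basic", basic_normalizer)]:
    wer = wer_metric.compute(predictions=[norm(prediction)], references=[norm(reference)])
    print(f"{name} normalizer WER: {wer:.2f}")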

@MLMonkATGY
Author

Thanks!
