Describe the bug
Following the PR that adds whisper feature extraction (#2320): the transcript produced in that PR contains lots of `<|notimestamps|>`.
This is because I exported the whisper tokenizer with 99 languages, but the whisper-large-v3 tokenizer has 100 languages. The token corresponding to the extra output in the transcript is actually `<|nospeech|>`, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but it is still significantly worse than the output of transcribe.py, which simulates streaming inference.
I think there is still some mismatch in the decoder part; at the very least, it's not possible to pass a blank token id through the CLI.
For CTC prefix beam search, there is a field for this in the config:
wenet/runtime/core/decoder/ctc_prefix_beam_search.h, line 30 (commit 4c81459)
wenet/runtime/core/decoder/params.h, line 75 (commit 4c81459)
and the WFST decoder assumes the first token is blank:
wenet/runtime/core/decoder/ctc_wfst_beam_search.cc, line 80 (commit 4c81459)
To Reproduce
As described above
Expected behavior
CLI can handle blank tokens for different models, not just the case where the first token is blank.
[Bonus] decoder_main produces the same result as transcribe.py for a streaming fine-tuned whisper model trained with hybrid CTC + attention loss.