Describe the bug
Following the PR that adds whisper feature extraction (#2320): the transcript produced in that PR contains lots of `<|notimestamps|>`.
This is because I exported the whisper tokenizer with 99 languages, but the whisper-large-v3 tokenizer has 100 languages. The token corresponding to the extra output in the transcript is actually `<|nospeech|>`, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but it is still significantly worse than the output of transcribe.py, which simulates streaming inference.
I think there is still some mismatch in the decoder part; at the very least, it's not possible to pass a blank token id through the CLI.
For CTC prefix beam search, there is a field for this in the config:
wenet/runtime/core/decoder/ctc_prefix_beam_search.h, line 30 (commit 4c81459)
wenet/runtime/core/decoder/params.h, line 75 (commit 4c81459)
and the WFST decoder assumes the first token is blank:
wenet/runtime/core/decoder/ctc_wfst_beam_search.cc, line 80 (commit 4c81459)
To Reproduce
As described above
Expected behavior
CLI can handle blank tokens for different models, not just the case where the first token is blank.
[Bonus] decoder_main produces the same result as transcribe.py for a streaming fine-tuned whisper model trained with hybrid CTC + attention loss.