
Blank token id hard coded in C++ decoders #2329

Open
zhr1201 opened this issue Jan 31, 2024 · 3 comments

@zhr1201 (Contributor) commented Jan 31, 2024

Describe the bug
Following the PR that adds whisper feature extraction (#2320): the transcript produced in that PR contains many <|notimestamps|> tokens. This is because I exported the whisper tokenizer with 99 languages, but for whisper large v3 the tokenizer has 100 languages. The token corresponding to the extra output in the transcript is actually <|nospeech|>, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but it is still significantly worse than the output produced by transcribe.py, which simulates streaming inference.

I think there is still some mismatch in the decoder part; at the very least, it is not possible to pass the blank token id in through the CLI.

For CTC prefix beam search, there is a blank id field in the config:

```cpp
int blank = 0;  // blank id
```

but it is not exposed in the `// DecodeOptions flags` section of the CLI. And the WFST decoder assumes the first token is blank:

```cpp
float blank_score = std::exp(logp[i][0]);
```

To Reproduce
As described above

Expected behavior
CLI can handle blank tokens for different models, not just the case where the first token is blank.

[Bonus] decoder main produces the same result as transcribe.py for streaming finetuned whisper trained with hybrid ctc+attention loss


@robin1001 (Collaborator) commented:

Yes, it is by design, and it worked fine before. It is a problem for the whisper pretrained model, since there is no blank token in the whisper tokenizer.

@robin1001 (Collaborator) commented:

@xingchensong any comment?

@xingchensong (Member)
