
Are timestamp tokens used in previous text? #794

Open · George0828Zhang opened this issue Apr 19, 2024 · 0 comments

George0828Zhang commented Apr 19, 2024

According to Figure 1 in the Whisper paper, during training the previous text tokens do not contain timestamp tokens. However, when using transcribe with without_timestamps=False and condition_on_previous_text=True, the prompt tokens (which contain the previous text) are passed into the model with timestamp tokens included. Printing out prompt right before this line confirms it:

result = self.model.generate(

There were several timestamp tokens in there.
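For reference, here is a minimal standalone sketch of what I mean. It uses openai-whisper's tokenizer (the two implementations share the prompt format) to rebuild a previous segment the way transcribe() stores it, and shows that the timestamp tokens survive into the prompt; the token values are illustrative:

```python
# Minimal sketch (requires openai-whisper). A decoded segment is stored as
# text tokens interleaved with timestamp tokens; with
# condition_on_previous_text=True those tokens are reused as the prompt.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

# One decoded segment, with timestamps (each timestamp step is 0.02 s).
segment_tokens = (
    [tokenizer.timestamp_begin]          # <|0.00|>
    + tokenizer.encode("This is sentence 1.")
    + [tokenizer.timestamp_begin + 51]   # <|1.02|>
)

# The prompt is the previous tokens prefixed by <|startofprev|>,
# timestamp tokens included.
prompt = [tokenizer.sot_prev] + segment_tokens
print(tokenizer.decode_with_timestamps(prompt))
# <|startofprev|><|0.00|>This is sentence 1.<|1.02|>

print(any(t >= tokenizer.timestamp_begin for t in prompt))  # True
```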

If this is indeed the case, wouldn't it cause a train-test mismatch?

  1. Timestamp tokens never appeared before <|startoftranscript|> during training, but they do at inference time.
  2. Timestamp tokens only cover the range [0, 30] seconds, so it does not make sense for both the previous and the current segments to carry timestamps within [0, 30]. For example:

Previous segment:
<|startofprev|><|0.00|>This is sentence 1.<|1.02|>...<|29.02|>This is another.<|29.54|>

Current transcript:

<|startoftranscript|><|0.02|>This is another.<|1.04|>...

As I understand it, the openai implementation also does this, so I opened a discussion there as well: openai/whisper#2140
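If the goal is to match the training-time format from Figure 1, one conceivable workaround would be to drop timestamp tokens from the accumulated tokens before they are reused as the previous-text prompt. A minimal sketch, again assuming openai-whisper's tokenizer (strip_timestamp_tokens is a hypothetical helper, not part of either library):

```python
# Hypothetical workaround sketch, not part of faster-whisper or openai-whisper:
# filter out timestamp tokens before reusing previous tokens as the prompt.
from typing import List

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

def strip_timestamp_tokens(tokens: List[int]) -> List[int]:
    # Timestamp token ids sit at the top of the vocabulary, starting at
    # tokenizer.timestamp_begin, so a simple threshold removes them all.
    return [t for t in tokens if t < tokenizer.timestamp_begin]

prev_segment = (
    [tokenizer.timestamp_begin + 1451]    # <|29.02|>
    + tokenizer.encode("This is another.")
    + [tokenizer.timestamp_begin + 1477]  # <|29.54|>
)
prompt = [tokenizer.sot_prev] + strip_timestamp_tokens(prev_segment)
print(tokenizer.decode(prompt))  # <|startofprev|>This is another.
```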
