
Are timestamp tokens used in previous text? #794

Open · George0828Zhang opened this issue Apr 19, 2024 · 0 comments

George0828Zhang commented Apr 19, 2024

According to Figure 1 in the Whisper paper, during training the previous text tokens do not contain timestamp tokens. However, when using transcribe with without_timestamps=False and condition_on_previous_text=True, the prompt tokens (which contain the previous text) are passed into the model with timestamp tokens included. Printing out prompt right before this line confirms it:

result = self.model.generate(

There were several timestamp tokens in there.
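For reference, here is a minimal standalone sketch of what I mean. It uses openai-whisper's tokenizer (the two implementations share the prompt format) to rebuild a previous segment the way transcribe() stores it, and shows that the timestamp tokens survive into the prompt; the token values are illustrative:

```python
# Minimal sketch (requires openai-whisper). A decoded segment is stored as
# text tokens interleaved with timestamp tokens; with
# condition_on_previous_text=True those tokens are reused as the prompt.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

# One decoded segment, with timestamps (each timestamp step is 0.02 s).
segment_tokens = (
    [tokenizer.timestamp_begin]          # <|0.00|>
    + tokenizer.encode("This is sentence 1.")
    + [tokenizer.timestamp_begin + 51]   # <|1.02|>
)

# The prompt is the previous tokens prefixed by <|startofprev|>,
# timestamp tokens included.
prompt = [tokenizer.sot_prev] + segment_tokens
print(tokenizer.decode_with_timestamps(prompt))
# <|startofprev|><|0.00|>This is sentence 1.<|1.02|>

print(any(t >= tokenizer.timestamp_begin for t in prompt))  # True
```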

If this is indeed the case, wouldn't it cause a train-test mismatch?

  1. Timestamp tokens never appeared before <|startoftranscript|> during training, but they do at inference time.
  2. Timestamp tokens only cover the range [0, 30] seconds, so it does not make sense for both the previous and the current segments to carry timestamps within [0, 30]. For example:

Previous segment:
<|startofprev|><|0.00|>This is sentence 1.<|1.02|>...<|29.02|>This is another.<|29.54|>

Current transcript:

<|startoftranscript|><|0.02|>This is another.<|1.04|>...

As I understand it, the openai implementation also does this, so I opened a discussion there as well: openai/whisper#2140
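If the goal is to match the training-time format from Figure 1, one conceivable workaround would be to drop timestamp tokens from the accumulated tokens before they are reused as the previous-text prompt. A minimal sketch, again assuming openai-whisper's tokenizer (strip_timestamp_tokens is a hypothetical helper, not part of either library):

```python
# Hypothetical workaround sketch, not part of faster-whisper or openai-whisper:
# filter out timestamp tokens before reusing previous tokens as the prompt.
from typing import List

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

def strip_timestamp_tokens(tokens: List[int]) -> List[int]:
    # Timestamp token ids sit at the top of the vocabulary, starting at
    # tokenizer.timestamp_begin, so a simple threshold removes them all.
    return [t for t in tokens if t < tokenizer.timestamp_begin]

prev_segment = (
    [tokenizer.timestamp_begin + 1451]    # <|29.02|>
    + tokenizer.encode("This is another.")
    + [tokenizer.timestamp_begin + 1477]  # <|29.54|>
)
prompt = [tokenizer.sot_prev] + strip_timestamp_tokens(prev_segment)
print(tokenizer.decode(prompt))  # <|startofprev|>This is another.
```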
