English text normalization utilization for Eager Streaming Mode #111

atiorh · 2024-04-08T23:10:55Z

Eager Streaming Mode relies on confirming the currently predicted text tokens against at least one redundant historical prediction.
Whisper is susceptible to outputting tokens that differ trivially (e.g. "gonna" vs. "going to", "amortisation" vs. "amortization") for almost identical audio input. This happens occasionally and causes unnecessary slowdown, because opportunities to confirm predicted text tokens earlier are missed.
Memory and Latency Regression Tests #99 implements English Text Normalization, which can be integrated into the token confirmation logic in Eager Streaming Mode to avoid these unnecessary slowdowns.
Note that this would not alter the predicted tokens themselves or the associated KV cache. It only changes the confirmation criterion so that near matches with a trivial string variation are treated as matches.
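
To make the proposal concrete, here is a rough Python sketch of the idea, not the project's actual implementation. The names `normalize`, `confirmed_word_count`, `CONTRACTIONS`, and `SPELLINGS` are hypothetical, and the two lookup tables are deliberately tiny stand-ins for the normalizer from #99. The sketch compares two successive hypotheses word by word and counts how long their common prefix is, with and without normalization.

```python
import re

# Hypothetical, deliberately tiny normalization tables; a real English
# normalizer covers far more contractions and spelling variants.
CONTRACTIONS = {"gonna": "going to", "wanna": "want to"}
SPELLINGS = {"amortisation": "amortization", "normalisation": "normalization"}


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, expand contractions, unify spellings."""
    words = re.sub(r"[^\w\s']", " ", text.lower()).split()
    out: list[str] = []
    for word in words:
        word = SPELLINGS.get(word, word)
        out.extend(CONTRACTIONS.get(word, word).split())
    return out


def confirmed_word_count(previous: str, current: str, use_normalizer: bool) -> int:
    """Count the leading words on which two successive hypotheses agree."""
    if use_normalizer:
        a, b = normalize(previous), normalize(current)
    else:
        a, b = previous.split(), current.split()
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Two hypotheses for almost identical audio that differ only trivially.
previous = "we're gonna discuss amortisation today"
current = "we're going to discuss amortization today and tomorrow"

# Exact matching stops at "gonna" vs "going".
exact = confirmed_word_count(previous, current, use_normalizer=False)
# Normalized matching agrees on the whole previous hypothesis.
relaxed = confirmed_word_count(previous, current, use_normalizer=True)
```

In this sketch only the comparison sees normalized text; the hypotheses themselves are left untouched, which mirrors the point that the predicted tokens and KV cache are not modified. A real integration would also need to map the agreed normalized words back to token positions, which is omitted here.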

ZachNagengast · 2024-05-07T15:53:50Z

Utilities to help with this will be included with #120.

ZachNagengast linked a pull request May 7, 2024 that will close this issue

English Normalisation and WER Utils #120

Draft

4 tasks

ZachNagengast removed a link to a pull request May 7, 2024

English Normalisation and WER Utils #120

