return stride information in callback from speech recognition #605

shopped · 2024-02-25T23:43:59Z

In the whisper-web repo, I noticed that timestamps are off while the transcription is being streamed in, but correct once the transcription is complete. If you pause the demo video during generation, you will see a 1:02 timestamp for a 1-minute audio clip, which is overwritten with 0:57 on completion.

This is because timestamps are computed correctly for completed chunks, but are off when partial updates are streamed back during model generation. The callback for completed chunks in pipelines.js includes stride information, but the callback during partial generation in models.js does not, and the decode_asr function in tokenizers.js needs stride information to produce correct timestamps.
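To illustrate why the missing stride shifts timestamps, here is a simplified sketch of how a per-chunk time offset could be accumulated. It is not the actual decode_asr code: the function name `chunkTimeOffsets` and the `[chunkLen, strideLeft, strideRight]` values (in seconds) are assumptions made for this example, using a 30 s chunk with a 5 s stride on each side.

```javascript
// Simplified model of the chunk time offset; NOT the real decode_asr implementation.
// Each chunk may carry stride = [chunkLen, strideLeft, strideRight] in seconds (assumed shape).
function chunkTimeOffsets(chunks) {
  let timeOffset = 0;
  const offsets = [];
  for (const chunk of chunks) {
    if (chunk.stride) {
      const [chunkLen, strideLeft, strideRight] = chunk.stride;
      // The left stride overlaps audio already covered by the previous chunk.
      timeOffset -= strideLeft;
      offsets.push(timeOffset);
      // Advance past this chunk, excluding the right overlap.
      timeOffset += chunkLen - strideRight;
    } else {
      // No stride info: the left overlap cannot be subtracted.
      offsets.push(timeOffset);
    }
  }
  return offsets;
}

const completed = [{ stride: [30, 0, 5] }, { stride: [30, 5, 5] }];
// Third chunk is still being generated and arrives without stride information.
const withoutStride = chunkTimeOffsets([...completed, {}]);
// Same chunk once stride information is attached.
const withStride = chunkTimeOffsets([...completed, { stride: [20, 5, 0] }]);

console.log(withoutStride[2] - withStride[2]); // prints 5
```

In this model the in-progress chunk starts 5 s too late when its stride is missing, which matches the 1:02 versus 0:57 discrepancy described above.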

This is not a major issue right now, since the timestamps are only off by 5 s and the error is transient and corrects itself. However, I want to make some improvements to whisper-web, such as adding word-level timestamps, making decoding more efficient, and adding streaming transcription of microphone audio, and the timestamp discrepancies matter more for those changes. I could work around the error in a fork of whisper-web, but making the change here is cleaner. If this change is merged, the timestamp issue can be mitigated by adding last.stride = item.stride; after line 134 in worker.js in whisper-web. I can open a PR for that.
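A rough sketch of what that whisper-web change could look like is below. The function `applyPartialUpdate` and the shape of `item` are hypothetical stand-ins for the worker's partial-update callback, not the actual worker.js code; the only line taken from the description above is `last.stride = item.stride;`.

```javascript
// Hypothetical stand-in for the partial-update callback in whisper-web's worker.
// Assumes `item` carries `output_token_ids` and, with this PR, a `stride` field.
function applyPartialUpdate(chunksToProcess, item) {
  const last = chunksToProcess[chunksToProcess.length - 1];
  // Refresh the tokens of the chunk that is still being generated.
  last.tokens = [...item.output_token_ids];
  // Proposed addition: carry the stride over so timestamps can be offset correctly.
  last.stride = item.stride;
  return last;
}

const chunks = [{ tokens: [], stride: [30, 0, 5] }, { tokens: [] }];
const updated = applyPartialUpdate(chunks, {
  output_token_ids: [1, 2, 3],
  stride: [30, 5, 5],
});
console.log(updated.stride);
```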

wobbble · 2024-04-04T07:10:56Z

Hey @shopped
I tried to implement your changes in my fork of transformers.js, along with the corresponding changes in whisper-web, but it didn't help.
After digging into it, I found that a single change, lowering stride_length_s from 5 to 3 in whisper-web, fixed it for me.
The details and an example audio file are here:
#551 (comment)

Do you think that change is sustainable, or will lowering the stride only help occasionally, with some files?
Thanks!
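For reference, a small sketch of how the stride length affects chunk placement, assuming consecutive windows advance by chunk_length_s minus twice stride_length_s. The helper `chunkStarts` and the 60 s / 30 s values are made up for illustration and are not part of transformers.js.

```javascript
// Illustrative helper (hypothetical): start times, in seconds, of overlapping chunks.
// Assumes each window advances by chunkLengthS - 2 * strideLengthS.
function chunkStarts(durationS, chunkLengthS, strideLengthS) {
  const jump = chunkLengthS - 2 * strideLengthS;
  if (jump <= 0) {
    throw new Error("chunk length must exceed twice the stride length");
  }
  const starts = [];
  for (let offset = 0; ; offset += jump) {
    starts.push(offset);
    // Stop once the current window reaches the end of the audio.
    if (offset + chunkLengthS >= durationS) break;
  }
  return starts;
}

console.log(chunkStarts(60, 30, 5)); // [ 0, 20, 40 ]
console.log(chunkStarts(60, 30, 3)); // [ 0, 24, 48 ]
```

A smaller stride moves the chunk boundaries and shrinks the overlap between neighbouring chunks, so it changes where any offset error shows up rather than addressing the missing stride information itself.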

shopped · 2024-04-10T23:04:01Z

Hey, thanks for taking a look and trying to reproduce. Yes, I initially hardcoded a stride length change too, as a hack to get correct values, but I decided to dig deeper, and that's how I arrived at this code change. It is odd that you are not able to reproduce it, though. Are you sure that your code references the local transformers.js change without hitting a vite cache or CDN cache?

Also, I have been getting correct word-level timestamps during testing, but this PR addresses incorrect chunk-level timestamps. I have been busy with other projects, but if I get some time and am back working on this, I will take a look at your specific issue. Cheers!

return stride information in callback from speech recognition

77666d0
