return stride information in callback from speech recognition #605

Open, 1 commit into base: main

shopped commented Feb 25, 2024

In the whisper-web repo, I noticed that timestamps are off while the transcription is being streamed in, but correct once the transcription is complete. If you pause the demo video during generation, you will see a 1:02 timestamp for a 1-minute audio clip, which is overwritten with 0:57 upon completion.

This is because timestamps are generated correctly when they are passed in as completed chunks, but are off when partial updates are streamed back from the model generation. The callback from completed chunks in pipelines.js contains stride information, but the callback during partial generation in models.js does not, and stride information is needed to get correct timestamps from the decode_asr function in tokenizers.js.
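To illustrate why the missing stride shifts the timestamps, here is a minimal sketch. It is not the decode_asr code from tokenizers.js: `chunkStartTimes`, `lengthS`, and the two-element `[left, right]` stride shape are hypothetical names chosen for the example, and the 30s chunk / 5s stride values are assumed. Each chunk overlaps its neighbours by the stride, so the running time offset has to be pulled back by the left stride and advanced only up to the start of the right stride.

```javascript
// Hypothetical helper, not the transformers.js implementation.
// Each chunk is { lengthS, stride?: [leftS, rightS] }, all values in seconds.
function chunkStartTimes(chunks) {
  const starts = [];
  let offset = 0;
  for (const chunk of chunks) {
    // A chunk without stride information is treated as having no overlap.
    const [left, right] = chunk.stride ?? [0, 0];
    offset -= left;                   // the chunk really begins `left` seconds earlier
    starts.push(offset);
    offset += chunk.lengthS - right;  // the right overlap belongs to the next chunk
  }
  return starts;
}

// Assumed setup: 30s chunks with a 5s stride on each overlapping side.
const withStride = chunkStartTimes([
  { lengthS: 30, stride: [0, 5] },
  { lengthS: 30, stride: [5, 5] },
  { lengthS: 30, stride: [5, 0] },
]);
console.log(withStride); // [0, 20, 40]

// Same chunks, but the last (still generating) chunk arrives without stride.
const missingLast = chunkStartTimes([
  { lengthS: 30, stride: [0, 5] },
  { lengthS: 30, stride: [5, 5] },
  { lengthS: 30 },
]);
console.log(missingLast); // [0, 20, 45]
```

In this toy model the chunk that lacks stride information starts 5s late, which is the same size as the transient error described above.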

This is not a major issue right now, since the timestamps are only off by 5s, and the error is transient and corrects itself. However, I want to make some improvements to whisper-web, such as adding word-level timestamps, making decoding more efficient, and adding streaming audio transcription from the microphone, and the timestamp discrepancies matter more for those changes. I can work around this error in a fork of whisper-web, but making the change here is cleaner. If this code change is merged, the timestamp issue can be mitigated by adding last.stride = item.stride; after line 134 in worker.js in whisper-web. I can open a PR for that.
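For context, the whisper-web side of the change could look roughly like the sketch below. This is not the actual worker.js code: `chunksToProcess`, `onPartialUpdate`, and the `tokens` field are hypothetical stand-ins, the stride value is a placeholder, and `item.stride` is only present if the callback change in this PR is applied.

```javascript
// Hypothetical stand-in for the worker's list of chunks being decoded.
const chunksToProcess = [{ tokens: [] }];

// Hypothetical partial-update callback, called while the model is generating.
function onPartialUpdate(item) {
  const last = chunksToProcess[chunksToProcess.length - 1];

  // Refresh the tokens of the chunk that is still being generated.
  last.tokens = [...item.tokens];

  // The proposed one-line change: carry the stride over as well, so that
  // timestamp decoding sees it for partial chunks, not only completed ones.
  if (item.stride !== undefined) {
    last.stride = item.stride;
  }
  return last;
}

// Placeholder values, only to show the data flow.
const updatedChunk = onPartialUpdate({ tokens: [1, 2, 3], stride: [30, 5, 5] });
console.log(updatedChunk.stride); // [30, 5, 5]
```

The guard on `item.stride` is an addition for the sketch, so that a callback without stride information leaves any previously stored value untouched.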

wobbble commented Apr 4, 2024

Hey @shopped
I tried applying your changes in my fork of transformers.js, along with the corresponding change in whisper-web, but it didn't help.
After digging into it, I found that a single change, lowering stride_length_s from 5 to 3 in whisper-web, fixed it for me.
Details and an example audio file are here:
#551 (comment)

Do you think that change is sustainable, or will lowering the stride only help occasionally, with some files?
Thanks!

shopped (Author) commented Apr 10, 2024

Hey, thanks for taking a look and trying to reproduce. Yes, I initially hardcoded a stride length change too as a hack to get correct values, but decided to dig deeper, and that's how I arrived at this code change. It is odd, though, that you are not able to reproduce the fix. Are you sure that your code references the local transformers.js change without hitting a vite cache or CDN cache?

Also, I have been getting correct word-level timestamps during testing, but this PR addresses incorrect chunk-level timestamps. I have been busy with other projects, but if I get some time and am back working on this, I will take a look at your specific issue. Cheers!
