Addressing Whisper STT issues #5929

Open
mamei16 wants to merge 50 commits into base: dev

mamei16
Contributor

@mamei16 mamei16 commented Apr 24, 2024

Even after the first Whisper STT issue following the big Gradio update was fixed (#5856), multiple others are still present. The purpose of this PR is to address the issues by either implementing workarounds, or ideally fixing the underlying problems, provided they don't originate in Gradio itself.

Issue 1: Chrome only (workaround found)

Reports: #5869, #5920, #5805

Description

Users report the following exception message after recording audio in Chrome: audioop.error: not a whole number of frames.
Looking into the issue, I noticed that the audio data obtained when recording audio with Firefox has the following format:

(44100, array([[0, 0],
       [0, 0],
       [0, 0],
       ...,
       [0, 0],
       [0, 0],
       [0, 0]], dtype=int16))

Here, the first tuple item is the sample rate and the second is the array of samples. Notice that each sample consists of two values, which I assume simply means the audio is recorded in stereo.
Now compare this to the audio data obtained when using Chromium:

(44100, array([0, 0, 0, ..., 0, 0, 0], dtype=int16))

In this case, each sample consists of only a single value, which suggests that the audio is recorded in mono instead of stereo. In any case, this different data format causes the aforementioned audioop.error. This PR provides a workaround by stacking the sample data column-wise whenever it is not already a two-dimensional numpy array.
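
As a rough illustration of the idea (not the exact code in this PR), the workaround could look like the sketch below. The helper name `ensure_two_channels` is hypothetical, and duplicating the mono channel into both columns is an assumption about what "stacking column-wise" means here.

```python
import numpy as np


def ensure_two_channels(audio):
    """Return (sample_rate, samples) with samples shaped (n_frames, 2).

    `audio` is the (sample_rate, samples) tuple described above.
    Hypothetical helper for illustration only.
    """
    sample_rate, samples = audio
    samples = np.asarray(samples)
    if samples.ndim == 1:
        # Mono input (the Chromium case): duplicate the single channel
        # so the result has the same layout as the Firefox recording.
        samples = np.column_stack((samples, samples))
    return sample_rate, samples


# Chromium-style input: a flat array of int16 samples.
chromium_audio = (44100, np.zeros(6, dtype=np.int16))
rate, fixed = ensure_two_channels(chromium_audio)
print(fixed.shape, fixed.dtype)
```

Input that is already two-dimensional (the Firefox case) passes through unchanged, so the same code path can serve both browsers.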

Issue 2: Firefox only (Gradio Issue)

Reports: #5920

Description

It is reported that the UI in Firefox is sluggish and even causes the browser to crash after a number of recordings have been made. The source of the problem remains to be identified.

Here's what I have found out so far:
After stopping a recording in Firefox, some JavaScript function appears to be called over and over again indefinitely, each time triggering the error "Invalid URI. Load of media resource failed." This leads to excessive CPU usage even when the web UI is idle after recording audio. It appears that after each recording, another asynchronous function starts calling the problematic function, so CPU usage increases further with each recording until the web UI becomes laggy and Firefox finally crashes.

Update: This error even occurs with a minimal POC Gradio program, so I have created an issue in Gradio's repo: gradio-app/gradio#8135.

Checklist:

oobabooga and others added 18 commits February 14, 2024 11:32
@TimStrauven

TimStrauven commented Apr 25, 2024

Hi, I posted Whisper STT overhaul #5563 a while ago to also address some of the STT issues and to move away from the speechrecognition lib.
Code there might help to implement this one? (Still needs "audio.stop_recording" like you changed in the other PR, and needs to be extended for other architectures than only cuda and cpu)

@mamei16
Contributor Author

mamei16 commented Apr 25, 2024

Hi, that overhaul definitely contains some nice changes, especially removing the need for the speech-recognition dependency! If you'd like to polish it a little to make it "merge-ready", I propose you make a fork of the main webUI where we can collaborate on that.

@oobabooga
Owner

@mamei16 is this PR ready for merging? I see that you managed to put the record button next to the Generate button. I had tried to do that myself in the past and failed, so thanks for that.

@mamei16
Contributor Author

mamei16 commented May 21, 2024

Yeah, I think so. I've also created another version based on @TimStrauven's overhaul, but due to some bug it barely recognizes anything (even though it should be more "correct" than this version, since it's closer to what OpenAI is doing).

Firefox still crashes after a number of transcriptions, but unfortunately the Gradio devs have yet to react to my issue in any way, so not much to do there :/

@mamei16 mamei16 marked this pull request as ready for review May 21, 2024 15:46