
Documenting audio length limitations for OpenAI Whisper API. #680

Open
xmoiduts opened this issue Feb 18, 2024 · 9 comments

@xmoiduts

When using the OpenAI Whisper API option to transcribe a long audio file (1 hour), Buzz reports:

"Failed (Maximum content size limit (26214400) exceeded (2629xxxx) bytes read) {..."

According to the OpenAI community, the Whisper API has a 25 MB file limit, while Buzz converts input files to PCM at a 16,000 Hz sample rate. Assuming 16-bit mono, that is 32,000 bytes per second, so audio longer than roughly 819 seconds (26,214,400 / 32,000) exceeds the limit; I tried 850 seconds and the error happened.

I recommend:

  1. documenting the length limit;
    or,
  2. to accommodate longer audio, re-encoding the input media to a higher-compression format for OpenAI Whisper API jobs when the input audio is already at a low bitrate (a sketch follows below).
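
A minimal sketch of the second suggestion, assuming `ffmpeg` is available on PATH; the constants restate the numbers from the error message, and the function names and the 48 kbps bitrate are illustrative, not Buzz's actual code:

```python
import subprocess

API_LIMIT_BYTES = 26_214_400    # 25 MiB, per the error message
PCM_BYTES_PER_SEC = 16_000 * 2  # 16 kHz * 16-bit mono = 32,000 B/s

def fits_as_pcm(duration_secs: float) -> bool:
    """True if the PCM conversion stays under the API limit (~819 s)."""
    return duration_secs * PCM_BYTES_PER_SEC <= API_LIMIT_BYTES

def reencode_for_api(src: str, dst: str = "compressed.mp3", bitrate: str = "48k") -> str:
    """Re-encode to a compressed format instead of PCM so longer audio fits."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-b:a", bitrate, dst],
        check=True,
    )
    return dst
```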
@ccchan234

ccchan234 commented Feb 20, 2024

GPT-polished, short version:
Hello, I encountered an issue with Buzz while processing a 13MB .m4a file from a 27-minute video. It failed, but my custom Python script succeeded. This suggests a potential issue with Buzz. Could it be looked into? Thank you.

Long, human, unpolished version:
Hi, for a 27-minute video,
I extracted the .m4a, which is 13 MB.
I fed it into Buzz, and it failed as above.
But when I used MY OWN Python script, it worked.
So there is something wrong with Buzz.
It needs improvement.
Thanks

@ccchan234

ccchan234 commented Feb 21, 2024

I am not sure what's wrong with Buzz's transcription code, so I'll just paste the relevant part of mine here.

I am no programmer; I got this from ChatGPT Plus.

```python
import os
import tkinter as tk
from tkinter import filedialog, messagebox, ttk

import openai  # legacy (pre-1.0) SDK; openai.api_key must be set before use


class AudioTranscriberApp:
    def __init__(self, master):
        self.master = master
        master.title("Audio Transcriber")

        self.label = tk.Label(master, text="Choose an audio file to transcribe")
        self.label.pack()

        # Language selection
        self.language_label = tk.Label(master, text="Select Language:")
        self.language_label.pack()

        self.language = tk.StringVar()
        self.language_combobox = ttk.Combobox(master, textvariable=self.language)
        # Include the most common languages and then the UN + G7 + Korean languages
        self.language_combobox['values'] = (
            'English (en)', 'Chinese (zh)', 'Japanese (ja)', 'Korean (ko)',  # Most common
            'French (fr)', 'Russian (ru)', 'Spanish (es)', 'Arabic (ar)',  # UN languages
            'German (de)', 'Italian (it)'  # G7 languages
        )
        self.language_combobox['state'] = 'readonly'  # Prevent user from typing a value
        self.language_combobox.set('English (en)')  # Set default value
        self.language_combobox.pack()

        self.transcribe_button = tk.Button(master, text="Choose File", command=self.transcribe_audio)
        self.transcribe_button.pack()

        self.close_button = tk.Button(master, text="Close", command=master.quit)
        self.close_button.pack()

    def generate_new_filename(self, path):
        # This helper was missing from the original paste; a minimal stand-in
        # that appends a counter rather than overwriting an existing file.
        base, ext = os.path.splitext(path)
        counter = 1
        while os.path.exists(path):
            path = f"{base}_{counter}{ext}"
            counter += 1
        return path

    def transcribe_audio(self):
        audio_file_path = filedialog.askopenfilename(
            filetypes=[("Audio Files", "*.mp3 *.wav *.m4a"), ("All Files", "*.*")]
        )

        if not audio_file_path:
            # If no file is selected, do nothing
            return

        # Extract the language code from the selection, e.g. 'English (en)' -> 'en'
        language_code = self.language.get().split(' ')[-1].strip('()')

        base_filename = os.path.splitext(audio_file_path)[0]
        srt_file_path = self.generate_new_filename(f"{base_filename}.srt")
        txt_file_path = self.generate_new_filename(f"{base_filename}.txt")

        try:
            # Open the audio file in binary mode and request transcription in the selected language
            with open(audio_file_path, "rb") as audio_file:
                transcript_response = openai.Audio.transcribe(
                    file=audio_file,
                    model="whisper-1",
                    response_format="srt",
                    language=language_code
                )

            # With response_format="srt" the legacy SDK returns the SRT text directly;
            # the 'choices' branch is only a fallback for dict-shaped responses
            transcription_text = transcript_response['choices'][0]['text'] if 'choices' in transcript_response else transcript_response

            # Save the SRT transcription
            with open(srt_file_path, 'w') as srt_file:
                srt_file.write(transcription_text)
            print("Transcription (SRT format) saved to:", srt_file_path)

            # Convert SRT to plain text: drop index lines, blanks, and timestamps
            with open(srt_file_path, 'r') as srt_file, open(txt_file_path, 'w') as txt_file:
                for line in srt_file:
                    if line.strip().isdigit() or line.strip() == '' or '-->' in line:
                        continue
                    txt_file.write(line)
            print("Plain text saved to:", txt_file_path)

            messagebox.showinfo("Success", "Transcription completed successfully!\nFiles have been saved.")

        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")
```
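
As an aside, `openai.Audio.transcribe` is from the legacy pre-1.0 SDK. A roughly equivalent request with the current (v1.0+) `openai` package looks like this sketch, assuming `OPENAI_API_KEY` is set in the environment and `audio.m4a` is a stand-in filename:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.m4a", "rb") as audio_file:
    # With response_format="srt" the API returns the SRT text as a string
    srt_text = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt",
        language="en",
    )
```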

@ccchan234

(quoting @xmoiduts's original report above)

For an educational video, 3:30 of video needs about 1.7 MB of .m4a (smaller than .mp3).

So a video of about 49 minutes could be processed by my script in theory; not too bad.

But I did see someone with a script that cuts the audio into parts and submits them individually.

I'll borrow code from that script later if I have time.

@chidiwilliams
Owner

Thanks for the report here. I fixed this issue months ago (#652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

@ccchan234

(quoting @chidiwilliams's reply above)

Hey, on a website someone showed how to split the file,
grab the SRT and TXT,
THEN join them together.

I am a naive programmer using ChatGPT; I could copy that approach.

Indeed, I was still writing it yesterday.

Do you think you can incorporate it,
or show me the relevant function, and I'll use ChatGPT to implement it for you?

I don't want to re-invent the wheel.

The author's YouTube video on this:
https://www.youtube.com/watch?v=-FtsoKryhPY&t=638s

The lab on Colab:
https://colab.research.google.com/github/ywchiu/largitdata/blob/master/code/Course_221.ipynb#scrollTo=XfeeQGUQQOwx

The page on GitHub:
https://github.com/ywchiu/largitdata/blob/master/code/Course_221.ipynb

Essentially, split using pydub. (There were some Chinese comments; I asked GPT to translate the comments and NOT touch the code. Please verify against the code above, as GPT sometimes malfunctions.)
```python
#@title Split YouTube Video
from pydub import AudioSegment

#@markdown ### Length of the segment to split (in milliseconds):
segment_length = 1000000  #@param {type:"integer"}

# Load the MP3 audio file (`filename` is defined in an earlier Colab cell)
sound = AudioSegment.from_file(f'{filename}.mp3', format='mp3')

sound_track = []

# Split the audio file into multiple files
for i, chunk in enumerate(sound[::segment_length]):
    # Set the filename for the split file
    chunk.export(f'output_{i}.mp3', format='mp3')
    audio_file = open(f'output_{i}.mp3', "rb")
    sound_track.append(audio_file)
```

He uses Jupyter/Colab, so the code is in blocks.
His idea is to split the file at a fixed 1,000,000 ms
(very important, otherwise it's hard to calculate the SRT times),
then submit the chunks to the Whisper API in a loop.
Then, for each SRT file, make sure the maximum timestamp is <= the chunk's duration.

code:
```python
max_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles1:
    sub.start = sub.start if sub.start < max_time else max_time
    sub.end = sub.end if sub.end < max_time else max_time
```

Then join the SRT files (and the TXT files; that part is easy for TXT).
The timestamps in each file need to be recalculated with respect to the file's position in the loop; a combined sketch follows the next snippet.

code:
```python
shift_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles2:
    sub.start = sub.start + shift_time
    sub.end = sub.end + shift_time
```
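
Putting the split, clamp, and shift steps together, merging the per-chunk SRT files could look like this minimal sketch. The `output_{i}.srt` filenames, the chunk count, and `merged.srt` are assumptions carried over from the Colab snippets; `pysrt.open`, `shift`, `clean_indexes`, and `save` are standard pysrt calls:

```python
import pysrt

SEGMENT_SECONDS = 1000  # matches segment_length = 1,000,000 ms above
num_chunks = 4          # hypothetical: however many output_{i}.srt files exist

merged = pysrt.SubRipFile()
for i in range(num_chunks):
    subs = pysrt.open(f'output_{i}.srt')
    # Shift this chunk's timestamps by its offset within the original audio
    subs.shift(seconds=i * SEGMENT_SECONDS)
    merged.extend(subs)

merged.clean_indexes()  # renumber the merged subtitles 1..N
merged.save('merged.srt', encoding='utf-8')
```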

This will make Buzz much more useful!

@ccchan234

I need such a function badly now,
so I think I'll try to implement it in a simple Python script this weekend.

If you are interested, please let me know;
I guess you use Python too.

Then point me to the part that deals with the processing, and I could implement it there.

PS: I am no programmer.
I can code mostly thanks to some college courses and ChatGPT, so please don't expect me to do it perfectly alone.

Thanks

@ccchan234

PS: I tried to do the splitting with .m4a using pydub and ffmpeg, but I failed.
.mp3 was much easier last night. (The above Colab also uses mp3; it's larger in size but seems easier to handle than .m4a.)

So I'll try again with .mp3 soon.

@ccchan234

ccchan234 commented Feb 24, 2024

(quoting @chidiwilliams's reply above)

Hi, I see you are implementing it too?
https://app.codecov.io/gh/chidiwilliams/buzz/pull/652/blob/buzz/transcriber.py

OK, I'll wait for your reply at the weekend.
I hope both SRT and TXT will be working well.
Thank you.


The excerpt of buzz/transcriber.py I'm looking at:

```python
# If the file is larger than 25MB, split into chunks
# and transcribe each chunk separately
num_chunks = math.ceil(total_size / max_chunk_size)
chunk_duration = duration_secs / num_chunks

segments = []

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
```
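
For intuition, plugging hypothetical numbers into that excerpt's arithmetic: a 1-hour recording converted to 16 kHz 16-bit mono PCM comes to about 115 MB, which would split into 5 chunks of 720 seconds each (the variable names mirror the excerpt; the input figures are made up):

```python
import math

total_size = 3600 * 32_000   # ~115 MB of 16 kHz 16-bit mono PCM
max_chunk_size = 26_214_400  # the 25 MiB limit from the error message
duration_secs = 3600.0

num_chunks = math.ceil(total_size / max_chunk_size)  # -> 5
chunk_duration = duration_secs / num_chunks          # -> 720.0 seconds

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
    print(f"chunk {i}: {chunk_start:.0f}s to {chunk_end:.0f}s")
```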

@ccchan234

(quoting @chidiwilliams's reply and my own earlier comment above)

Hi, for those who want a temporary solution,
the above Colab script is a good start.

It helps to split and merge the video, one piece at a time.

It would be useful while we wait for the update.

Buzz is good in that it can handle many files at once (though not recursively; and there's no need to, as people likely won't do that, since recursively processing media files would be very CPU-demanding).

Thanks
