
Documenting audio length limitations for OpenAI Whisper API. #680

Open
xmoiduts opened this issue Feb 18, 2024 · 9 comments

@xmoiduts

When using the OpenAI Whisper API option to transcribe a long audio file (1 hour), Buzz reports:

"Failed (Maximum content size limit (26214400) exceeded (2629xxxx) bytes read) {..."

According to the OpenAI community, the Whisper API has a 25 MB file limit, while Buzz converts input files to PCM at a 16,000 Hz sample rate. Assuming 16-bit mono, that is 32,000 bytes per second, so audio longer than roughly 819 seconds (26,214,400 / 32,000) exceeds the limit; I tried 850 seconds and the error happened.

I recommend:

  1. documenting the length limit;
    or,
  2. to accommodate longer audio, re-encoding the input media to a higher-compression format for OpenAI Whisper API jobs when the input audio is already at a low bitrate (a sketch follows below).
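
A minimal sketch of the second suggestion, assuming `ffmpeg` is available on PATH; the constants restate the numbers from the error message, and the function names and the 48 kbps bitrate are illustrative, not Buzz's actual code:

```python
import subprocess

API_LIMIT_BYTES = 26_214_400    # 25 MiB, per the error message
PCM_BYTES_PER_SEC = 16_000 * 2  # 16 kHz * 16-bit mono = 32,000 B/s

def fits_as_pcm(duration_secs: float) -> bool:
    """True if the PCM conversion stays under the API limit (~819 s)."""
    return duration_secs * PCM_BYTES_PER_SEC <= API_LIMIT_BYTES

def reencode_for_api(src: str, dst: str = "compressed.mp3", bitrate: str = "48k") -> str:
    """Re-encode to a compressed format instead of PCM so longer audio fits."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-b:a", bitrate, dst],
        check=True,
    )
    return dst
```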
@ccchan234

ccchan234 commented Feb 20, 2024

GPT-polished, short version:
Hello, I encountered an issue with Buzz while processing a 13MB .m4a file from a 27-minute video. It failed, but my custom Python script succeeded. This suggests a potential issue with Buzz. Could it be looked into? Thank you.

Long, human, unpolished version:
Hi, for a 27-minute video,
I extracted the .m4a, which is 13 MB.
I fed it into Buzz, and it failed as above.
But when I used MY OWN Python script, it worked.
So there is something wrong with Buzz.
It needs improvement.
Thanks

@ccchan234

ccchan234 commented Feb 21, 2024

I am not sure what's wrong with Buzz's transcription code, so I'll just paste the relevant part of mine here.

I am no programmer; I got this from ChatGPT Plus.

```python
import os
import tkinter as tk
from tkinter import filedialog, messagebox, ttk

import openai  # legacy (pre-1.0) SDK; openai.api_key must be set before use


class AudioTranscriberApp:
    def __init__(self, master):
        self.master = master
        master.title("Audio Transcriber")

        self.label = tk.Label(master, text="Choose an audio file to transcribe")
        self.label.pack()

        # Language selection
        self.language_label = tk.Label(master, text="Select Language:")
        self.language_label.pack()

        self.language = tk.StringVar()
        self.language_combobox = ttk.Combobox(master, textvariable=self.language)
        # Include the most common languages and then the UN + G7 + Korean languages
        self.language_combobox['values'] = (
            'English (en)', 'Chinese (zh)', 'Japanese (ja)', 'Korean (ko)',  # Most common
            'French (fr)', 'Russian (ru)', 'Spanish (es)', 'Arabic (ar)',  # UN languages
            'German (de)', 'Italian (it)'  # G7 languages
        )
        self.language_combobox['state'] = 'readonly'  # Prevent user from typing a value
        self.language_combobox.set('English (en)')  # Set default value
        self.language_combobox.pack()

        self.transcribe_button = tk.Button(master, text="Choose File", command=self.transcribe_audio)
        self.transcribe_button.pack()

        self.close_button = tk.Button(master, text="Close", command=master.quit)
        self.close_button.pack()

    def generate_new_filename(self, path):
        # This helper was missing from the original paste; a minimal stand-in
        # that appends a counter rather than overwriting an existing file.
        base, ext = os.path.splitext(path)
        counter = 1
        while os.path.exists(path):
            path = f"{base}_{counter}{ext}"
            counter += 1
        return path

    def transcribe_audio(self):
        audio_file_path = filedialog.askopenfilename(
            filetypes=[("Audio Files", "*.mp3 *.wav *.m4a"), ("All Files", "*.*")]
        )

        if not audio_file_path:
            # If no file is selected, do nothing
            return

        # Extract the language code from the selection, e.g. 'English (en)' -> 'en'
        language_code = self.language.get().split(' ')[-1].strip('()')

        base_filename = os.path.splitext(audio_file_path)[0]
        srt_file_path = self.generate_new_filename(f"{base_filename}.srt")
        txt_file_path = self.generate_new_filename(f"{base_filename}.txt")

        try:
            # Open the audio file in binary mode and request transcription in the selected language
            with open(audio_file_path, "rb") as audio_file:
                transcript_response = openai.Audio.transcribe(
                    file=audio_file,
                    model="whisper-1",
                    response_format="srt",
                    language=language_code
                )

            # With response_format="srt" the legacy SDK returns the SRT text directly;
            # the 'choices' branch is only a fallback for dict-shaped responses
            transcription_text = transcript_response['choices'][0]['text'] if 'choices' in transcript_response else transcript_response

            # Save the SRT transcription
            with open(srt_file_path, 'w') as srt_file:
                srt_file.write(transcription_text)
            print("Transcription (SRT format) saved to:", srt_file_path)

            # Convert SRT to plain text: drop index lines, blanks, and timestamps
            with open(srt_file_path, 'r') as srt_file, open(txt_file_path, 'w') as txt_file:
                for line in srt_file:
                    if line.strip().isdigit() or line.strip() == '' or '-->' in line:
                        continue
                    txt_file.write(line)
            print("Plain text saved to:", txt_file_path)

            messagebox.showinfo("Success", "Transcription completed successfully!\nFiles have been saved.")

        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")
```
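
As an aside, `openai.Audio.transcribe` is from the legacy pre-1.0 SDK. A roughly equivalent request with the current (v1.0+) `openai` package looks like this sketch, assuming `OPENAI_API_KEY` is set in the environment and `audio.m4a` is a stand-in filename:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.m4a", "rb") as audio_file:
    # With response_format="srt" the API returns the SRT text as a string
    srt_text = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt",
        language="en",
    )
```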

@ccchan234

(quoting @xmoiduts's original report above)

For an educational video, 3:30 of video needs about 1.7 MB of .m4a (smaller than .mp3).

So a video of about 49 minutes could be processed by my script in theory; not too bad.

But I did see someone with a script that cuts the audio into parts and submits them individually.

I'll borrow code from that script later if I have time.

@chidiwilliams
Owner

Thanks for the report here. I fixed this issue months ago (#652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

@ccchan234

(quoting @chidiwilliams's reply above)

Hey, on a website someone showed how to split the file,
grab the SRT and TXT,
THEN join them together.

I am a naive programmer using ChatGPT; I could copy that approach.

Indeed, I was still writing it yesterday.

Do you think you can incorporate it,
or show me the relevant function, and I'll use ChatGPT to implement it for you?

I don't want to re-invent the wheel.

The author's YouTube video on this:
https://www.youtube.com/watch?v=-FtsoKryhPY&t=638s

The lab on Colab:
https://colab.research.google.com/github/ywchiu/largitdata/blob/master/code/Course_221.ipynb#scrollTo=XfeeQGUQQOwx

The page on GitHub:
https://github.com/ywchiu/largitdata/blob/master/code/Course_221.ipynb

Essentially, split using pydub. (There were some Chinese comments; I asked GPT to translate the comments and NOT touch the code. Please verify against the code above, as GPT sometimes malfunctions.)
```python
#@title Split YouTube Video
from pydub import AudioSegment

#@markdown ### Length of the segment to split (in milliseconds):
segment_length = 1000000  #@param {type:"integer"}

# Load the MP3 audio file (`filename` is defined in an earlier Colab cell)
sound = AudioSegment.from_file(f'{filename}.mp3', format='mp3')

sound_track = []

# Split the audio file into multiple files
for i, chunk in enumerate(sound[::segment_length]):
    # Set the filename for the split file
    chunk.export(f'output_{i}.mp3', format='mp3')
    audio_file = open(f'output_{i}.mp3', "rb")
    sound_track.append(audio_file)
```

He uses Jupyter/Colab, so the code is in blocks.
His idea is to split the file at a fixed 1,000,000 ms
(very important, otherwise it's hard to calculate the SRT times),
then submit the chunks to the Whisper API in a loop.
Then, for each SRT file, make sure the maximum timestamp is <= the chunk's duration.

code:
```python
max_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles1:
    sub.start = sub.start if sub.start < max_time else max_time
    sub.end = sub.end if sub.end < max_time else max_time
```

Then join the SRT files (and the TXT files; that part is easy for TXT).
The timestamps in each file need to be recalculated with respect to the file's position in the loop; a combined sketch follows the next snippet.

code:
```python
shift_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles2:
    sub.start = sub.start + shift_time
    sub.end = sub.end + shift_time
```
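
Putting the split, clamp, and shift steps together, merging the per-chunk SRT files could look like this minimal sketch. The `output_{i}.srt` filenames, the chunk count, and `merged.srt` are assumptions carried over from the Colab snippets; `pysrt.open`, `shift`, `clean_indexes`, and `save` are standard pysrt calls:

```python
import pysrt

SEGMENT_SECONDS = 1000  # matches segment_length = 1,000,000 ms above
num_chunks = 4          # hypothetical: however many output_{i}.srt files exist

merged = pysrt.SubRipFile()
for i in range(num_chunks):
    subs = pysrt.open(f'output_{i}.srt')
    # Shift this chunk's timestamps by its offset within the original audio
    subs.shift(seconds=i * SEGMENT_SECONDS)
    merged.extend(subs)

merged.clean_indexes()  # renumber the merged subtitles 1..N
merged.save('merged.srt', encoding='utf-8')
```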

This will make Buzz much more useful!

@ccchan234

I need such a function badly now,
so I think I'll try to implement it in a simple Python script this weekend.

If you are interested, please let me know;
I guess you use Python too.

Then point me to the part that deals with the processing, and I could implement it there.

PS: I am no programmer.
I can code mostly thanks to some college courses and ChatGPT, so please don't expect me to do it perfectly alone.

Thanks

@ccchan234

PS: I tried to do the splitting with .m4a using pydub and ffmpeg, but I failed.
.mp3 was much easier last night. (The above Colab also uses mp3; it's larger in size but seems easier to handle than .m4a.)

So I'll try again with .mp3 soon.

@ccchan234

ccchan234 commented Feb 24, 2024

(quoting @chidiwilliams's reply above)

Hi, I see you are implementing it too?
https://app.codecov.io/gh/chidiwilliams/buzz/pull/652/blob/buzz/transcriber.py

OK, I'll wait for your reply at the weekend.
I hope both SRT and TXT will be working well.
Thank you.


The excerpt of buzz/transcriber.py I'm looking at:

```python
# If the file is larger than 25MB, split into chunks
# and transcribe each chunk separately
num_chunks = math.ceil(total_size / max_chunk_size)
chunk_duration = duration_secs / num_chunks

segments = []

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
```
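
For intuition, plugging hypothetical numbers into that excerpt's arithmetic: a 1-hour recording converted to 16 kHz 16-bit mono PCM comes to about 115 MB, which would split into 5 chunks of 720 seconds each (the variable names mirror the excerpt; the input figures are made up):

```python
import math

total_size = 3600 * 32_000   # ~115 MB of 16 kHz 16-bit mono PCM
max_chunk_size = 26_214_400  # the 25 MiB limit from the error message
duration_secs = 3600.0

num_chunks = math.ceil(total_size / max_chunk_size)  # -> 5
chunk_duration = duration_secs / num_chunks          # -> 720.0 seconds

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
    print(f"chunk {i}: {chunk_start:.0f}s to {chunk_end:.0f}s")
```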

@ccchan234

(quoting @chidiwilliams's reply and my own earlier comment above)

Hi, for those who want a temporary solution,
the above Colab script is a good start.

It helps to split and merge the video, one piece at a time.

It would be useful while we wait for the update.

Buzz is good in that it can handle many files at once (though not recursively; and there's no need to, as people likely won't do that, since recursively processing media files would be very CPU-demanding).

Thanks
