
Resample audio file in chunks to reduce memory usage #16

Open
finnvoor opened this issue Feb 6, 2024 · 4 comments
Labels
enhancement, help wanted

Comments

@finnvoor
Contributor

finnvoor commented Feb 6, 2024

    let newFrameLength = Int64((sampleRate / audioFile.fileFormat.sampleRate) * Double(audioFile.length))
    let outputFormat = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: channelCount)!
    guard let converter = AVAudioConverter(from: audioFile.processingFormat, to: outputFormat) else {
        Logging.error("Failed to create audio converter")
        return nil
    }
    let frameCount = AVAudioFrameCount(audioFile.length)
    guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: frameCount),
          let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: AVAudioFrameCount(newFrameLength))
    else {
        Logging.error("Unable to create buffers, likely due to unsupported file format")
        return nil
    }
    do {
        try audioFile.read(into: inputBuffer, frameCount: frameCount)
    } catch {
        Logging.error("Error reading audio file: \(error)")
        return nil
    }

Creating an AVAudioPCMBuffer for the whole input audio file can easily surpass iOS memory limits.

Attempting to transcribe a 44.1 kHz, 2-channel, ~1 hr video crashes on iOS due to running out of memory. It would be nice if, instead of reading all the input audio into a buffer at once and converting it, the audio were read and converted in chunks to reduce memory usage.

Another less common issue that would be solved by chunking the audio is that AVAudioPCMBuffer has a maximum capacity of UInt32.max, which can be hit when transcribing a 1-2 hr, 16-channel, 44.1 kHz audio file. This is a fairly typical audio file for a podcast recorded with a RODECaster Pro.

@ZachNagengast
Contributor

Hi @finnvoor, this totally makes sense, thanks for reporting it. There is an option to try that I'll recommend with the current codebase, and a path we could take moving forward that I'm curious to get your feedback on.

The first option would be handling the chunking on the app side by using the transcribe interface that accepts an audioArray:

    public func transcribe(audioArray: [Float],
                           decodeOptions: DecodingOptions? = nil,
                           callback: TranscriptionCallback = nil) async throws -> TranscriptionResult?

Pseudo code for that would look similar to how streaming is done:

  1. Generate a 30s array of samples from the audio file
        var currentSeek = 0
        guard let audioFile = try? AVAudioFile(forReading: URL(string: audioFilePath)!) else { return nil }
        audioFile.framePosition = currentSeek
        let inputBuffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: AVAudioFrameCount(audioFile.fileFormat.sampleRate * 30.0))
        try? audioFile.read(into: inputBuffer!)
  2. Convert it to 16khz 1 channel
        let desiredFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32,
            sampleRate: Double(WhisperKit.sampleRate),
            channels: AVAudioChannelCount(1),
            interleaved: false
        )!
        let converter = AVAudioConverter(from: audioFile.processingFormat, to: desiredFormat)
        let audioArray = try? AudioProcessor.resampleBuffer(inputBuffer!, with: converter!)
  3. Transcribe that section and find the last index of the sample we have transcribed so far
        let transcribeResult = try await whisperKit.transcribe(audioArray: audioArray, decodeOptions: options)
        let nextSeek = (transcribeResult?.segments.last?.end)! * Float(WhisperKit.sampleRate)
  4. Restart from step one using that as the new frame position
        audioFile.framePosition = currentSeek + nextSeek

Using this you could generate a multitude of TranscriptionResults and merge them together as they come in (a fuller end-to-end sketch follows below). This is similar to how we do streaming in the example app.
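
For reference, here is a rough end-to-end version of that loop, written against the interfaces shown above (`WhisperKit.sampleRate`, `AudioProcessor.resampleBuffer`, and `transcribe(audioArray:)`); treat it as pseudo code as well, since the exact signatures may differ slightly. One deliberate difference from the steps above: the seek offset is converted back into source-file frames using the file's own sample rate before advancing `framePosition`.

    import AVFoundation
    import WhisperKit

    // Sketch only: assumes AudioProcessor.resampleBuffer yields the [Float] samples
    // that transcribe(audioArray:) expects, as in the snippet above.
    func transcribeFileInChunks(
        at url: URL,
        using whisperKit: WhisperKit,
        options: DecodingOptions? = nil
    ) async throws -> [TranscriptionResult] {
        let audioFile = try AVAudioFile(forReading: url)
        let sourceFormat = audioFile.processingFormat

        // Target format for Whisper: 16 kHz, mono, Float32
        guard let desiredFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32,
            sampleRate: Double(WhisperKit.sampleRate),
            channels: 1,
            interleaved: false
        ), let converter = AVAudioConverter(from: sourceFormat, to: desiredFormat) else {
            return []
        }

        var results: [TranscriptionResult] = []
        var currentSeek: AVAudioFramePosition = 0
        let windowFrames = AVAudioFrameCount(sourceFormat.sampleRate * 30.0)

        while currentSeek < audioFile.length {
            // 1. Read the next ~30s window from disk
            audioFile.framePosition = currentSeek
            guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: sourceFormat, frameCapacity: windowFrames) else { break }
            try audioFile.read(into: inputBuffer, frameCount: windowFrames)
            if inputBuffer.frameLength == 0 { break }

            // 2. Resample the window to 16 kHz mono
            let audioArray = try AudioProcessor.resampleBuffer(inputBuffer, with: converter)

            // 3. Transcribe the window
            guard let result = try await whisperKit.transcribe(audioArray: audioArray, decodeOptions: options) else { break }
            results.append(result)

            // 4. Advance the seek point to the end of the last decoded segment
            //    (segment times are in seconds, so convert using the source sample rate);
            //    fall back to a full 30s step if nothing was decoded
            if let lastEnd = result.segments.last?.end, lastEnd > 0 {
                currentSeek += AVAudioFramePosition(Double(lastEnd) * sourceFormat.sampleRate)
            } else {
                currentSeek += AVAudioFramePosition(windowFrames)
            }
        }
        return results
    }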

As for a new option that would make this easier and built in: there might be a protocol method we'd want to add that simply requests audio from the input file at predefined intervals (like 20s -> 50s, 50s -> 80s) and loads it from disk rather than storing it all in memory. That way, when we reach the end of the current 30s window and update the seek point, we could request the next window from whatever is available on disk, or otherwise end the loop.
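
Purely as an illustration of that idea (none of these names exist in the codebase today), such a protocol could look something like:

    import Foundation

    // Hypothetical sketch: a provider that loads windows of audio from disk on demand
    protocol AudioWindowProvider {
        /// Total duration of the source audio, in seconds
        var duration: TimeInterval { get }

        /// Returns 16 kHz mono samples covering [startTime, endTime), loaded from
        /// disk on demand; returns nil once the requested window is past the end of
        /// the audio, which is the signal to end the transcription loop
        func requestSamples(from startTime: TimeInterval, to endTime: TimeInterval) throws -> [Float]?
    }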

We have also been thinking about a way to use the "streaming" logic for static audio files from disk (bulk transcription is an upcoming focus for us), so this might be a good way to go to keep the codebase simple, but I'm curious to hear what you think.

@finnvoor
Contributor Author

finnvoor commented Feb 8, 2024

Thanks for the info! We can definitely split the audio and transcribe in chunks ourselves, but what I like so much about WhisperKit is how it handles all the annoying bits for you, so I think it would be nice if it would split large files automatically.

Ideally we could just pass a URL to a file of any length and get back a transcript. For our use case we don't need any streaming, but the protocol method could work (it's just a bit more effort on the client side).

I do think the easiest and simplest way to fix these bugs is just to add a loop in resampleAudio that reads the input file in chunks (the input file can easily exceed memory limits, but the resampled audio would have to be incredibly long to hit them), but I understand if you want a more general solution.
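
For what it's worth, a minimal sketch of that kind of loop (untested, and not the actual resampleAudio implementation): read the source file a few seconds at a time, convert each chunk with a single AVAudioConverter so its internal state carries across chunk boundaries, and only accumulate the comparatively small 16 kHz mono output:

    import AVFoundation

    func resampleAudioInChunks(
        audioFile: AVAudioFile,
        toSampleRate sampleRate: Double = 16000,
        chunkDuration: Double = 10.0
    ) -> [Float]? {
        let inputFormat = audioFile.processingFormat
        guard let outputFormat = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1),
              let converter = AVAudioConverter(from: inputFormat, to: outputFormat)
        else { return nil }

        var samples: [Float] = []
        let chunkFrames = AVAudioFrameCount(inputFormat.sampleRate * chunkDuration)

        while audioFile.framePosition < audioFile.length {
            // Read only one small chunk of the (potentially huge) source file at a time
            guard let chunk = AVAudioPCMBuffer(pcmFormat: inputFormat, frameCapacity: chunkFrames) else { return nil }
            do { try audioFile.read(into: chunk, frameCount: chunkFrames) } catch { return nil }
            if chunk.frameLength == 0 { break }

            // Convert this chunk into a correspondingly small output buffer
            let outputCapacity = AVAudioFrameCount(Double(chunk.frameLength) * sampleRate / inputFormat.sampleRate) + 1024
            guard let converted = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: outputCapacity) else { return nil }

            var chunkConsumed = false
            var conversionError: NSError?
            let status = converter.convert(to: converted, error: &conversionError) { _, inputStatus in
                if chunkConsumed {
                    // No more input for now; the converter keeps its internal state,
                    // so the next chunk continues seamlessly
                    inputStatus.pointee = .noDataNow
                    return nil
                }
                chunkConsumed = true
                inputStatus.pointee = .haveData
                return chunk
            }
            guard status != .error else { return nil }

            // Append the converted mono float samples and move on to the next chunk
            if let channelData = converted.floatChannelData {
                samples.append(contentsOf: UnsafeBufferPointer(start: channelData[0], count: Int(converted.frameLength)))
            }
        }
        // Note: a final flush pass (.endOfStream) to drain any frames still buffered
        // inside the converter is omitted here for brevity
        return samples
    }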

@ZachNagengast added the enhancement and help wanted labels on Feb 16, 2024
@vade

vade commented Feb 21, 2024

Many moons ago I wrote a pure AVFoundation-based CMSampleBuffer decoder which only keeps 30 seconds of buffers in memory, so you never go above that:

I'm unsure if it's helpful, but you can find the code here: https://github.com/vade/OpenAI-Whisper-CoreML/blob/feature/RosaKit/Whisper/Whisper/Whisper/Whisper.swift#L361

I lost steam on my Whisper CoreML port, but would be happy to contribute if anything I can add is helpful!
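
The general shape of that approach, roughly sketched (this is not the exact code in that file, just the idea): pull decoded buffers sequentially with AVAssetReader and hand each one off as it arrives, so only the chunk currently being processed has to live in memory.

    import AVFoundation

    // Rough sketch of the streaming-decode idea, not the linked implementation
    func streamPCMChunks(from url: URL, handler: ([Float]) -> Void) throws {
        let asset = AVAsset(url: url)
        guard let track = asset.tracks(withMediaType: .audio).first else { return }

        let reader = try AVAssetReader(asset: asset)
        let outputSettings: [String: Any] = [
            AVFormatIDKey: kAudioFormatLinearPCM,
            AVSampleRateKey: 16000,
            AVNumberOfChannelsKey: 1,
            AVLinearPCMBitDepthKey: 32,
            AVLinearPCMIsFloatKey: true,
            AVLinearPCMIsNonInterleaved: false
        ]
        let output = AVAssetReaderTrackOutput(track: track, outputSettings: outputSettings)
        reader.add(output)
        reader.startReading()

        // Each CMSampleBuffer holds a short run of decoded samples; copy it out and move on
        while let sampleBuffer = output.copyNextSampleBuffer() {
            guard let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else { continue }
            let byteCount = CMBlockBufferGetDataLength(blockBuffer)
            var chunk = [Float](repeating: 0, count: byteCount / MemoryLayout<Float>.size)
            chunk.withUnsafeMutableBytes { destination in
                _ = CMBlockBufferCopyDataBytes(blockBuffer, atOffset: 0, dataLength: byteCount, destination: destination.baseAddress!)
            }
            handler(chunk)
        }
    }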

@ZachNagengast
Contributor

@vade This looks nice, thanks for sharing!
