
VAD audio chunking #135

Open
wants to merge 9 commits into base: main
Conversation

@jkrukowski (Contributor) commented May 6, 2024

This PR introduces audio chunking with VAD (voice activity detection). The VAD detects speech segments in the audio file, the audio is split into chunks at those segment boundaries (padded with zeros to match the 30-second window length), and the chunks are then processed as a batch, resulting in a significant speedup.
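
Conceptually, the chunking is in the spirit of the sketch below. This is an illustration only; the function names, frame length, and energy threshold are placeholder assumptions, not the exact code in this PR:

import Foundation

// Illustration only: frame-level energy VAD plus zero-padding to the 30 s window.
// The names and the frame/threshold values are placeholder assumptions.
func energyVoiceActivity(_ samples: [Float],
                         frameLength: Int = 1600,      // 0.1 s at 16 kHz
                         threshold: Float = 0.02) -> [Bool] {
    stride(from: 0, to: samples.count, by: frameLength).map { start in
        let frame = samples[start..<min(start + frameLength, samples.count)]
        let rms = sqrt(frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count))
        return rms > threshold                          // true = speech-like frame
    }
}

func padToWindow(_ chunk: [Float], sampleRate: Int = 16_000, seconds: Int = 30) -> [Float] {
    let target = seconds * sampleRate
    guard chunk.count < target else { return Array(chunk.prefix(target)) }
    return chunk + [Float](repeating: 0, count: target - chunk.count)   // pad with zeros
}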

Some benchmarks (on my MacBook Air M1):

Audio file, 12:16 length
  • with VAD: 38.16s user 5.86s system 470% cpu 9.349 total
  • without VAD: 33.25s user 3.55s system 132% cpu 27.678 total

Audio file, 40:26 length
  • with VAD: 126.54s user 18.41s system 500% cpu 28.952 total
  • without VAD: 96.55s user 10.47s system 133% cpu 1:20.08 total

To use it in WhisperKitCLI, pass the --chunking-strategy flag:

swift run -c release whisperkit-cli transcribe --audio-path /path/to/audio.wav --chunking-strategy vad

@jkrukowski mentioned this pull request May 6, 2024
@atiorh (Contributor) commented May 6, 2024

Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?

[Screenshot: Cut and Merge strategy diagram from whisperX]

If we don't attempt to pack short segments into 30 seconds like above before padding, the worst-case performance might regress below baseline (e.g. frequently padding ~1-3s chunks to 30s). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)

Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.

@atiorh (Contributor) commented May 6, 2024

@Abhinay1997 Do you mind rebasing on top of this PR so we can add a WER check (w/ and w/o VAD-based chunking) on your long audio test sample? 🙏

@Abhinay1997 (Contributor) commented

Hey @atiorh! No worries, I'll do that by tomorrow. I want to make sure there are no bugs or crash-prone code in my PR.

@jkrukowski (Contributor, Author) commented May 7, 2024

> Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?
>
> If we don't attempt to pack short segments into 30 seconds like above before padding, the worst-case performance might regress below baseline (e.g. frequently padding ~1-3s chunks to 30s). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)
>
> Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.

I'd leave it as future work if possible. After talking to @ZachNagengast the other day I took a slightly different approach here: using VAD, I'm trying to find the best cut-off point in the second half of the 30-second audio chunk, so there is no risk of having a bunch of small segments padded with zeros (each chunk will contain at least 15 seconds of the original audio). Having said that, I think Cut and Merge is a better (but more complicated) approach.
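
Roughly, the cut-point selection is in the spirit of the sketch below (an illustration only; the function name and the frame-level details are assumptions, not the exact code in this PR):

// Illustration only: given a per-frame VAD mask for one ~30 s window, pick the
// longest silent run in the second half and cut there, so every emitted chunk
// keeps at least ~15 s of original audio. `cutFrameIndex` is a placeholder name.
func cutFrameIndex(voiceActivity: [Bool]) -> Int {
    let half = voiceActivity.count / 2
    var bestStart = voiceActivity.count   // default: no silence found, keep the full window
    var bestLength = 0
    var runStart: Int?

    for i in half..<voiceActivity.count {
        if !voiceActivity[i] {
            if runStart == nil { runStart = i }
        } else if let start = runStart {
            if i - start > bestLength { bestLength = i - start; bestStart = start }
            runStart = nil
        }
    }
    // Handle a silent run that extends to the end of the window.
    if let start = runStart, voiceActivity.count - start > bestLength {
        bestStart = start
    }
    return bestStart   // frame index at which to end the current chunk
}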

@atiorh (Contributor) commented May 7, 2024

> > Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?
> >
> > If we don't attempt to pack short segments into 30 seconds like above before padding, the worst-case performance might regress below baseline (e.g. frequently padding ~1-3s chunks to 30s). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)
> >
> > Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.
>
> I'd leave it as future work if possible. After talking to @ZachNagengast the other day I took a slightly different approach here: using VAD, I'm trying to find the best cut-off point in the second half of the 30-second audio chunk, so there is no risk of having a bunch of small segments padded with zeros (each chunk will contain at least 15 seconds of the original audio). Having said that, I think Cut and Merge is a better (but more complicated) approach.

Makes sense, this is great.

@ZachNagengast (Contributor) left a comment

Looks really nice 👍 Just a couple of suggestions for clarity and future adaptability.

self.energyThreshold = energyThreshold
}

func voiceActivity(in waveform: [Float]) -> [Bool] {
Contributor (review comment on the snippet above):

It would be nice to have a public helper method that returns the exact clip timestamps, in case someone wants to run the chunking ahead of time in their app and pass them directly via the clipTimestamps decoding option.

e.g.

let clips: [Int] = EnergyVAD().voiceActivity(in: audioArray)
let options = DecodingOptions(clipTimestamps: clips)

@jkrukowski (Contributor, Author) replied:

Added a calculateNonSilentChunks method in AudioProcessor, backed by EnergyVAD; this way we can keep the EnergyVAD class internal.
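
Hypothetical usage (the exact signature and return type of calculateNonSilentChunks aren't shown in this thread, so the shape below is an assumption):

let audioProcessor = AudioProcessor()
// Hypothetical call shape; the real API in the PR may differ.
let speechChunks = audioProcessor.calculateNonSilentChunks(in: audioArray)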

import Accelerate

/// Voice activity detection based on energy threshold
final class EnergyVAD {
Contributor (review comment on the snippet above):

We have some other VAD code in the repo; could you consolidate that code by using this new class? It may need a protocol for the other VAD methods we have coming up (mel analysis, ML-based), but it's your call on API design. I think the enum and existing chunking protocol solve this pretty well too, FWIW.
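
For illustration, one possible shape for such a protocol, assuming it mirrors the voiceActivity(in:) signature shown above (the protocol name here is not from the PR):

// Illustrative sketch only; `VoiceActivityDetecting` is a placeholder name.
protocol VoiceActivityDetecting {
    /// Per-frame speech/non-speech decisions for a mono waveform.
    func voiceActivity(in waveform: [Float]) -> [Bool]
}

// EnergyVAD (and future mel- or ML-based detectors) could then conform, e.g.:
// extension EnergyVAD: VoiceActivityDetecting {}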

@jkrukowski (Contributor, Author) replied:

For now I decided to move this other VAD code in our repo to the isVoiceDetected method in AudioProcessor. This way we can keep EnergyVAD internal until the full public interface of this class is ready.

Two additional review threads on Sources/WhisperKit/Core/EnergyVAD.swift were marked outdated and resolved.