Support for realtime audio input #10

Closed
biemster opened this issue Oct 1, 2022 · 65 comments

Labels
enhancement  New feature or request

Comments

@biemster

biemster commented Oct 1, 2022

Noting that the processing time is considerably shorter than the length of the speech, is it possible to feed the models real-time microphone output? Or does the inference run on the complete audio stream rather than sample by sample?

This would greatly reduce the latency for voice assistants and the like, since the audio would not need to be fully captured before being fed to the models. Basically the same as what I did here with SODA: https://github.com/biemster/gasr, but with an open-source and multilingual model.

@ggerganov ggerganov added the enhancement New feature or request label Oct 1, 2022
@ggerganov
Owner

The Whisper model processes the audio in chunks of 30 seconds - this is a hard constraint of the architecture.

However, what seems to work is that you can take, for example, 5 seconds of audio and pad it with 25 seconds of silence. This way you can process shorter chunks.

Given that, an obvious strategy for realtime audio transcription is the following:

T  - [data]
-------------------
1  - [1 audio, 29 silence pad] -> transcribe -> "He"
2  - [2 audio, 28 silence pad] -> transcribe -> "Hello"
3  - [3 audio, 27 silence pad] -> transcribe -> "Hello, my"
...
29 - [29 audio, 1 silence pad] -> transcribe -> "Hello, my name is John ..."

The problem with that is that you need to do the same amount of computation for 1 second of audio as you would for 2, 3, ..., 30 seconds of audio. So if your audio input step is 1 second (as shown in the example above), you will effectively do 30 times the computation that you would normally do to process the full 30 seconds.

I plan to add a basic example of real-time audio transcription using the above strategy.
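In the meantime, here is a minimal sketch of the padding idea against the C API in whisper.h (the helper name and the way the audio buffer is obtained are assumptions for illustration only, not the planned example):

// Sketch: pad a short 16 kHz mono chunk with silence up to the 30-second
// window the model expects, then run a full pass and print the segments.
#include "whisper.h"

#include <cstdio>
#include <vector>

void transcribe_padded(struct whisper_context * ctx, std::vector<float> pcmf32) {
    const size_t n_30s = 30*WHISPER_SAMPLE_RATE;
    if (pcmf32.size() < n_30s) {
        pcmf32.resize(n_30s, 0.0f); // zero samples == silence padding
    }

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s", whisper_full_get_segment_text(ctx, i));
        }
        printf("\n");
    }
}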

@biemster
Author

biemster commented Oct 1, 2022

I was a bit afraid that that would be the answer, but I'll definitely check out that basic example when it's ready!

ggerganov added a commit that referenced this issue Oct 2, 2022
- Processes input in chunks of 3 seconds.
- Padding audio with silence
- Uses 1 second audio from previous pass
- No text context
@ggerganov
Owner

Just added a very naive implementation of the idea above. To run it, simply do:

# install sdl2 (Ubuntu)
$ sudo apt-get install libsdl2-dev

# install sdl2 (Mac OS)
$ brew install sdl2

# download a model if you don't have one
$ ./download-ggml-model.sh base.en

# run the real-time audio transcription
$ make stream
$ ./stream -m models/ggml-base.en.bin

This example continuously captures audio from the mic and runs whisper on the captured audio.
The time step is currently hardcoded at 3 seconds.

The results are not great because the current implementation can chop the audio in the middle of words.
Also, the text context is reset for every new iteration.

However, all these things can be significantly improved.
Probably we need to add some sort of simple VAD as a preprocessing step.

@ggerganov
Owner

Here is a short video demonstration of near real-time transcription from the microphone:

rt_esl_csgo_1.mp4

@trholding
Contributor

Nice, but somehow I can't kill it (stream); I had to do a killall -9 stream. On an AMD 2-thread 3 GHz processor with 16 GB RAM, there is significant delay. However, I found that I get 2x realtime with the usual transcription on an audio file. Great work. I love this.

ggerganov added a commit that referenced this issue Oct 2, 2022
@ggerganov
Owner

Thanks for the feedback.
I just pushed a fix that should handle Ctrl+C correctly (it can take a few seconds to respond though).

Regarding the performance - I hope it can be improved with a better strategy to decide when to perform the inference. Currently, it is done every X seconds, regardless of the data. If we add voice detection, we should be able to run it less often. But overall, it seems that real-time transcription will always be slower compared to the original 30-seconds chunk transcription.

@trholding
Contributor

trholding commented Oct 2, 2022

Thanks for the quick fix. I have some suggestions/ideas for faster voice transcription. Give me half an hour to an hour, and I'll update here with new content.

Edit / Updated:

Here are some ideas to speed up offline, non-real-time transcription:

Removing silence helps a lot in reducing the total length of the audio (not yet tried, but obvious):

http://andrewslotnick.com/posts/speeding-up-a-speech.html#Remove-Silence

Things that I tried with good results:

First I ran a half-hour audio file through the https://github.com/xiph/rnnoise code. Then I increased the tempo to 1.5 with sox (tempo preserves pitch). After that I got good results with tiny.en, but base.en seemed to be less accurate. The overall process is much faster - real fast transcription except for the initial delay.

cd /tmp
./rnnoise_demo elon16.wav elon16.raw
sox -c 1 -r 16000 -b 16 --encoding signed-integer elon16.raw elon16_denoised.wav
sox elon16_denoised.wav elonT3.wav tempo 1.5
./main -m models/ggml-tiny.en.bin -f /tmp/elonT3.wav

Here are some ideas for faster real time transcription:

I noticed that when I ran this on a 5 sec clip, I got this result:

./main -m models/ggml-tiny.en.bin -f /tmp/rec.wav 
log_mel_spectrogram: recording length: 5.015500 s
...
main: processing 80248 samples (5.0 sec), 2 threads, lang = english, task = transcribe, timestamps = 1 ...

[00:00.000 --> 00:05.000]   Okay, this is a test. I think this will work out nicely.
[00:05.000 --> 00:10.000]   [no audio]
...
main:    total time = 18525.62 ms

Now if we could apply this:

1. VAD / silence detection (like you mentioned), splitting the audio into chunks. The result is variable-length audio chunks in memory or temp files (a rough sketch of this splitting step is shown after this comment).
2. Remove noise with rnnoise on the chunks.
3. Speed up each chunk by 1.5x, preserving pitch (the speed-up should just be an option. I learned that anything above 1.5x gives bad results unless the voice is loud, clear and slow to start with; 1.5x is safe. Ideal is 1.1-1.5x, max 2x).
4. Since we know exactly how long the sped-up chunk is, we won't need to wait for the transcription to finish...

Example:
[00:00.000 --> 00:05.000] Okay, this is a test. I think this will work out nicely. <--- We could kill it right here (because this is the total length of that file / chunk I had as an example)
[00:05.000 --> 00:10.000] [no audio] <-- This is processing on an empty buffer; when killed, it would not waste processing

VAD: https://github.com/cirosilvano/easyvad or maybe use webrtc vad?

I guess experimentation is needed to figure out the best strategy / approach to real time, considering the 30-seconds-at-once issue.
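As a rough illustration of splitting step (1) above, here is a minimal energy-based sketch in C++ (the window size, threshold and function name are made-up example values, not code from this repo):

// Sketch: split a 16 kHz mono recording into variable-length chunks at silent
// regions, so each chunk can be denoised / sped up / transcribed independently.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<std::vector<float>> split_on_silence(const std::vector<float> & pcm,
                                                 int   sample_rate     = 16000,
                                                 float window_s        = 0.1f,
                                                 float energy_thold    = 0.01f,
                                                 int   min_silent_wins = 5) {
    const size_t win = (size_t)(sample_rate*window_s);
    std::vector<std::vector<float>> chunks(1);

    int silent_run = 0;
    for (size_t i = 0; i < pcm.size(); i += win) {
        const size_t n = std::min(win, pcm.size() - i);

        double e = 0.0;
        for (size_t j = i; j < i + n; ++j) e += pcm[j]*pcm[j];
        const bool silent = std::sqrt(e/n) < energy_thold; // RMS of this window

        silent_run = silent ? silent_run + 1 : 0;

        if (silent_run == min_silent_wins && !chunks.back().empty()) {
            chunks.emplace_back(); // ~0.5 s of continuous silence: start a new chunk
        } else if (!silent) {
            chunks.back().insert(chunks.back().end(), pcm.begin() + i, pcm.begin() + i + n);
        }
    }
    if (chunks.back().empty()) chunks.pop_back();
    return chunks; // silence itself is dropped, which also shortens the audio
}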

ggerganov added a commit that referenced this issue Oct 7, 2022
Controls how often we run the inference.
By default, we run it every 3 seconds.
ggerganov added a commit that referenced this issue Oct 7, 2022
Seems the results become worse when we keep the context, so by default
this is not enabled
@ggerganov
Owner

Some improvement on the real-time transcription:

rt_esl_csgo_2.mp4

@trholding
Contributor

I'll check this out and give you feedback here tomorrow. Awesome work! Brilliant.

@moebiussurfing

moebiussurfing commented Oct 11, 2022

Hello @ggerganov, thanks for sharing!
Offline main mode tested here on Windows worked fine.

Any small tip on including SDL to make the real-time app work?

@trholding
Contributor

On resource-constrained machines it doesn't seem to be better. The previous version worked for transcribing; this one is choking the CPU with no or only intermittent output. The same kill issue persists - I think it's because processes are spawned, which makes the system laggy.

@ggerganov I also caught a Floating point exception (core dumped) while playing around with the options: -t 2 --step 5000 --length 5000

./stream -m ./models/ggml-tiny.en.bin -t 2 --step 5000 --length 5000
audio_sdl_init: found 2 capture devices:
audio_sdl_init:    - Capture device #0: 'Built-in Audio'
audio_sdl_init:    - Capture device #1: 'Built-in Audio Analog Stereo'
audio_sdl_init: attempt to open default capture device ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init:     - sample rate:       16000
audio_sdl_init:     - format:            33056 (required: 33056)
audio_sdl_init:     - channels:          1 (required: 1)
audio_sdl_init:     - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

main: processing 80000 samples (step = 5.0 sec / len = 5.0 sec), 2 threads, lang = en, task = transcribe, timestamps = 0 ...

Floating point exception (core dumped)

@trholding
Contributor

But I think this not working on resource-constrained devices should not be a blocker for you. If it works for everyone else, please feel free to close.

@tazz4843
Contributor

tazz4843 commented Oct 11, 2022

I think that floating point exception might be related to #39 as well, which was running on a 4 core AMD64 Linux server, not too resource constrained.

@ggerganov
Owner

The stream example should be updated to detect whether it is able to process the incoming audio stream in real time and print a warning or error if that is not the case. Otherwise, it will behave in an undefined way.

@pachacamac

pachacamac commented Oct 15, 2022

Also mentioning this here since it would be a super cool feature to have: Any way to register a callback or call a script once user speech is completed and silence/non-speak is detected? Been trying to hack on the CPP code but my CPP skills are rusty :(

@ggerganov
Owner

@pachacamac Will think about adding this option. Silence/non-speech detection is not trivial in general, but maybe some simple thresholding approach that works in a quiet environment should not be too difficult to implement.
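For anyone who wants to experiment, here is a minimal sketch of such a thresholding check in C++ (the function name and the example threshold are assumptions, not the VAD used in this repo):

// Sketch: report silence if the RMS energy of the trailing `last_ms` of audio
// (16 kHz mono float PCM) falls below a threshold tuned for a quiet room.
#include <cmath>
#include <vector>

bool is_silence(const std::vector<float> & pcmf32, int sample_rate, int last_ms, float energy_thold) {
    const int n_last = (sample_rate*last_ms)/1000;
    if ((int) pcmf32.size() < n_last) {
        return false; // not enough audio captured yet
    }

    double energy = 0.0;
    for (size_t i = pcmf32.size() - n_last; i < pcmf32.size(); ++i) {
        energy += pcmf32[i]*pcmf32[i];
    }
    energy = std::sqrt(energy/n_last);

    return energy < energy_thold; // e.g. 0.01 - needs tuning per mic/room
}

The stream loop could call something like this on the captured audio and fire the callback / script once it returns true for a short hold-off period.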

@RyanSelesnik

Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running ./stream?

@alexose

alexose commented Oct 29, 2022

Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running ./stream?

I was able to specify the default input device through /etc/asound.conf:

pcm.!default {
  type asym
  playback.pcm {
    type plug
    slave.pcm "hw:0,0"
  }
  capture.pcm {
    type plug
    slave.pcm "hw:1,0"
  }
}

Curious if you have any luck getting real-time transcription to work on a Pi 4. Mine seems to run just a little too slow to give useful results, even with the tiny.en model.

@andres-ramirez-duque

Hi @alexose and @RyanSelesnik:

Have you had any success using the ReSpeaker 4 Mic Array (UAC1.0) to run the stream example on a Raspberry Pi?
My system configuration is:

  • Raspberry Pi 4 Model B Rev 1.4
  • Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-1073-raspi aarch64)

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 500 --length 5000 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init: - Capture device #0: 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB

main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 9

[BLANK_AUDIO]

main: WARNING: cannot process audio fast enough, dropping audio ...


but whisper worked well on prerecorded audio captured with the same ReSpeaker device:

sudo arecord -f S16_LE -d 10 -r 16000 --device="hw:1,0" /tmp/test-mic.wav
./main -m models/ggml-tiny.en.bin -f /tmp/test-mic.wav

Any suggestions on how to test ./stream properly?
Cheers
AR

@alexose

alexose commented Nov 3, 2022

@andres-ramirez-duque I haven't had any luck getting the streaming functionality to run fast enough: aarch64 with a Pi 4B 2GB. I've tried compiling with various flags (-Ofast) and trying various step lengths, thread counts, etc.

I'm not good enough with C++ to know where to start optimizing, but I suspect the comment in PR #23 sheds some light on the issue:

On Arm platforms without __ARM_FEATURE_FP16_VECTOR_ARITHMETIC we convert to 32-bit floats. There might be a more efficient way, but this is good for now.

As well as the notes on optimization from @trholding and @ggerganov above.

@ggerganov
Owner

So the rule of thumb for using the stream example is to first run the bench tool with the model that you want to use. For example:

$ make bench
$ ./bench models/ggml-tiny.en

whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

whisper_print_timings:     load time =   103.94 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   174.70 ms / 43.67 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =   278.77 ms

Note down the encode time. In this case, it is 174 ms.
Your --step parameter for the stream tool should be at least 2x the encode time. So in this case: --step 350.

If the step is smaller, then it is very likely that the processing will be slower than the audio capture and it won't work.

Overall, streaming on Raspberries is a long shot. Maybe when they start supporting FP16 arithmetic (i.e. the ARMv8.2 instruction set) it could make sense.

@alexose

alexose commented Nov 3, 2022

Yes, makes sense.

For those of us trying to make this work on a cheap single-board computer, we'll probably want to use something like a Banana Pi BPI M5 (which is form-factor compatible with the Pi 4 but ships with a Cortex A55).

@userrand

userrand commented Aug 26, 2023

Dear all experts here,

I am able to run the live stream with whisper.cpp on my Raspberry Pi 4 using the following commands:

git clone https://github.com/ggerganov/whisper.cpp
./models/download-ggml-model.sh tiny.en
make -j stream && ./stream -m models/ggml-tiny.en.bin --step 4000 --length 8000 -c 0 -t 4 -ac 512

The question is: is it possible to get those words into a text format that I can process with Python?

Appreciate it if someone can help!

You can maybe use the -f option (if necessary see ./stream --help) to output the transcription to a file.
From there you can use a script to do whatever you want with the output from the file.

If you want to do this continuously with the stream, you could maybe consider using a while loop that applies some function to the file and checks it for updates. I'm not an expert, but ChatGPT told me that on Linux inotify can check for updates on a file.
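For what it's worth, a minimal C++ sketch of that polling loop could look like the following (the liveoutput filename is just an assumed name for the file passed to ./stream -f, and lines are assumed to be appended whole; inotify would avoid the busy waiting):

// Sketch: poll the transcription file written by `./stream -f liveoutput`
// and print any newly appended lines once per second.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    size_t n_printed = 0; // number of lines already handled
    std::string line;

    while (true) {
        std::ifstream in("liveoutput"); // re-open each pass; created by ./stream -f
        size_t n = 0;
        while (std::getline(in, line)) {
            if (++n > n_printed) {
                std::cout << "new transcript line: " << line << std::endl;
                n_printed = n;
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}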

@FelixBrakel

I've managed to compile and run this with the OpenVINO backend. It only required adding one line to initialize the OpenVINO encoder and some additional flags during compilation to build both OpenVINO and stream.

quick and dirty setup:

Add the following line to stream.cpp after initializing the whisper context:

    whisper_ctx_init_openvino_encoder(ctx, nullptr, "GPU", nullptr);

Then make sure you follow all the steps to install OpenVINO, then to build:

mkdir build
cd build
cmake -DWHISPER_OPENVINO=1 -DWHISPER_BUILD_EXAMPLES=1 -DWHISPER_SDL2=1 ..

The -ac flag no longer works with this method, but I am now able to transcribe my meetings with the small.en model running on a laptop with just Intel integrated graphics.

@aehlke

aehlke commented Sep 15, 2023

I found this Swift implementation of streaming: https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/stream.cpp with a Swift file inside a Swift project. It's CC0 licensed.

I couldn't tell if it uses the right config to benefit from the latest Metal/Core ML performance-oriented options.

@dhwkdjwndjwjjn

usage: ./stream [options]

options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
--step N [3000 ] audio step size in milliseconds
--length N [10000 ] audio length in milliseconds
--keep N [200 ] audio to keep from previous step in ms
-c ID, --capture ID [-1 ] capture device ID
-mt N, --max-tokens N [32 ] maximum number of tokens per audio chunk
-ac N, --audio-ctx N [0 ] audio context size (0 - all)
-vth N, --vad-thold N [0.60 ] voice activity detection threshold
-fth N, --freq-thold N [100.00 ] high-pass frequency cutoff
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-ps, --print-special [false ] print special tokens
-kc, --keep-context [false ] keep context between audio chunks
-l LANG, --language LANG [en ] spoken language
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] text output file name
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)

Hi, I am sometimes confused about what each argument does.
Can someone who understands explain what these mean:
-kc, --keep-context [false ] keep context between audio chunks
--keep N [200 ] audio to keep from previous step in ms
-ac N, --audio-ctx N [0 ] audio context size (0 - all)

Or is there a link or file I can read myself to understand those parameters better?
Thank you so much!

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
@andupotorac

(Quoting @trholding's earlier comment above with the silence-removal, rnnoise denoising, sox tempo speed-up, and VAD chunking ideas.)

Were you able to implement any of these ideas? Are there significant performance improvements?

@bestofman

Hi,
I just tried the following command:

make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000

And it works, but the result I get is far behind the demo video. It just gets stuck on the first sentence and tries to update it instead of adding new sentences.

@aehlke

aehlke commented Nov 15, 2023

silero-vad looks best for VAD but I don't know how to port this to Swift yet - onnx and python notebook

edit: found a swift port, https://github.com/tangfuhao/Silero-VAD-for-iOS

@ArmanJR

ArmanJR commented Dec 3, 2023

After making my model using Core ML, I got this error while trying to build stream:

$ make stream
I whisper.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC:       Apple clang version 15.0.0 (clang-1500.0.40.1)
I CXX:      Apple clang version 15.0.0 (clang-1500.0.40.1)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL examples/stream/stream.cpp examples/common.cpp examples/common-ggml.cpp examples/common-sdl.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o stream `sdl2-config --cflags --libs`  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
ld: Undefined symbols:
  _whisper_coreml_encode, referenced from:
      whisper_build_graph_conv(whisper_context&, whisper_state&, int) in whisper.o
  _whisper_coreml_free, referenced from:
      _whisper_free_state in whisper.o
  _whisper_coreml_init, referenced from:
      _whisper_init_state in whisper.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [stream] Error 1

@jpiabrantes

Some improvement on the real-time transcription:

rt_esl_csgo_2.mp4

This is awesome! Would love to have a high-level description of the optimisations that were made.

@artshcherbina

artshcherbina commented Dec 16, 2023

Thank you for your great work!

I've added some simple logic to detect silence and process only real voice input: #1649.

@thiemom

thiemom commented Dec 20, 2023

(Quoting @ArmanJR's Core ML linker error above.)

Try compiling stream with WHISPER_COREML=1

@AimoneAndex

Why does adding the -l zh parameter only output Traditional Chinese, and why does adding the --prompt parameter report an invalid parameter? How can I get Simplified Chinese output? Also, the accuracy of Chinese recognition is not very high and needs to be improved. Thank you! Thank you very much!

@zixiai

zixiai commented Jan 18, 2024

(Quoting @AimoneAndex's question above, originally re-posted in Chinese.)

Add --prompt "简体输出" (i.e. "output in Simplified Chinese").

@AimoneAndex

When using the stream (real-time mode) example, adding --prompt says the parameter is invalid >︿< If needed, I can post the original error message for you to look at when I have time (●'◡'●)

kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024
Link against C++ standard library and macOS Accelerate framework
kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024
Removes things that were added in ggerganov#10 and adds note on Linux builds
@helins

helins commented Feb 15, 2024

stream is really awesome, even though it is labeled as merely "naive". It's not ready to be shared yet, but I have written a Neovim plugin that writes directly to the buffer, updating the line in real time. It feels nothing short of magical.

I was wondering if there is a way to provide some text context to improve inference even further? For instance, the model sometimes struggles to understand some technical terms that I am saying when writing to a buffer. But if I had a way of providing that buffer as context, which might already contain those words or concepts related to them, it might greatly improve those kinds of situations.

@aehlke

aehlke commented Feb 15, 2024

You can prepend context in Whisper; I'm not sure about this stream implementation's options though.
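For a concrete starting point, here is a minimal sketch of what "prepending context" can look like through the C API (recent whisper.cpp versions expose an initial_prompt field in whisper_full_params; the vocabulary string is just an example, and the stream example would still need a small patch to wire this up):

// Sketch: bias decoding with domain vocabulary by passing it as preceding
// context before running a full pass. Assumes an already-initialized context.
#include "whisper.h"

int transcribe_with_hint(struct whisper_context * ctx, const float * pcmf32, int n_samples) {
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // text treated as if it preceded the audio, e.g. the current Neovim buffer
    wparams.initial_prompt = "Neovim, buffer, whisper.cpp, ggml, quantization";

    return whisper_full(ctx, wparams, pcmf32, n_samples);
}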

@helins

helins commented Feb 17, 2024

I am a Whisper noob and furthermore not quite proficient at C++, but any pointers (pun intended) would be greatly appreciated.

@Utopiah

Utopiah commented Mar 10, 2024

@pachacamac wrote (a while ago)

Any way to register a callback or call a script once user speech is completed and silence/non-speak is detected? Been trying to hack on the CPP code but my CPP skills are rusty

and AFAICT this isn't the case yet, so my basic advice is ./stream -f liveoutput to get the result in a text file, then watch cat liveoutput to periodically (here every 2 s) show the result.

A cleaner way to do this could be to rely on inotify to check whether the file has been modified and then act on that. Overall a bit of logic has to be added on top, e.g. is the detected sentence new, or filtering out things like [typing sounds] or (keyboard clicking), but I imagine it's enough to start without having to touch any C++ code.

To get the very last line that might match content: cat liveoutput | grep -v '(' | grep -v ']' | tail -1

@troed

troed commented Mar 13, 2024

Is the -vth parameter working as intended? Just compiling the example and testing with the large-v3 model (GPU, step set to 700 works fine), I get "hallucinations" when not talking. With English there is a constant "thank you" output, and with Swedish I get absolutely hilarious and rather long sentences from the training data.

I tried vth values from 0.1 to 48000 without seeing any difference.

@Toddinuk

When I run the command "./stream -m ./models/ggml-large-v3.bin -t 8 --step 500 --length 5000 -l zh", the parameter "-l zh" does not seem to function properly. If I speak English, it transcribes correctly, but when I speak Chinese, there is no response. Additionally, there are some unknown Chinese characters displayed.
SCR-20240314-swgp

@zaccheus

(Quoting @Toddinuk's report above about -l zh not working with stream.)

I also encountered this problem

@pprobst
Contributor

pprobst commented Apr 8, 2024

(Quoting @troed's question above about the -vth parameter and hallucinations during silence.)

You need to set --step 0. See: https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream#sliding-window-mode-with-vad

@pprobst
Contributor

pprobst commented Apr 9, 2024

I recently came upon this project from watching this video. It looks very good to me, perhaps the best Whisper "streaming" example I've seen.

It seems to be using whisper_streaming under the hood. I wonder how hard it would be to implement their algorithm on top of whisper.cpp?
