Bidirectional gRPC streaming (async) transcription #1833

Open
wants to merge 8 commits into
base: master
from

Conversation

shanelenagh
@shanelenagh shanelenagh commented Feb 4, 2024

While experimenting with a potential use case for whisper.cpp in a Kubernetes container, I found it expedient to implement a gRPC interface to whisper.cpp, specifically to its streaming example. This version uses gRPC bidirectional streams based on a simple protobuf definition that I came up with. This allows whisper.cpp to be called as a real-time service from a variety of client environments over the fairly efficient and flexible gRPC network protocol. I didn't include them in this PR, but I have written two Python clients to demonstrate the "client" side of this gRPC interaction (this stream.grpc code being the "server"): a CLI client and a minimal web client (using websockets). If there is interest in including them in this PR, I could add them as a subdirectory as well:

https://github.com/shanelenagh/whisper.cpp/tree/stream.grpc/examples/stream.grpc/python-test-clients

I don't claim to be a C++ expert at all; I started programming with C in 1995, but I have spent 90% of my career as a developer in Java, so there are no doubt ways the code can be improved. But given that this is just an "example," and that it appears to work in my use case (I'm sure there is a bug or two lurking), I thought it safe to submit the PR in its current state. Let me know if you think the broader community might find this useful, or if you need additional information (screencasts of usage, etc.). You do need the C++ version of gRPC installed to build it.

@shanelenagh
Author

README.md added

@shanelenagh shanelenagh changed the title from "New example: Bidirectional gRPC streaming (async) transcription" to "Bidirectional gRPC streaming (async) transcription" Feb 4, 2024
shanelenagh and others added 5 commits February 5, 2024 10:16
…size decrement from head of audio sample timestamp, and decreased diffs from existing stream.cpp
…is not the mic/SDL version (obviously much of this code was copied from that, intentionally--perhaps someday these can all have a common stream.cpp class and swap 'implementations' of the asynch_audio classes that implement a common interface)
…dio once 'resume' is called); 2ms is consistent with write wait time as well
@ggerganov
Owner

Thank you for this example - I'm not familiar with gRPC, but it does seem useful!

I'm not sure we can merge this, however - it brings in a few third-party dependencies (protobuf, gRPC). I try to keep the dependencies in the project to a minimum; otherwise it becomes difficult to maintain, and it also means less portability.

There is a pending PR #1823 with a stdin streaming example which I am more interested in merging. Similar to this PR, it also builds on top of the existing stream example. I want to see whether it finds a way to reuse the stream code instead of copy-pasting it, and if it does, maybe the same strategy can be applied here as well.

But still, the dependencies might remain a problem. I guess this example could live in its own repository. If you decide to do it like that, I will happily reference it from here.

@shanelenagh
Author

shanelenagh commented Feb 6, 2024

Fair enough. The zero-dependency aspect of whisper.cpp is one of the things I love most about it (the speed and accuracy being awesome as well, of course). I structured this change the same way the original stream example is set up: if you don't set the variable, the example is not built and the dependency is not required. And yes, it similarly requires a separate library (SDL in the case of the original, and gRPC in this one; to be clear, protobuf is part of gRPC, so that is a single dependency). I guess OpenVINO and cuBLAS are somewhat similar as well (as an aside, I am really glad OpenVINO was added as an option, as I see tremendous improvements on the hardware we are using), but those are more core to what the library does, I suppose. I agree that it would be great to have a common "stream.cpp" and just be able to swap out "audio_async" implementations that implement a common "interface" (I know C++ has something similar to Java interfaces, but I am not that familiar with it)--that would be a dream. I could potentially work on that.

I have my original work in a fork of this repository, and I will likely just keep it there, with this repository as upstream, unless someone else sees some utility in a "separate" project and convinces me that it is worth doing. I want to be able to rebase on whisper.cpp frequently to get the improvements being made, so that seems to be the easiest way. And I want it to be known that I am definitely standing on the shoulders of giants like you and the other contributors for the core value of the tool. :-) Thank you so much for the wonderful work you and the team have done on this.

@shanelenagh
Author

I just took a look at the stdin PR, and yeah, we both seem to be wrestling with the duplication issue. I wonder if an "abstract" class (I had to look up the C++ parlance) for audio_async could be defined for anyone wanting to supply an "audio provider" for the stream service. Even the concrete implementation of most of the audio_async methods in our versions is 90% the same as the original SDL version--those could be implemented once, with a pure virtual "get_bytes" method implemented by the concrete provider. Then the stream.cpp code could be the same as well (say, by calling a factory method that takes the params struct, with a new param specifying the "audio input interface": SDL, stdin, etc.). I would love to take a stab at refactoring that out, if I can get a bit more time.
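
A minimal C++ sketch of the idea above, under stated assumptions: the shared circular-buffer logic lives once in a base class, and a concrete provider supplies only a pure virtual hook. The names `audio_provider_base`, `pump`, `get_samples` (standing in for the "get_bytes" method mentioned above), and `counting_provider` are hypothetical and do not come from whisper.cpp or this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical base class: owns the circular buffer and the locking,
// and leaves only the audio source to the concrete provider.
class audio_provider_base {
public:
    audio_provider_base(int len_ms, int sample_rate)
        : m_sample_rate(sample_rate), m_audio(ring_size(len_ms, sample_rate), 0.0f) {}
    virtual ~audio_provider_base() = default;

    // Pull up to max_samples from the provider and store them in the ring.
    // Returns the number of samples actually stored.
    size_t pump(size_t max_samples) {
        std::vector<float> chunk(max_samples);
        const size_t n = get_samples(chunk.data(), max_samples);
        std::lock_guard<std::mutex> lock(m_mutex);
        for (size_t i = 0; i < n; ++i) {
            m_audio[m_pos] = chunk[i];
            m_pos = (m_pos + 1) % m_audio.size();
        }
        m_len = std::min(m_len + n, m_audio.size());
        return n;
    }

    // Copy the most recent `ms` milliseconds (or fewer, if not yet
    // available) into `out`, oldest sample first.
    void get(int ms, std::vector<float> & out) {
        std::lock_guard<std::mutex> lock(m_mutex);
        const size_t want = static_cast<size_t>(m_sample_rate) * static_cast<size_t>(ms) / 1000;
        const size_t n = std::min(want, m_len);
        out.resize(n);
        const size_t start = (m_pos + m_audio.size() - n) % m_audio.size();
        for (size_t i = 0; i < n; ++i) {
            out[i] = m_audio[(start + i) % m_audio.size()];
        }
    }

protected:
    // The one method a concrete provider (SDL, stdin, gRPC, ...) must implement:
    // write up to max_samples into dst and return how many were written.
    virtual size_t get_samples(float * dst, size_t max_samples) = 0;

private:
    static size_t ring_size(int len_ms, int sample_rate) {
        assert(len_ms > 0 && sample_rate > 0);
        const size_t n = static_cast<size_t>(sample_rate) * static_cast<size_t>(len_ms) / 1000;
        assert(n > 0);
        return n;
    }

    int                m_sample_rate;
    std::vector<float> m_audio;   // circular buffer
    size_t             m_pos = 0; // next write position
    size_t             m_len = 0; // number of valid samples
    std::mutex         m_mutex;
};

// Toy provider for illustration only: emits 0, 1, 2, ... as "audio".
class counting_provider : public audio_provider_base {
public:
    using audio_provider_base::audio_provider_base;

protected:
    size_t get_samples(float * dst, size_t max_samples) override {
        for (size_t i = 0; i < max_samples; ++i) {
            dst[i] = static_cast<float>(m_next++);
        }
        return max_samples;
    }

private:
    int m_next = 0;
};
```

For example, with `counting_provider p(1000, 8)` (a one-second ring at a toy 8 Hz sample rate), calling `p.pump(5)` twice and then `p.get(500, out)` leaves the four most recent samples, 6 through 9, in `out`.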

@shanelenagh
Author

shanelenagh commented Feb 7, 2024

I took the stdin version (copied @regularfry's branch to my fork), and I am starting to refactor the existing stream program to use the following abstract class, which I created in a new "common-asyncaudio.h". It compiles and runs, but that's as far as I have gotten so far. The next steps are to put the common code in that parent class, move the stdin version to use that abstract(ish) class, add the param for the input source, add the factory method that dynamically returns the concrete instance specified by the param, etc.: shanelenagh@9c1a270

#pragma once


#include <algorithm> // std::min
#include <atomic>
#include <cstdint>
#include <mutex>
#include <string>    // std::string
#include <thread>
#include <vector>

// command-line parameters
struct whisper_params {
    int32_t n_threads  = std::min(4, (int32_t) std::thread::hardware_concurrency());
    int32_t step_ms    = 3000;
    int32_t length_ms  = 10000;
    int32_t keep_ms    = 200;
    int32_t capture_id = -1;
    int32_t max_tokens = 32;
    int32_t audio_ctx  = 0;

    float vad_thold    = 0.6f;
    float freq_thold   = 100.0f;

    bool speed_up      = false;
    bool translate     = false;
    bool no_fallback   = false;
    bool print_special = false;
    bool no_context    = true;
    bool no_timestamps = false;
    bool tinydiarize   = false;
    bool save_audio    = false; // save audio to wav file
    bool use_gpu       = true;

    std::string language  = "en";
    std::string model     = "models/ggml-base.en.bin";
    std::string fname_out;
};

//
// Abstract interface for audio capture
//
class audio_async {
public:
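    audio_async(int /* len_ms */) { }
    // virtual, so that deleting a derived object through an audio_async pointer is well-defined
    virtual ~audio_async() { }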

    virtual bool init(whisper_params params, int sample_rate) = 0;

    virtual bool resume() = 0;
    virtual bool pause() = 0;
    virtual bool clear() = 0;

    // get audio data from the circular buffer
    virtual void get(int ms, std::vector<float> & audio) = 0;

    // callback to be called by audio source
    virtual void callback(uint8_t * stream, int len) = 0;
};

and then the SDL one implements that interface (snippet of modified common-sdl.h):

//
// SDL Audio capture
//
class audio_async_sdl : public audio_async {
public:
    audio_async_sdl(int len_ms);
    ~audio_async_sdl();

    bool init(whisper_params params, int sample_rate) override;

    // start capturing audio via the provided SDL callback
    // keep the last len_ms milliseconds of audio in a circular buffer
    bool resume() override;
    bool pause() override;
    bool clear() override;

    // callback to be called by SDL
    void callback(uint8_t * stream, int len) override;

    // get audio data from the circular buffer
    void get(int ms, std::vector<float> & audio) override;

private:
    SDL_AudioDeviceID m_dev_id_in = 0;
    // etc...
};

@shanelenagh
Author

Been taken away from the project for a few days by work and home life, but I just factored out the common code into the abstract class--now just the SDL code remains in the subclass (next step is to make the stdin version do the same, and create the factory method that creates the right version based on a new CLI param): shanelenagh@2a2307b
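
The factory step described above might look roughly like the following sketch. Everything here is illustrative: `audio_source` is a trimmed-down stand-in for the abstract class, `source_params::audio_input` is a hypothetical new parameter, and the two concrete classes are stubs that do not touch SDL or stdin.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, trimmed-down stand-in for the abstract audio_async interface.
class audio_source {
public:
    virtual ~audio_source() = default;
    virtual const char * name() const = 0;
    virtual bool init(int sample_rate) = 0;
};

// Stub: a real version would open an SDL capture device in init().
class sdl_source : public audio_source {
public:
    const char * name() const override { return "sdl"; }
    bool init(int /* sample_rate */) override { return true; }
};

// Stub: a real version would start reading samples from stdin in init().
class stdin_source : public audio_source {
public:
    const char * name() const override { return "stdin"; }
    bool init(int /* sample_rate */) override { return true; }
};

// Hypothetical parameter selecting the audio input; "sdl" keeps the
// existing behaviour as the default.
struct source_params {
    std::string audio_input = "sdl";
};

// Factory: maps the parameter to a concrete implementation.
// Returns nullptr for an unknown input name.
std::unique_ptr<audio_source> make_audio_source(const source_params & params) {
    if (params.audio_input == "sdl") {
        return std::unique_ptr<audio_source>(new sdl_source());
    }
    if (params.audio_input == "stdin") {
        return std::unique_ptr<audio_source>(new stdin_source());
    }
    return nullptr;
}

// Convenience wrapper: create and initialise in one step.
// Returns nullptr if the name is unknown or init() fails.
std::unique_ptr<audio_source> open_audio_source(const source_params & params, int sample_rate) {
    assert(sample_rate > 0);
    std::unique_ptr<audio_source> src = make_audio_source(params);
    if (src && !src->init(sample_rate)) {
        src.reset();
    }
    return src;
}
```

With this shape, the streaming loop only ever holds a `std::unique_ptr<audio_source>` and never names a concrete class, so adding another input means adding one subclass and one branch in the factory.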

2 participants