Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows) eg Cubeb instead of SDL2? #2062

dgm3333 · 2024-04-16T05:46:31Z

Since Whisper is already so awesome, I now want to push it a bit harder: simultaneously record the input and output of a headset to separate audio channels, and generate a diarised transcription from them.
I'm only after a Windows solution; a pure Linux or Mac solution would not solve the problem.
I'm happy for the audio going to the headset speakers to be downmixed to mono, so only two recording channels would be required.

Is there anyone else who has done/is doing this?

As far as I can tell SDL2 does not offer this capability.

Cubeb (Mozilla's cross-platform audio library) does appear to be able to do this out of the box (see the untested ChatGPT code snippet below).
https://github.com/mozilla/cubeb

I'm not sure why Whisper uses SDL2. Are there reasons why shifting to Cubeb would be a poor option?
The Cubeb DLL is only 133 kB whereas the SDL2 DLL is 1620 kB, so Cubeb would also seem to be the more lightweight option; SDL2 seems to be bulked up with code for building standalone applications.

There are a number of suggestions in other posts, but none of them is ideal:
#377 - Run two transcriptions and merge the text
#961 - This uses ffmpeg to reroute output to input, but doesn't merge the streams
#1468 - Install an external loopback
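
For reference, the "run two transcriptions and merge the text" approach from #377 could look roughly like the sketch below. It assumes each channel has already been transcribed separately into timestamped segments; the `Segment` struct and both helper functions are hypothetical names for illustration, not part of whisper.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical segment type: one transcribed phrase with start/end times in milliseconds.
struct Segment {
    long long t0_ms;
    long long t1_ms;
    std::string speaker;  // label assigned per channel, e.g. "me" or "them"
    std::string text;
};

// Tag every segment of one channel with the same speaker label.
std::vector<Segment> label_channel(std::vector<Segment> segments, const std::string& speaker) {
    for (auto& s : segments) {
        assert(s.t0_ms <= s.t1_ms);  // a segment must not end before it starts
        s.speaker = speaker;
    }
    return segments;
}

// Combine two per-channel transcripts into a single timeline ordered by start time.
// stable_sort keeps the first channel ahead of the second when start times are equal.
std::vector<Segment> merge_transcripts(const std::vector<Segment>& a, const std::vector<Segment>& b) {
    std::vector<Segment> merged;
    merged.reserve(a.size() + b.size());
    merged.insert(merged.end(), a.begin(), a.end());
    merged.insert(merged.end(), b.begin(), b.end());
    std::stable_sort(merged.begin(), merged.end(),
                     [](const Segment& x, const Segment& y) { return x.t0_ms < y.t0_ms; });
    return merged;
}
```

Calling `merge_transcripts(label_channel(mic, "me"), label_channel(speakers, "them"))` would give one speaker-labelled transcript. The drawback remains that the two recordings need a common time base, which is why capturing both sources together is preferable.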

I asked ChatGPT and it suggested the following as a base for a Cubeb option.

To record both microphone input and system audio output (e.g. speakers) with Cubeb, with each going to a separate channel of a stereo recording, you open two separate capture streams: one for the microphone, and one for the system audio output via loopback. In Cubeb a loopback stream is an input stream whose parameters carry the CUBEB_STREAM_PREF_LOOPBACK preference, not an output stream. Your application then interleaves these two mono streams into a single stereo stream.

First, ensure you have the Cubeb library set up in your project (e.g. with vcpkg):
vcpkg install cubeb

/*
* AudioData Structure: Manages microphone and speaker buffers, mutex for synchronization, and file output.
* Callbacks: Append incoming audio data to respective buffers.
* Mix and Output: Mixes the microphone and speaker buffers into a stereo buffer and writes it to a file. This function should ideally be called periodically, perhaps from a separate thread or in response to a timer or sufficient buffer accumulation.
* Main Function: Initializes Cubeb, sets up the streams, starts them, and handles cleanup.
*/


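#include <cubeb/cubeb.h>
#include <algorithm>   // std::min
#include <atomic>
#include <chrono>      // std::chrono::milliseconds, std::chrono::seconds
#include <cstdint>     // uint32_t
#include <cstring>
#include <fstream>
#include <functional>  // std::ref
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>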

struct AudioData {
    std::vector<float> micBuffer;
    std::vector<float> speakerBuffer;
    std::mutex bufferMutex;
    std::string outputFileName;
    std::ofstream outputFile;
    std::atomic<bool> keepRunning{ true };

    // The file is opened lazily in writeToFile(), because outputFileName is only set after construction.
    AudioData() = default;

    ~AudioData() {
        if (outputFile.is_open())
            outputFile.close();
    }

    void writeToFile(const std::vector<float>& stereoBuffer) {
        if (stereoBuffer.empty()) return;
        if (!outputFile.is_open()) {
            outputFile.open(outputFileName, std::ios::binary);  // Open binary file for raw audio data
            if (!outputFile.is_open()) return;
        }
        outputFile.write(reinterpret_cast<const char*>(stereoBuffer.data()), stereoBuffer.size() * sizeof(float));
    }
};

// Capture callback for the microphone stream: appends nframes mono samples to micBuffer.
long microphone_data_callback(cubeb_stream* /* stream */, void* user_data, const void* input_buffer, void* /* output_buffer */, long nframes) {
    AudioData* data = static_cast<AudioData*>(user_data);
    const float* input = static_cast<const float*>(input_buffer);
    if (input == nullptr || nframes <= 0) return nframes;  // Nothing captured in this callback
    std::lock_guard<std::mutex> lock(data->bufferMutex);
    data->micBuffer.insert(data->micBuffer.end(), input, input + nframes);
    return nframes;
}

// Capture callback for the loopback stream. Loopback is a capture (input) stream,
// so the system audio arrives in input_buffer and output_buffer is unused.
long speaker_data_callback(cubeb_stream* /* stream */, void* user_data, const void* input_buffer, void* /* output_buffer */, long nframes) {
    AudioData* data = static_cast<AudioData*>(user_data);
    const float* input = static_cast<const float*>(input_buffer);
    if (input == nullptr || nframes <= 0) return nframes;  // Nothing captured in this callback
    std::lock_guard<std::mutex> lock(data->bufferMutex);
    data->speakerBuffer.insert(data->speakerBuffer.end(), input, input + nframes);
    return nframes;
}

void state_callback(cubeb_stream* /* stream */, void* /* user_data */, cubeb_state state) {
    std::cout << "Stream state changed to " << state << std::endl;
}

void mix_and_output(AudioData& audioData) {
    bool running = true;
    while (running) {
        // Read the flag first, so one final pass still drains samples that arrived before the stop.
        running = audioData.keepRunning.load();
        std::vector<float> mixedBuffer;
        {
            std::lock_guard<std::mutex> lock(audioData.bufferMutex);
            auto minSize = std::min(audioData.micBuffer.size(), audioData.speakerBuffer.size());
            mixedBuffer.reserve(minSize * 2);

            for (size_t i = 0; i < minSize; i++) {
                mixedBuffer.push_back(audioData.micBuffer[i]);  // Left channel from microphone
                mixedBuffer.push_back(audioData.speakerBuffer[i]);  // Right channel from speakers
            }

            audioData.micBuffer.erase(audioData.micBuffer.begin(), audioData.micBuffer.begin() + minSize);
            audioData.speakerBuffer.erase(audioData.speakerBuffer.begin(), audioData.speakerBuffer.begin() + minSize);
        }

        audioData.writeToFile(mixedBuffer);
        if (running)
            std::this_thread::sleep_for(std::chrono::milliseconds(100));  // Reduce CPU usage
    }
}

void record_audio_with_cubeb(AudioData& audioData) {
    cubeb* ctx = nullptr;
    cubeb_stream* microphone_stream = nullptr;
    cubeb_stream* speaker_stream = nullptr;

    if (cubeb_init(&ctx, "Cubeb audio capture example", NULL) != CUBEB_OK) {
        std::cerr << "Failed to initialize Cubeb" << std::endl;
        return;
    }

    cubeb_stream_params microphone_params = {};  // Zero-initialise so no field is left indeterminate
    microphone_params.format = CUBEB_SAMPLE_FLOAT32NE;
    microphone_params.rate = 48000;
    microphone_params.channels = 1;
    microphone_params.layout = CUBEB_LAYOUT_MONO;
    microphone_params.prefs = CUBEB_STREAM_PREF_NONE;

    // Loopback capture is requested through the *input* params.
    // Assumption: the backend in use supports CUBEB_STREAM_PREF_LOOPBACK and can deliver mono (untested).
    cubeb_stream_params speaker_params = {};
    speaker_params.format = CUBEB_SAMPLE_FLOAT32NE;
    speaker_params.rate = 48000;
    speaker_params.channels = 1;
    speaker_params.layout = CUBEB_LAYOUT_MONO;
    speaker_params.prefs = CUBEB_STREAM_PREF_LOOPBACK;

    uint32_t latency_frames = 256;  // Requested latency in frames; the backend may choose a different value

    // Input-only stream on the default capture device
    if (cubeb_stream_init(ctx, &microphone_stream, "Microphone stream", NULL, &microphone_params, NULL,
        NULL, latency_frames, microphone_data_callback, state_callback, &audioData) != CUBEB_OK) {
        std::cerr << "Failed to open microphone stream" << std::endl;
        cubeb_destroy(ctx);
        return;
    }

    // Input-only loopback stream; NULL in the input device slot selects the default output device
    if (cubeb_stream_init(ctx, &speaker_stream, "Speaker loopback stream", NULL, &speaker_params, NULL,
        NULL, latency_frames, speaker_data_callback, state_callback, &audioData) != CUBEB_OK) {
        std::cerr << "Failed to open speaker loopback stream" << std::endl;
        cubeb_stream_destroy(microphone_stream);
        cubeb_destroy(ctx);
        return;
    }

    if (cubeb_stream_start(microphone_stream) != CUBEB_OK || cubeb_stream_start(speaker_stream) != CUBEB_OK) {
        std::cerr << "Failed to start the capture streams" << std::endl;
    } else {
        // Wait for an external signal to stop the recording, e.g., setting audioData.keepRunning to false
        while (audioData.keepRunning.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }

    cubeb_stream_stop(microphone_stream);
    cubeb_stream_stop(speaker_stream);
    cubeb_stream_destroy(microphone_stream);
    cubeb_stream_destroy(speaker_stream);
    cubeb_destroy(ctx);
}

int main() {
    AudioData audioData;
    audioData.outputFileName = "stereo_mix.raw";

    std::thread audioThread(record_audio_with_cubeb, std::ref(audioData));
    std::thread mixingThread(mix_and_output, std::ref(audioData));

    std::this_thread::sleep_for(std::chrono::seconds(10));  // Control the duration of recording
    audioData.keepRunning = false;

    audioThread.join();
    mixingThread.join();
    return 0;
}
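
The recording above is 48 kHz interleaved stereo float32 (microphone on the left, speakers on the right), whereas whisper.cpp works on 16 kHz audio. The sketch below shows one way to split the channels and bring each down to 16 kHz before transcription. It is a rough illustration only: `StereoSplit`, `deinterleave` and `decimate` are hypothetical helpers, and averaging groups of three samples is a crude box filter that a proper resampler would improve on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two mono buffers recovered from one interleaved stereo buffer.
struct StereoSplit {
    std::vector<float> left;   // microphone channel
    std::vector<float> right;  // speaker (loopback) channel
};

// Split interleaved stereo samples (L R L R ...) into separate mono buffers.
// A trailing unpaired sample, if any, is ignored.
StereoSplit deinterleave(const std::vector<float>& stereo) {
    StereoSplit out;
    const std::size_t frames = stereo.size() / 2;
    out.left.reserve(frames);
    out.right.reserve(frames);
    for (std::size_t i = 0; i < frames; ++i) {
        out.left.push_back(stereo[2 * i]);
        out.right.push_back(stereo[2 * i + 1]);
    }
    return out;
}

// Reduce the sample rate by an integer factor by averaging each group of
// `factor` samples, e.g. factor 3 turns 48 kHz into 16 kHz.
// Samples left over at the end that do not fill a whole group are dropped.
std::vector<float> decimate(const std::vector<float>& in, std::size_t factor) {
    assert(factor > 0);
    std::vector<float> out;
    out.reserve(in.size() / factor);
    for (std::size_t i = 0; i + factor <= in.size(); i += factor) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < factor; ++j) {
            sum += in[i + j];
        }
        out.push_back(sum / static_cast<float>(factor));
    }
    return out;
}
```

After reading stereo_mix.raw into a `std::vector<float>`, `decimate(deinterleave(samples).left, 3)` would give the 16 kHz microphone channel and the same call on `.right` the speaker channel, so each can be transcribed and labelled separately.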

dgm3333 changed the title ~~Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows)?~~ Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows) using Cubeb (or alternative audio option)? Apr 16, 2024

dgm3333 changed the title ~~Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows) using Cubeb (or alternative audio option)?~~ Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows) eg Cubeb instead of SDL2? Apr 16, 2024
