Options for Simultaneous Live Diarised Headset Input and Output transcription (on Windows) eg Cubeb instead of SDL2?
Since Whisper is already so awesome, I now want to push it a bit harder and simultaneously record the input and output of a headset to different audio channels to generate a diarised transcription.
I'm only after a Windows solution; a pure Linux or Mac solution would not solve the problem.
I'm happy if the input (to the headset speakers) is downmixed to mono, so only two recording channels would be required.
Is there anyone else who has done/is doing this?
As far as I can tell SDL2 does not offer this capability.
Cubeb (Mozilla's cross-platform audio library) does appear to offer this out of the box (see the untested ChatGPT code snippet below). https://github.com/mozilla/cubeb
I'm not sure why Whisper uses SDL2, or whether there are reasons that shifting to Cubeb would be a poor option.
The cubeb DLL is only 133 KB whereas the SDL2 DLL is 1620 KB, so Cubeb would also seem the more lightweight option; SDL appears to be bulked up with code for building standalone applications.
There are a number of suggestions in other posts, but none of them are optimal:
#377 - Run two transcriptions and merge the text
#961 - This uses ffmpeg to reroute output to input - but isn't merging the streams
#1468 - Install an external loopback
I asked ChatGPT and it suggested the following as a base for a Cubeb option.
To accomplish audio recording with Cubeb, including both microphone input and system audio output (e.g., speakers) with each going to separate channels of a stereo recording, you'll need to use Cubeb's advanced audio stream handling capabilities. This task involves opening two separate streams in Cubeb: one for capturing microphone audio (input stream) and another for capturing system audio output via loopback (output stream). You then mix these two mono streams into a single stereo stream in your application's logic.
First, ensure you have the Cubeb library set up in your project (e.g. with vcpkg):
vcpkg install cubeb
/**
* AudioData structure: manages the microphone and speaker buffers, a mutex for synchronization, and file output.
* Callbacks: append incoming audio data to their respective buffers.
* Mix and output: mixes the microphone and speaker buffers into a stereo buffer and writes it to a file.
*   This function should ideally be called periodically, e.g. from a separate thread, in response to a timer,
*   or once sufficient data has accumulated.
* Main function: initializes Cubeb, sets up the streams, starts them, and handles cleanup.
*/
#include <cubeb/cubeb.h>
#include <algorithm> // std::min (was missing)
#include <atomic>
#include <chrono>    // std::chrono (was missing)
#include <cstring>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct AudioData {
std::vector<float> micBuffer;
std::vector<float> speakerBuffer;
std::mutex bufferMutex;
std::string outputFileName;
std::ofstream outputFile;
std::atomic<bool> keepRunning{ true };
// The output file is opened lazily in writeToFile(), because outputFileName
// is only assigned after construction (see main).
AudioData() = default;
~AudioData() {
if (outputFile.is_open())
outputFile.close();
}
void writeToFile(const std::vector<float>& stereoBuffer) {
if (!outputFile.is_open())
outputFile.open(outputFileName, std::ios::binary); // lazy open: outputFileName is set after construction
if (!outputFile.is_open()) return;
outputFile.write(reinterpret_cast<const char*>(stereoBuffer.data()), stereoBuffer.size() * sizeof(float));
}
};
long microphone_data_callback(cubeb_stream* stream, void* user_data, const void* input_buffer, void* /* output_buffer */, long nframes) {
AudioData* data = static_cast<AudioData*>(user_data);
std::lock_guard<std::mutex> lock(data->bufferMutex);
const float* input = static_cast<const float*>(input_buffer);
data->micBuffer.insert(data->micBuffer.end(), input, input + nframes);
return nframes;
}
long speaker_data_callback(cubeb_stream* stream, void* user_data, const void* input_buffer, void* /* output_buffer */, long nframes) {
// A loopback capture stream delivers the rendered speaker audio via its *input* buffer;
// the output buffer of a capture callback is not valid to read from.
AudioData* data = static_cast<AudioData*>(user_data);
std::lock_guard<std::mutex> lock(data->bufferMutex);
const float* input = static_cast<const float*>(input_buffer);
data->speakerBuffer.insert(data->speakerBuffer.end(), input, input + nframes);
return nframes;
}
void state_callback(cubeb_stream* stream, void* user_data, cubeb_state state) {
std::cout << "Stream state changed to " << state << std::endl;
}
void mix_and_output(AudioData& audioData) {
while (audioData.keepRunning.load()) {
std::vector<float> mixedBuffer;
{
std::lock_guard<std::mutex> lock(audioData.bufferMutex);
auto minSize = std::min(audioData.micBuffer.size(), audioData.speakerBuffer.size());
mixedBuffer.reserve(minSize * 2);
for (size_t i = 0; i < minSize; i++) {
mixedBuffer.push_back(audioData.micBuffer[i]); // Left channel from microphone
mixedBuffer.push_back(audioData.speakerBuffer[i]); // Right channel from speakers
}
audioData.micBuffer.erase(audioData.micBuffer.begin(), audioData.micBuffer.begin() + minSize);
audioData.speakerBuffer.erase(audioData.speakerBuffer.begin(), audioData.speakerBuffer.begin() + minSize);
}
audioData.writeToFile(mixedBuffer);
std::this_thread::sleep_for(std::chrono::milliseconds(100)); // Reduce CPU usage
}
}
void record_audio_with_cubeb(AudioData& audioData) {
cubeb* ctx = nullptr;
cubeb_stream* microphone_stream = nullptr;
cubeb_stream* speaker_stream = nullptr;
if (cubeb_init(&ctx, "Cubeb audio capture example", NULL) != CUBEB_OK) {
std::cerr << "Failed to initialize Cubeb" << std::endl;
return;
}
cubeb_stream_params microphone_params;
microphone_params.format = CUBEB_SAMPLE_FLOAT32NE;
microphone_params.rate = 48000;
microphone_params.channels = 1;
microphone_params.layout = CUBEB_LAYOUT_MONO;
microphone_params.prefs = CUBEB_STREAM_PREF_NONE;
cubeb_stream_params speaker_params;
speaker_params.format = CUBEB_SAMPLE_FLOAT32NE;
speaker_params.rate = 48000;
speaker_params.channels = 1;
speaker_params.layout = CUBEB_LAYOUT_MONO;
speaker_params.prefs = CUBEB_STREAM_PREF_LOOPBACK; // WASAPI loopback: capture what is being rendered to the speakers
uint32_t latency_frames = 256; // set latency frames to a reasonable default
if (cubeb_stream_init(ctx, &microphone_stream, "Microphone stream", NULL, &microphone_params, NULL,
NULL, latency_frames, microphone_data_callback, state_callback, &audioData) != CUBEB_OK) {
std::cerr << "Failed to open microphone stream" << std::endl;
cubeb_destroy(ctx);
return;
}
// Open the speaker stream as an *input* stream: with CUBEB_STREAM_PREF_LOOPBACK set on its
// params, this captures the audio being played to the speakers instead of rendering anything.
if (cubeb_stream_init(ctx, &speaker_stream, "Speaker loopback stream", NULL, &speaker_params, NULL,
NULL, latency_frames, speaker_data_callback, state_callback, &audioData) != CUBEB_OK) {
std::cerr << "Failed to open speaker stream" << std::endl;
cubeb_stream_destroy(microphone_stream);
cubeb_destroy(ctx);
return;
}
cubeb_stream_start(microphone_stream);
cubeb_stream_start(speaker_stream);
// Wait for an external signal to stop the recording, e.g. setting audioData.keepRunning to false
while (audioData.keepRunning.load()) {
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
cubeb_stream_stop(microphone_stream);
cubeb_stream_stop(speaker_stream);
cubeb_stream_destroy(microphone_stream);
cubeb_stream_destroy(speaker_stream);
cubeb_destroy(ctx);
}
int main() {
AudioData audioData;
audioData.outputFileName = "stereo_mix.raw";
std::thread audioThread(record_audio_with_cubeb, std::ref(audioData));
std::thread mixingThread(mix_and_output, std::ref(audioData));
std::this_thread::sleep_for(std::chrono::seconds(10)); // Control the duration of recording
audioData.keepRunning = false;
audioThread.join();
mixingThread.join();
return 0;
}