A server rendered chatbot client for responding to audio streams


Conrad-X/openai-voice-to-voice-bot


About

This project is a server-side rendered Node application that serves multiple variations of voice-to-voice functionality, acting as a server that responds to clients in an audio format. Responses are generated using OpenAI chat completion with response streaming. The following shows the breakdown of the directory structure for the repository:

.
├── files                              # (Not for official use) collection of server rendered file formats
├── public                             # Server rendered page is served through this directory
    ├── templates
        ├── file 
            ├── index.html
        ├── sockets
            ├── index.html
        ├── default          
            ├── index.html
    ├── uploads                         # Used to store the user voice files
├── scripts                             
    ├── test                            # (For testing purpose only) scripts to test third party libraries
├── package.json         
├── servers 
    ├── file-based-response-server.js   # 1st variation of the server, using file-written audio responses
    ├── sockets-server.js               # 2nd variation of the server, using sockets for audio communication
└── README.md

Getting Started

The servers folder contains multiple variations of approaches to a voice-to-voice bot; the average times recorded for several conversations with each approach have also been recorded to create a comparison. This repository uses OpenAI tools for transcription, response generation, and text-to-speech synthesis. It uses the RecordRTC utility library to record audio both as complete files and as audio chunks. The first iteration of the voice-to-voice communication is depicted in the diagram below.

The numbered circles show the four stages of processing required to achieve the results, and going through each process sequentially can cause a lot of latency (approximately 15-20 seconds on average). To introduce parallelism across these four processes, multiple experiments were conducted with the approaches listed below. Follow the steps below to run the application with each approach.

  • Create a .env file containing the following details

    OPENAI_API_KEY = "XXXXXXXXXXXXXX"
    
  • Install Dependencies

    npm install
    
  • Run the following commands

    • To run File based response server

      npm run start:file
      
    • To run Sockets based response server

      npm run start:socket
      
    • To run Event based response server

      npm run start:sse
      
    • The server runs at http://localhost:8000; open it in a browser
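The configuration step above can be sketched with a small startup check; `getOpenAIKey` is a hypothetical helper, not code from this repository, and it assumes the `.env` file has already been loaded into `process.env` (for example via the `dotenv` package):

```javascript
// Hypothetical helper: read and validate the OpenAI key before the
// server starts, failing fast with a clear message when it is missing.
function getOpenAIKey(env = process.env) {
  const key = env.OPENAI_API_KEY;
  if (!key || key.trim() === "") {
    throw new Error("OPENAI_API_KEY is not set; add it to your .env file");
  }
  return key.trim();
}
```

Calling a check like this once at startup keeps a missing key from surfacing later as an opaque API error mid-conversation.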

File Based Response Server

The OpenAI API can create an audio buffer and also store that buffer in a file. The idea of this approach is to transfer the user's audio chunks as form data or a file. The transcription step processes the file, converts it to text, and generates a streamed response through OpenAI. The stream chunks are gathered into sentences and sent to the speak method; the generated audio is saved to an audio file on the server, which is played on the client as the audio data continues to stream in. The following diagram depicts the workflow.

Pros

  • No audio buffer management on client side or server side.
  • The audio plays seamlessly, as it is served through a file on the server.
  • Relatively fast

Cons

  • I/O overhead at scale
  • The infrastructure would require an encrypted file system to keep the stored audio and communication secure.
  • An archiving process would be required to manage the files.
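The sentence-gathering step described above can be sketched as follows; `createSentenceGatherer` is a hypothetical helper, not code from the repository. It accumulates streamed text deltas and emits a complete sentence whenever terminal punctuation appears, so each sentence can be handed to speech synthesis while the rest of the response is still streaming:

```javascript
// Hypothetical sketch: accumulate streamed chat-completion deltas and
// emit complete sentences for text-to-speech as soon as they are ready.
function createSentenceGatherer() {
  let buffer = "";
  return function gather(delta) {
    buffer += delta;
    const sentences = [];
    let end;
    // Split on sentence-terminating punctuation followed by whitespace
    // or the end of the buffer.
    while ((end = buffer.search(/[.!?](\s|$)/)) !== -1) {
      sentences.push(buffer.slice(0, end + 1).trim());
      buffer = buffer.slice(end + 2);
    }
    return sentences; // complete sentences ready to be spoken
  };
}
```

This naive split would break early on abbreviations such as "Mr."; a real implementation would need smarter sentence segmentation.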

Sockets Based Server

This approach uses full-duplex communication between the client and server through sockets. Recording chunks are sent to the server at 3-second intervals instead of sending a complete recording file. As the chunks are received, they are transcribed and consolidated on the server; once the last chunk is received, the consolidated text is immediately sent for response generation, which saves the time of transcribing one large recording at the end. The generated response is streamed just as in the previous section and passed to the speaking utility, which generates audio response buffers that are sent back through the socket. The following diagram depicts the workflow.

Pros

  • Audio buffer management required on client and server side
  • Saves time on transcribing large audio files
  • Full duplex communication between server and client, less API calling overhead
  • Response not saved on files
  • Fast

Cons

  • Uses sockets, which can cause scaling problems
  • Audio buffer management on the client side can get tricky
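The chunk consolidation described above can be sketched as below; the names are hypothetical and not taken from the repository. Each ~3-second chunk's transcript is stored under its sequence number, and the full utterance is assembled once the client signals the final chunk:

```javascript
// Hypothetical sketch: collect per-chunk transcripts keyed by sequence
// number and assemble the full utterance when the last chunk arrives.
function createTranscriptConsolidator() {
  const parts = new Map();
  return {
    // Called as each ~3 s audio chunk finishes transcription; chunks may
    // finish out of order, so each is stored under its sequence number.
    add(seq, text) {
      parts.set(seq, text.trim());
    },
    // Called when the client signals the final chunk; returns the full
    // utterance, ready for response generation.
    finish() {
      const ordered = [...parts.keys()].sort((a, b) => a - b);
      return ordered.map((k) => parts.get(k)).filter(Boolean).join(" ");
    },
  };
}
```

Keying by sequence number rather than arrival order is what keeps the consolidated text correct when a later chunk's transcription finishes first.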

Server-Sent-Events Based Server

The voice bot is an audio streaming solution; hence, using server-sent events instead of sockets to transfer the audio buffers emerges as a scalable alternative. The audio data is received as form data at the server, where it is transcribed; the chunks of transcribed text are brought together to generate audio buffers, which are transferred to the client through server-sent events. The workflow of this solution is depicted in the diagram below.

Pros

  • Audio buffer management required on client side
  • Solution can scale with the underlying instances
  • Response not saved on files

Cons

  • Audio buffer management is required on the client side
  • The audio buffer sequence is often compromised
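One way to frame an audio buffer as a server-sent event is sketched below, following the standard SSE wire format; `formatAudioEvent` is a hypothetical name, not code from the repository. Base64 encoding keeps the binary payload inside SSE's text-only protocol, and carrying a sequence number in the `id` field gives the client what it needs to reorder buffers, which bears on the sequencing con noted above:

```javascript
// Hypothetical sketch: frame an audio buffer as a server-sent event.
// The `id` field carries a sequence number (exposed to the client as
// `lastEventId` on the EventSource message) so out-of-order buffers
// can be reordered before playback.
function formatAudioEvent(seq, audioBuffer) {
  const payload = Buffer.from(audioBuffer).toString("base64");
  return `id: ${seq}\nevent: audio\ndata: ${payload}\n\n`;
}
```

On the client, an `EventSource` listener for the `audio` event would decode `event.data` from base64 and queue the buffer by `event.lastEventId`.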

Experiment Readings

All these formats have been tested over 50 conversations, and their times were recorded to get a rough estimate of elapsed time for variable-length user queries. The following table shows the readings:

Server Format    Average Time (seconds)
File based       5.9
Sockets based    4.4
Event based      6.4

Color Palette

Link To Color Palette


Screenshots

The following shows a screenshot of the server-rendered web page in action.