
Voice Cloning using coqui-TTS

Fine-tuned coqui-TTS:

Voice cloning converts text input into natural, expressive synthetic speech in a target speaker's voice using a pre-trained Text-to-Speech (TTS) model. In this project, we fine-tuned the coqui-TTS model, whose pipeline involves two main stages: text and audio preprocessing, and acoustic model training. The text data undergoes tokenization and normalization, while the audio data is converted into Mel-frequency cepstral coefficients (MFCCs) or spectrograms. The acoustic model, typically a neural network such as an RNN or a transformer, learns the mapping between text representations and acoustic features. Finally, the vocoder synthesizes the acoustic features into high-quality waveforms, generating the desired speech output.

graph TD;
    Text2SpeechConversion-->AcousticModelTraining;
    AcousticModelTraining-->VocoderTraining;
    VocoderTraining-->AcousticFeatures;
    AcousticFeatures-->SyntheticSpeechOutput;
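
The preprocessing stage can be illustrated with a short feature-extraction sketch. This uses librosa, which is not part of this repo, and the sample rate and frame parameters below are common TTS defaults rather than the exact values used here:

    import librosa
    import numpy as np

    # Load a clip at a typical TTS sample rate (22050 Hz is an assumption).
    wav, sr = librosa.load("converted/data-01.wav", sr=22050)

    # Log-mel spectrogram: the acoustic features the acoustic model learns to predict.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))

    # MFCCs: the alternative compact representation mentioned above.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)

    print(log_mel.shape, mfcc.shape)  # (80, n_frames), (13, n_frames)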

Dataset Preparation:

For this project, we used a dataset of 212 of Priyanka Chopra's voice notes and their corresponding transcriptions. The dataset was meticulously prepared through the following steps:

  • Extracted audio and subtitles from Priyanka Chopra's interviews using yt-dlp.
  • Filtered out segments from other speakers in both the audio and the transcripts.
  • Cleaned the audio in Audacity to improve its usability.
  • Denoised the audio with rnnoise for better model performance.
  • Transcribed the audio using OpenAI's Whisper model (a sketch of this step follows the list).
  • Organized the dataset into a .csv file containing the audio file names and corresponding dialogues.
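
Below is a sketch of the transcription and metadata step, assuming the openai-whisper package; the model size, file layout, and the literal "speaker" label are illustrative:

    import csv
    import glob
    import os

    import whisper

    # Whisper model size is an assumption; larger models transcribe more accurately.
    model = whisper.load_model("base")

    with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for path in sorted(glob.glob("converted/*.wav")):
            # Transcribe each denoised clip and write one pipe-separated row per clip.
            text = model.transcribe(path)["text"].strip()
            clip_id = os.path.splitext(os.path.basename(path))[0]
            writer.writerow([clip_id, "speaker", text])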

The audio dataset can be downloaded from here

Dataset Structure:

/converted
 | - data-01.wav
 | - data-02.wav
 | - data-03.wav

/metadata.csv

  | - data-01|speaker|Dialogue 1.
  | - data-02|speaker|Dialogue 2.
  | - data-03|speaker|Dialogue 3.
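
This pipe-separated layout can be loaded with coqui-TTS's dataset utilities. A minimal sketch using a custom formatter, following the pattern from the coqui-TTS documentation (the dataset path is a placeholder, and sample field names vary slightly across TTS versions):

    import os

    from TTS.tts.configs.shared_configs import BaseDatasetConfig
    from TTS.tts.datasets import load_tts_samples

    def formatter(root_path, meta_file, **kwargs):
        """Parse metadata.csv rows shaped like: data-01|speaker|Dialogue 1."""
        items = []
        with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
            for line in f:
                file_id, speaker, text = line.strip().split("|")
                wav_path = os.path.join(root_path, "converted", file_id + ".wav")
                items.append(
                    {"text": text, "audio_file": wav_path, "speaker_name": speaker, "root_path": root_path}
                )
        return items

    dataset_config = BaseDatasetConfig(meta_file_train="metadata.csv", path="/path/to/dataset")
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, formatter=formatter)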

Hyperparameters and Pretrained Model Weights:

To achieve the best results, we fine-tuned the coqui-TTS model with the following hyperparameters:

Hyperparameters used:
batch_size=16
eval_batch_size=16
num_loader_workers=4
num_eval_loader_workers=4
run_eval=True
test_delay_epochs=-1
epochs=200
lr=0.0005
text_cleaner="phoneme_cleaners"
use_phonemes= False
phoneme_language="en-us"
mixed_precision=True
save_step=7000
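
These values map directly onto a coqui-TTS training config. A minimal sketch assuming the Glow-TTS recipe (the average log-MLE metric reported below is characteristic of Glow-TTS; the exact config class used in this repo is an assumption):

    from TTS.tts.configs.glow_tts_config import GlowTTSConfig

    # Training configuration mirroring the hyperparameters listed above.
    config = GlowTTSConfig(
        batch_size=16,
        eval_batch_size=16,
        num_loader_workers=4,
        num_eval_loader_workers=4,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=200,
        lr=0.0005,
        text_cleaner="phoneme_cleaners",
        use_phonemes=False,
        phoneme_language="en-us",
        mixed_precision=True,
        save_step=7000,
    )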

Results:

The following results were obtained by fine-tuning the coqui-TTS model:

Average Loss       | Average Log MLE (Maximum Likelihood Estimation) | Average Loader Time
0.2887064963579178 | -0.2587181031703949                             | 0.0015705227851867676

Using the Trained Model from the Command Line:

To use the trained model from the command line, run the command below with the trained model.pth and config.json:

 !tts --text "Hi, I am an excellent Text to Speech cloning AI" \
      --model_path model.pth \
      --config_path config.json \
      --out_path out.wav

 # Play the generated audio in a notebook
 import IPython
 IPython.display.Audio("out.wav")
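
Synthesis can also be scripted through the coqui-TTS Python API; a sketch, assuming a TTS release that exposes TTS.api, with placeholder paths:

    from TTS.api import TTS

    # Load the fine-tuned checkpoint and its config.
    tts = TTS(model_path="model.pth", config_path="config.json")

    # Synthesize straight to a wav file.
    tts.tts_to_file(
        text="Hi, I am an excellent Text to Speech cloning AI",
        file_path="out.wav",
    )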

Feedback:

If you have any feedback, please reach out to me on LinkedIn.

Author: @anujsahani01