
Voice Cloning using coqui-TTS

Fine-tuned coqui-TTS:

Voice cloning converts text input into natural, expressive synthetic speech in a target speaker's voice using a pre-trained Text-to-Speech (TTS) model. In this project, we fine-tuned the coqui-TTS model, whose pipeline involves two main stages: text and audio preprocessing, and acoustic model training. The text data undergoes tokenization and normalization, while the audio data is converted into Mel-frequency cepstral coefficients (MFCCs) or spectrograms. The acoustic model, typically a neural network such as an RNN or a transformer, learns the mapping between text representations and acoustic features. Finally, the vocoder synthesizes the acoustic features into high-quality waveforms, generating the desired speech output.

graph TD;
    Text2SpeechConversion-->AcousticModelTraining;
    AcousticModelTraining-->VocoderTraining;
    VocoderTraining-->AcousticFeatures;
    AcousticFeatures-->SyntheticSpeechOutput;
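
The preprocessing stage can be illustrated with a short feature-extraction sketch. This uses librosa, which is not part of this repo, and the sample rate and frame parameters below are common TTS defaults rather than the exact values used here:

    import librosa
    import numpy as np

    # Load a clip at a typical TTS sample rate (22050 Hz is an assumption).
    wav, sr = librosa.load("converted/data-01.wav", sr=22050)

    # Log-mel spectrogram: the acoustic features the acoustic model learns to predict.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))

    # MFCCs: the alternative compact representation mentioned above.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)

    print(log_mel.shape, mfcc.shape)  # (80, n_frames), (13, n_frames)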

Dataset Preparation:

For this project, we used a dataset of 212 of Priyanka Chopra's voice notes and their corresponding transcriptions. The dataset was meticulously prepared through the following steps:

  • Extracted audio and subtitles from Priyanka Chopra's interviews using yt-dlp.
  • Filtered out segments from other speakers in both the audio and the transcripts.
  • Cleaned the audio in Audacity to improve its usability.
  • Denoised the audio with rnnoise for better model performance.
  • Transcribed the audio using OpenAI's Whisper model (a sketch of this step follows the list).
  • Organized the dataset into a .csv file containing the audio file names and corresponding dialogues.
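
Below is a sketch of the transcription and metadata step, assuming the openai-whisper package; the model size, file layout, and the literal "speaker" label are illustrative:

    import csv
    import glob
    import os

    import whisper

    # Whisper model size is an assumption; larger models transcribe more accurately.
    model = whisper.load_model("base")

    with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for path in sorted(glob.glob("converted/*.wav")):
            # Transcribe each denoised clip and write one pipe-separated row per clip.
            text = model.transcribe(path)["text"].strip()
            clip_id = os.path.splitext(os.path.basename(path))[0]
            writer.writerow([clip_id, "speaker", text])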

The audio dataset can be downloaded from here

Dataset Structure:

/converted
 | - data-01.wav
 | - data-02.wav
 | - data-03.wav

/metadata.csv

  | - data-01|speaker|Dialogue 1.
  | - data-02|speaker|Dialogue 2.
  | - data-03|speaker|Dialogue 3.
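
This pipe-separated layout can be loaded with coqui-TTS's dataset utilities. A minimal sketch using a custom formatter, following the pattern from the coqui-TTS documentation (the dataset path is a placeholder, and sample field names vary slightly across TTS versions):

    import os

    from TTS.tts.configs.shared_configs import BaseDatasetConfig
    from TTS.tts.datasets import load_tts_samples

    def formatter(root_path, meta_file, **kwargs):
        """Parse metadata.csv rows shaped like: data-01|speaker|Dialogue 1."""
        items = []
        with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
            for line in f:
                file_id, speaker, text = line.strip().split("|")
                wav_path = os.path.join(root_path, "converted", file_id + ".wav")
                items.append(
                    {"text": text, "audio_file": wav_path, "speaker_name": speaker, "root_path": root_path}
                )
        return items

    dataset_config = BaseDatasetConfig(meta_file_train="metadata.csv", path="/path/to/dataset")
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, formatter=formatter)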

Hyperparameters and Pretrained Model Weights:

To achieve the best results, we fine-tuned the coqui-TTS model with the following hyperparameters:

Hyperparameters used:
batch_size=16
eval_batch_size=16
num_loader_workers=4
num_eval_loader_workers=4
run_eval=True
test_delay_epochs=-1
epochs=200
lr=0.0005
text_cleaner="phoneme_cleaners"
use_phonemes= False
phoneme_language="en-us"
mixed_precision=True
save_step=7000
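
These values map directly onto a coqui-TTS training config. A minimal sketch assuming the Glow-TTS recipe (the average log-MLE metric reported below is characteristic of Glow-TTS; the exact config class used in this repo is an assumption):

    from TTS.tts.configs.glow_tts_config import GlowTTSConfig

    # Training configuration mirroring the hyperparameters listed above.
    config = GlowTTSConfig(
        batch_size=16,
        eval_batch_size=16,
        num_loader_workers=4,
        num_eval_loader_workers=4,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=200,
        lr=0.0005,
        text_cleaner="phoneme_cleaners",
        use_phonemes=False,
        phoneme_language="en-us",
        mixed_precision=True,
        save_step=7000,
    )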

Results:

The following results were obtained by fine-tuning the coqui-TTS model:

Average Loss       | Average Log MLE (Maximum Likelihood Estimation) | Average Loader Time
0.2887064963579178 | -0.2587181031703949                             | 0.0015705227851867676

Using the Trained Model from the Command Line:

To use the trained model from the command line, run the command below with the trained model.pth and config.json:

 !tts --text "Hi, I am an excellent Text to Speech cloning AI" \
      --model_path model.pth \
      --config_path config.json \
      --out_path out.wav

 # Play the generated audio in a notebook
 import IPython
 IPython.display.Audio("out.wav")
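
Synthesis can also be scripted through the coqui-TTS Python API; a sketch, assuming a TTS release that exposes TTS.api, with placeholder paths:

    from TTS.api import TTS

    # Load the fine-tuned checkpoint and its config.
    tts = TTS(model_path="model.pth", config_path="config.json")

    # Synthesize straight to a wav file.
    tts.tts_to_file(
        text="Hi, I am an excellent Text to Speech cloning AI",
        file_path="out.wav",
    )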

Feedback:

If you have any feedback, please reach out to me on LinkedIn.

Author: @anujsahani01