This project offers a deeper exploration of tttzof351's "Simple Transformer TTS" tutorial and code, enhanced with insights from Gemini Advanced, Google AI's language model:
- Tutorial (medium article): Build text-to-speech from scratch by tttzof351
- Tutorial Source Code: github.com/tttzof351/SimpleTransfromerTTS
It is a toy implementation of a transformer TTS with these main simplifications:
- without tokenizer
- without scaled pos-encoding
- without vocoder, only Griffin-Lim
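To make the "without tokenizer" simplification concrete, here is a minimal sketch of character-level encoding, where each character maps directly to an integer id. The symbol set and the names `SYMBOLS`, `char_to_id`, and `encode_text` are hypothetical illustrations; they are not the repository's actual `text_to_seq` implementation, which may use a different symbol table and return a tensor.

```python
# Hypothetical character-level encoder: illustrative only, not the st_tts code.
import string

# Assumed symbol set: lowercase letters, space, and a few punctuation marks.
SYMBOLS = list(string.ascii_lowercase + " .,!?'-")
# Id 0 is reserved for padding, so real symbols start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}


def encode_text(text):
    """Map each known character to its integer id; skip unknown characters."""
    return [char_to_id[ch] for ch in text.lower() if ch in char_to_id]


print(encode_text("Hello, World"))
```

In this sketch, any character outside the assumed symbol set (digits, `%`, and so on) simply has no id and is dropped.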
The model was trained on the LJ Speech Dataset. The LJ Speech Dataset is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.
Note: Also check out the Kaggle notebook Simple Transformer Text-to-Speech associated with this GitHub repo, where I replicate the "Simple Transformer TTS" tutorial and code and use Google AI's Gemini for conceptual explanations.
- Introduction
- Model Architecture Breakdown
- How to Train the Transformer TTS on Kaggle ⭐
- Results
- Inference from Pre-trained Transformer TTS ⭐
- Observations
- Contributing
- Acknowledgments
- Code Commentary and Docstrings: Gemini has provided extensive comments and docstrings directly within the Python source files of the st_tts package. Explore files like dataset.py, model.py, train.py and others to find in-depth explanations.
- Module Explanations: Gemini will break down the functionality of TTS system components and their relationships within the codebase, offering a clearer understanding of the system's architecture.
- Enhanced Learning Experience: Delve into the mechanics of Transformer-based TTS with the support of Gemini's advanced language processing, supplementing the original tutorial.
- Tutorial and code Foundation: We'll utilize tttzof351's "Simple Transformer TTS" tutorial and code as the foundation for exploration.
- Code Integration: Gemini's insights are seamlessly integrated into the Python files (dataset.py, hyperparams.py, model.py, etc.) within the st_tts package as comments and docstrings.
- README Guidance: This README provides an overview and directs users towards the commented code for detailed explanations.
- Code Exploration: Dive into the st_tts package and examine the Python files to find Gemini's detailed comments and docstrings.
- Tutorial Reference: Refer to the original tutorial for context and the baseline implementation.
Textual depiction of the TransformerTTS model's component interactions:
Text Input -> Preprocessing -> Encoder -> Decoder -> Postprocessing
- Text Input: This is the initial text you want to convert to speech.
- Preprocessing:
  - encoder_prenet: Takes the text input, embeds characters or words, and applies linear transformations and convolutions.
  - pos_encoding: Injects positional information into the preprocessed text representation.
- Encoder:
  - encoder_block_1, encoder_block_2, encoder_block_3: A stack of Encoder blocks that process the preprocessed text representation using self-attention and feed-forward layers.
    - Each encoder block outputs an encoded representation capturing contextual information from the input text.
    - After the final encoder block (encoder_block_3), the encoded representation is normalized using norm_memory.
- Decoder:
  - decoder_prenet: Takes the mel-spectrogram target (typically from a teacher-forcing approach during training) and transforms it for use by the decoder.
  - decoder_block_1, decoder_block_2, decoder_block_3: A stack of Decoder blocks that generate the predicted mel-spectrogram.
    - Each decoder block uses self-attention over its own outputs and attention over the encoder outputs (the encoded text).
    - It also uses feed-forward layers for non-linear transformations.
  - The final decoder block's output is projected using:
    - linear_1: Projects the decoder output to mel-spectrogram features for the predicted spectrogram.
    - linear_2: Projects the decoder output for stop token prediction, indicating when speech has ended.
- Postprocessing:
  - postnet: Takes the predicted mel-spectrogram from linear_1 and refines it using convolutions for potentially improved quality.
Overall data flow: Encoded text representation from the encoder informs the decoder's mel-spectrogram prediction at each step. The decoder's output is then post-processed for potentially better quality.
Note: This is a simplified textual representation, and the actual model might have additional connections or skip-connections not explicitly shown here.
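As an illustration of what a positional-encoding step can compute, the sketch below builds the standard sinusoidal table from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the same frequency. This is an assumption about the general technique, not a copy of the repository's pos_encoding module; the function name `sinusoidal_position_encoding` is hypothetical. A "scaled" variant would additionally multiply this table by a trainable scalar, which is the part this project leaves out.

```python
import math


def sinusoidal_position_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_position_encoding(seq_len=4, d_model=8)
print(len(pe), len(pe[0]))
```

Each row of the table would be added to the embedding of the character at that position, so that otherwise identical characters can be told apart by where they occur.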
The Kaggle notebook can be found @ kaggle.com/code/raul23/simple-transformer-text-to-speech
Follow these steps to train the Transformer TTS on Kaggle:
- Ensure you are using a GPU, preferably T4.
  While the P100 GPU does support mixed precision training, its architecture limitations may result in smaller speed improvements compared to newer NVIDIA GPUs (e.g. T4) with dedicated tensor cores. Tensor cores are specifically designed to accelerate mixed precision computations, which may lead to more pronounced performance gains on newer hardware.
  See GPU T4 vs P100 for more details.
- Make sure you have added the necessary inputs for training the model in your notebook:
  - /kaggle/input/ljspeech-meta/metadata.csv: Metadata CSV file containing text, audio filenames, etc. It is associated with the LJ Speech Dataset.
  - /kaggle/input/the-lj-speech-dataset/LJSpeech-1.1/wavs/: Audio WAV files from the LJ Speech Dataset.
- Preparation and Configuration: Execute all other notebook cells, particularly hyperparams.py, where you can set important hyperparameters such as input and output paths, batch_size, step_print, step_test, and step_save.
- Training: To begin training the TTS model, click on 'Run All'. Each cell will be executed from the top of the notebook until the end. You will see the training loop displaying information about the training, such as the epoch, steps, train and test losses.
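Before clicking 'Run All', it can help to confirm that the two inputs listed above are actually attached to the notebook. The following is a small sketch of such a check; the helper name `find_missing_inputs` is hypothetical and is not part of the st_tts package.

```python
from pathlib import Path

# Paths taken from the list of required inputs in the steps above.
REQUIRED_INPUTS = [
    "/kaggle/input/ljspeech-meta/metadata.csv",
    "/kaggle/input/the-lj-speech-dataset/LJSpeech-1.1/wavs/",
]


def find_missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


missing = find_missing_inputs(REQUIRED_INPUTS)
if missing:
    print("Missing inputs:", missing)
else:
    print("All inputs found.")
```

Running this in the first cell surfaces a missing dataset immediately, rather than partway through the notebook.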
I used TensorBoard to track and visualize both train and test losses, as well as display spectrograms and audio data.
Here's how I installed TensorBoard within a conda environment:
- Install TensorBoard using conda:
    conda install tensorboard
- If you encounter a ModuleNotFoundError: No module named 'chardet' error when running TensorBoard in your terminal, install chardet:
    pip install chardet
Note: TensorFlow does not need to be installed. However, TensorBoard will warn you that it will be running with a reduced feature set.
To use TensorBoard:
- Open your terminal and run the following command:
    tensorboard --logdir logs/
  Here, --logdir points to your directory containing the log files generated while training a model, which includes relevant data such as train/test losses, weights, and audio. In this project, st_tts generates log files such as events.out.tfevents.1714616902.c720ee4d6b8b.34.0 every 1000 steps (default value).
- Then, open your browser and go to http://localhost:6006/.
- Press CTRL+C in the terminal to quit TensorBoard.
Here are the train and test losses after training the TTS model for 68k steps (~12 hours on T4). By comparison, tttzof351 trained their model for more than 400k steps (~1 day on V100).
The model was evaluated by generating audio samples of the phrase 'Hello, World' after every 1000 steps of training. You can listen here to the audio files generated after training for 1000, 34000, and 68000 steps.
I stopped training after 68000 steps due to reaching the 12-hour session limit for GPU training on Kaggle.
This section demonstrates how to perform inference with a pre-trained transformer text-to-speech (TTS) model trained on the LJ Speech Dataset. The model was trained by GitHub user tttzof351, and the provided weights were uploaded to Kaggle for convenience. The inference code presented is sourced from tttzof351's GitHub repository here and is also found in this Kaggle notebook.
- First install the simple-transformer-tts package:
!pip install git+https://github.com/raul23/simple-transformer-tts#egg=simple-transformer-tts
- Import the following packages and libraries:
    import IPython
    import torch

    from st_tts.hyperparams import hp
    from st_tts.melspecs import inverse_mel_spec_to_wav
    from st_tts.model import TransformerTTS
    from st_tts.text_to_seq import text_to_seq
    from st_tts.write_mp3 import write_mp3
- Load the pre-trained Transformer TTS model:
    # Path to the saved model weights file
    train_saved_path = "/kaggle/input/simple-transfer-tts/pytorch/simple-transfer-tts/1/train_SimpleTransfromerTTS.pt"

    # Load the saved model weights
    state = torch.load(train_saved_path)

    # Initialize the model architecture
    model = TransformerTTS().cuda()

    # Load the model weights into the initialized model
    model.load_state_dict(state["model"])
- This is the function that will be used to generate speeches based on short texts:
    # Define text and output file name
    # NOTE: The model is unable to generate audio for numbers or special symbols such as %
    def synthesize_text_to_speech(text="The quick brown fox jumps over the lazy dog", name_file="speech.mp3"):
        # Perform inference to generate mel spectrogram and gate output
        postnet_mel, gate = model.inference(
            text_to_seq(text).unsqueeze(0).cuda(),
            # gate_threshold=1e-5,  # TODO: not supported
            with_tqdm=False
        )

        # Generate audio from mel spectrogram
        audio = inverse_mel_spec_to_wav(postnet_mel.detach()[0].T)

        # Write audio to MP3 file
        write_mp3(
            audio.detach().cpu().numpy(),
            name_file
        )

        # Display audio
        return IPython.display.Audio(
            audio.detach().cpu().numpy(),
            rate=hp.sr
        )
- Generate the speech based on your text:
    text = '''Breaking news! Scientists have discovered a new exoplanet potentially capable of supporting life. Further research is ongoing.'''
    synthesize_text_to_speech(text)
You can listen here to the audio files generated based on different types of text (e.g. emotional, factual, poetry).
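The inference call above returns a gate output alongside the mel-spectrogram. As a rough sketch of how a stop token can end autoregressive decoding, the loop below generates one frame at a time and stops once the predicted stop probability crosses a threshold. The names `fake_decoder_step`, `generate_frames`, `stop_threshold`, and `max_frames` are hypothetical stand-ins, and the real `model.inference` may work differently.

```python
def fake_decoder_step(frames):
    """Hypothetical stand-in for one decoder step.

    Returns a dummy mel frame and a stop probability that grows with length.
    """
    next_frame = [0.0] * 3  # a real model would predict mel features here
    stop_probability = min(1.0, len(frames) / 5)
    return next_frame, stop_probability


def generate_frames(decoder_step, stop_threshold=0.5, max_frames=100):
    """Generate frames until the stop token fires or max_frames is reached."""
    frames = []
    while len(frames) < max_frames:
        frame, stop_probability = decoder_step(frames)
        frames.append(frame)
        if stop_probability > stop_threshold:
            break
    return frames


frames = generate_frames(fake_decoder_step)
print(len(frames))
```

The `max_frames` cap matters in practice: if the stop prediction never fires, for instance on unusual input text, the loop would otherwise run indefinitely.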
I found that training with the T4 GPU was quicker compared to the P100 GPU:
- T4: 550 seconds per 1000 steps
- P100: 750 seconds per 1000 steps
(Note: For each step, one batch of data is processed)
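To put these two timings in perspective, the short calculation below derives the relative speedup and an estimate of how many steps fit into a 12-hour Kaggle session. It is only arithmetic on the numbers reported above; the helper name `steps_in_session` is hypothetical, and the estimate counts pure step time, ignoring evaluation, checkpointing, and notebook setup.

```python
# Timings reported above, in seconds per 1000 training steps.
T4_SECONDS_PER_1000_STEPS = 550
P100_SECONDS_PER_1000_STEPS = 750


def steps_in_session(seconds_per_1000_steps, session_hours=12):
    """Estimate how many training steps fit into a session of the given length."""
    return session_hours * 3600 * 1000 // seconds_per_1000_steps


speedup = P100_SECONDS_PER_1000_STEPS / T4_SECONDS_PER_1000_STEPS
print(f"T4 speedup over P100: {speedup:.2f}x")
print("T4 steps in 12 h:", steps_in_session(T4_SECONDS_PER_1000_STEPS))
print("P100 steps in 12 h:", steps_in_session(P100_SECONDS_PER_1000_STEPS))
```

Under these assumptions the T4 is roughly 1.36 times faster per step, which works out to about 78,500 steps versus 57,600 steps in a 12-hour session.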
It might seem counterintuitive that the T4 would outperform the P100 in training, considering the P100's greater computational power. The likely explanation is that the simple Transformer TTS training code uses mixed precision computations, as the following code snippets show:
- scaler = torch.cuda.amp.GradScaler() (from train.py)
- with torch.autocast(device_type='cuda', dtype=torch.float16) (from train.py)
While both T4 and P100 support mixed precision, significant performance gains might not be observed on the P100.
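For readers unfamiliar with how these two pieces fit together, here is a minimal sketch of one mixed precision training step in PyTorch. It is not the code from train.py: the placeholder model, the random data, and the CPU fallback to bfloat16 (with gradient scaling disabled) are assumptions added so the example runs without a GPU.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
# float16 autocast targets CUDA; on CPU this sketch falls back to bfloat16.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(16, 4).to(device)  # placeholder model, not TransformerTTS
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Gradient scaling is only needed for float16 on CUDA, so it is disabled otherwise.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

optimizer.zero_grad()
# The forward pass and loss run under autocast, using lower precision where safe.
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()  # scale the loss to limit float16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then run optimizer.step()
scaler.update()                # adjust the scale factor for the next iteration
print(float(loss))
```

On GPUs with tensor cores, the low-precision matrix multiplications inside the autocast region are where most of the speedup would come from.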
ceshine trained a Wide ResNet model on CIFAR-10 and recorded the training times on T4 and P100 GPUs with and without mixed precision:
In the blog post, ceshine remarked the following:
- Training with mixed precision on T4 is almost twice as fast as with single precision, and consumes consistently less GPU memory.
- Training wide-resnet with mixed precision on P100 does not have any significant effect in terms of speed.
According to TensorFlow's Guide about Mixed Precision:
While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs, Cloud TPUs and recent Intel CPUs. [...] The P100 has compute capability 6.0 and is not expected to show a significant speedup.
So in conclusion: while the P100 GPU does support mixed precision training, its architecture limitations may result in smaller speed improvements compared to newer NVIDIA GPUs (e.g. T4) with dedicated tensor cores. Tensor cores are specifically designed to accelerate mixed precision computations, which may lead to more pronounced performance gains on newer hardware.
We invite your contributions! To share insights or suggest improvements, please open an issue or submit a pull request.