Deep Convolutional TTS

A PyTorch implementation of "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention".
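
The core trick of the paper is a guided attention loss: since text and speech are aligned roughly monotonically, attention weights far from the diagonal are penalized while training Text2Mel. A minimal sketch of that loss (function names and the batch-free shapes are my own; g = 0.2 is the value suggested in the paper):

```python
import torch

def guided_attention_weights(N, T, g=0.2):
    # Penalty matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 * g^2)):
    # near 0 on the diagonal, close to 1 far away from it.
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N  # shape (N, 1)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T  # shape (1, T)
    return 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))

def guided_attention_loss(A):
    # A: attention matrix of shape (N, T) from the Text2Mel network.
    W = guided_attention_weights(*A.shape).to(A.device)
    return (A * W).mean()
```

In the paper this term is added to the usual reconstruction loss and quickly forces the attention to become roughly diagonal.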

Setup

Requirements:

  • pytorch >= 1.3
  • librosa
  • scipy
  • numpy
  • matplotlib
  • unidecode
  • tqdm

Optional:

  • simpleaudio and num2words, if you want to run realtime.py
  • nltk for better text processing

Data

For audio preprocessing I mainly used Kyubyong's DCTTS code. I trained the model on the LJSpeech Dataset and the German samples from the CSS10 Dataset. You can find pretrained models below.

If you want to train a model, you need to prepare your dataset:

  1. Create a directory data for your dataset and a subdirectory data/wav containing all your audio clips.

  2. Run audio_processing.py -w data/wav -m data/mel -l data/lin.

  3. Create a text file data/lines.txt containing the transcription of the audio clips in the following format:

     my-wav-file-000|Transcription of file my-wav-file-000.wav
     my-wav-file-001|Transcription of file my-wav-file-001.wav
     ...
    

    Note that you don't need to remove umlauts or accents like ä, é, î, etc.; this is done automatically. If your transcript contains abbreviations or numbers, however, you will need to spell them out. For spelling out numbers you can install num2words and use spell_out_numbers from text_processing.py (sketched below).
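
A number-spelling step might look roughly like this (a sketch built on num2words; the repo's spell_out_numbers in text_processing.py may handle more cases, e.g. ordinals or decimals):

```python
import re
from num2words import num2words

def spell_out(text, lang="en"):
    # Replace every digit sequence with its spelled-out form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group(0)), lang=lang), text)

print(spell_out("Track 42"))              # -> "Track forty-two"
print(spell_out("Gleis 42", lang="de"))   # -> "Gleis zweiundvierzig"
```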

Training

After preparing the dataset you can start training the Text2Mel and SSRN networks. Run

  • train_text2mel.py -d path/to/dataset
  • train_ssrn.py -d path/to/dataset

By default, checkpoints are saved every 10,000 steps; set -save_iter for a custom interval. If you want to continue training from a checkpoint, use -r save/checkpoint-xxxxx.pth. For other options run train_text2mel.py -h or train_ssrn.py -h, and have a look at config.py.
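
For reference, resuming a PyTorch model typically boils down to the following (a generic sketch; the keys and checkpoint layout here are assumptions, not necessarily this repo's exact format):

```python
import torch

def save_checkpoint(model, optimizer, step, path):
    # Persist everything needed to resume: weights, optimizer state, step.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path):
    # Restore weights and optimizer state; return the step to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```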

Generate speech

There are two scripts for generating audio:

With realtime.py you can type sentences in the terminal and have them read aloud. Run realtime.py --t2m text2mel-checkpoint.pth --ssrn ssrn-checkpoint.pth --lang en.

With synthesize.py text.txt you can generate a wav file from a given text file. Run it with the following arguments:

  • --t2m, --ssrn, -o: paths to the saved networks and output file (optional)
  • --max_N: The text file will be split into chunks no longer than this length (optional). If not given, the value used for training in config.py is picked. Reducing this value may improve audio quality, but it increases generation time for longer texts and introduces breaks within sentences (see the chunking sketch after this list).
  • --max_T: Number of mel frames to generate for each chunk (optional). If the endings of sentences are cut off, increase this value.
  • --lang: Language of the text (optional). Defaults to en and is used to spell out numbers occurring in the text.
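
To make the --max_N behaviour concrete, here is a minimal sketch of greedy whitespace chunking (illustrative only; the actual splitting in synthesize.py may differ, e.g. by preferring sentence boundaries):

```python
def split_text(text, max_n):
    # Greedily pack words into chunks of at most max_n characters.
    # Simplified: a single word longer than max_n becomes its own chunk.
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_n:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

print(split_text("The quick brown fox jumps over the lazy dog", 15))
# -> ['The quick brown', 'fox jumps over', 'the lazy dog']
```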

Samples

See here. All samples were generated with the models below.

Pretrained models

Language | Dataset   | Text2Mel   | SSRN
---------|-----------|------------|-----------
English  | LJ Speech | 350k steps | 350k steps
German   | CSS10     | 150k steps | 100k steps

Notes

  • I use layer norm, dropout and learning rate decay during training.
  • The audio quality seems to deteriorate towards the end of generated audio samples. A workaround is to set a low value for --max_N, which reduces the length of each sample.

Acknowledgement
