Deep Convolutional TTS

A PyTorch implementation of "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention".
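
The core trick of the paper is a guided attention loss: since text and speech are aligned roughly monotonically, attention weights far from the diagonal are penalized while training Text2Mel. A minimal sketch of that loss (function names and the batch-free shapes are my own; g = 0.2 is the value suggested in the paper):

```python
import torch

def guided_attention_weights(N, T, g=0.2):
    # Penalty matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 * g^2)):
    # near 0 on the diagonal, close to 1 far away from it.
    n = torch.arange(N, dtype=torch.float32).unsqueeze(1) / N  # shape (N, 1)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(0) / T  # shape (1, T)
    return 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))

def guided_attention_loss(A):
    # A: attention matrix of shape (N, T) from the Text2Mel network.
    W = guided_attention_weights(*A.shape).to(A.device)
    return (A * W).mean()
```

In the paper this term is added to the usual reconstruction loss and quickly forces the attention to become roughly diagonal.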

Setup

Requirements:

  • pytorch >= 1.3
  • librosa
  • scipy
  • numpy
  • matplotlib
  • unidecode
  • tqdm

Optional:

  • simpleaudio and num2words, if you want to run realtime.py
  • nltk for better text processing

Data

For audio preprocessing I mainly used Kyubyong's DCTTS code. I trained the model on the LJSpeech Dataset and the German samples from the CSS10 Dataset. You can find pretrained models below.

If you want to train a model, you need to prepare your dataset:

  1. Create a directory data for your dataset and a subdirectory data/wav containing all your audio clips.

  2. Run audio_processing.py -w data/wav -m data/mel -l data/lin.

  3. Create a text file data/lines.txt containing the transcription of the audio clips in the following format:

     my-wav-file-000|Transcription of file my-wav-file-000.wav
     my-wav-file-001|Transcription of file my-wav-file-001.wav
     ...
    

    Note that you don't need to remove umlauts or accents like ä, é, î, etc.; this is done automatically. If your transcript contains abbreviations or numbers, however, you will need to spell them out. For spelling out numbers you can install num2words and use spell_out_numbers from text_processing.py (sketched below).
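
A number-spelling step might look roughly like this (a sketch built on num2words; the repo's spell_out_numbers in text_processing.py may handle more cases, e.g. ordinals or decimals):

```python
import re
from num2words import num2words

def spell_out(text, lang="en"):
    # Replace every digit sequence with its spelled-out form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group(0)), lang=lang), text)

print(spell_out("Track 42"))              # -> "Track forty-two"
print(spell_out("Gleis 42", lang="de"))   # -> "Gleis zweiundvierzig"
```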

Training

After preparing the dataset you can start training the Text2Mel and SSRN networks. Run

  • train_text2mel.py -d path/to/dataset
  • train_ssrn.py -d path/to/dataset

By default, checkpoints are saved every 10,000 steps; set -save_iter for a custom interval. If you want to continue training from a checkpoint, use -r save/checkpoint-xxxxx.pth. For other options run train_text2mel.py -h or train_ssrn.py -h, and have a look at config.py.
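
For reference, resuming a PyTorch model typically boils down to the following (a generic sketch; the keys and checkpoint layout here are assumptions, not necessarily this repo's exact format):

```python
import torch

def save_checkpoint(model, optimizer, step, path):
    # Persist everything needed to resume: weights, optimizer state, step.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path):
    # Restore weights and optimizer state; return the step to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```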

Generate speech

There are two scripts for generating audio:

With realtime.py you can type sentences in the terminal and have them read aloud. Run realtime.py --t2m text2mel-checkpoint.pth --ssrn ssrn-checkpoint.pth --lang en.

With synthesize.py text.txt you can generate a wav file from a given text file. Run it with the following arguments:

  • --t2m, --ssrn, -o: paths to the saved networks and output file (optional)
  • --max_N: The text file will be split into chunks no longer than this length (optional). If not given, the value used for training in config.py is picked. Reducing this value may improve audio quality, but it increases generation time for longer texts and introduces breaks within sentences (see the chunking sketch after this list).
  • --max_T: Number of mel frames to generate for each chunk (optional). If the endings of sentences are cut off, increase this value.
  • --lang: Language of the text (optional). Defaults to en and is used to spell out numbers occurring in the text.
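
To make the --max_N behaviour concrete, here is a minimal sketch of greedy whitespace chunking (illustrative only; the actual splitting in synthesize.py may differ, e.g. by preferring sentence boundaries):

```python
def split_text(text, max_n):
    # Greedily pack words into chunks of at most max_n characters.
    # Simplified: a single word longer than max_n becomes its own chunk.
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_n:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

print(split_text("The quick brown fox jumps over the lazy dog", 15))
# -> ['The quick brown', 'fox jumps over', 'the lazy dog']
```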

Samples

See here. All samples were generated with the models below.

Pretrained models

Language | Dataset   | Text2Mel   | SSRN
---------|-----------|------------|-----------
English  | LJ Speech | 350k steps | 350k steps
German   | CSS10     | 150k steps | 100k steps

Notes

  • I use layer norm, dropout and learning rate decay during training.
  • The audio quality seems to deteriorate towards the end of generated audio samples. A workaround is to set a low value for --max_N, which reduces the length of each sample.

Acknowledgement
