GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao | Zhejiang University, Sea AI Lab

PyTorch implementation of GenerSpeech (NeurIPS'22): a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.

We provide our implementation and pretrained models in this repository.

Visit our demo page for audio samples.

News

December 2022: GenerSpeech (NeurIPS 2022) released on GitHub.

Key Features

Multi-level Style Transfer for expressive text-to-speech.
Enhanced model generalization to out-of-distribution (OOD) style reference.

Quick Start

We provide an example of how you can generate high-fidelity samples using GenerSpeech.

To try it on your own dataset, clone this repository onto a local machine with an NVIDIA GPU, CUDA, and cuDNN, then follow the instructions below.

Supported Datasets and Pretrained Models

You can use the pretrained models we provide here, and the data here. Details of each folder are as follows:

Model	Dataset (16 kHz)	Description
GenerSpeech	LibriTTS, ESD	Acoustic model (config)
HIFI-GAN	LibriTTS,ESD	Neural Vocoder
Encoder	/	Emotion Encoder

More supported datasets are coming soon.

Dependencies

A suitable conda environment named generspeech can be created and activated with:

conda env create -f environment.yaml
conda activate generspeech

Multi-GPU

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
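
As a minimal sketch of selecting GPUs from Python rather than on the command line, the helper below returns a copy of the environment with CUDA_VISIBLE_DEVICES set, the same variable used in the commands in this README. The name gpu_env is hypothetical and not part of this repository. It relies only on standard CUDA behavior: a process sees just the devices listed in that variable, renumbered from 0.

```python
import os


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment exposing only the given GPU indices.

    gpu_env is a hypothetical helper for illustration; it is not part of
    the GenerSpeech code base.
    """
    ids = [int(i) for i in gpu_ids]
    if any(i < 0 for i in ids):
        raise ValueError("GPU indices must be non-negative")
    if len(set(ids)) != len(ids):
        raise ValueError("GPU indices must be unique")
    env = dict(os.environ if base_env is None else base_env)
    # CUDA applications only see the devices listed here, renumbered from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return env


if __name__ == "__main__":
    env = gpu_env([0, 2])
    print(env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a training or inference command, so the variable is in place before the child process initializes CUDA.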

Inference (Zero-shot TTS)

Here we provide a speech synthesis pipeline using GenerSpeech.

Prepare GenerSpeech (acoustic model): download the checkpoint and place it at checkpoints/GenerSpeech
Prepare HIFI-GAN (neural vocoder): download the checkpoint and place it at checkpoints/trainset_hifigan
Prepare Emotion Encoder: download the checkpoint and place it at checkpoints/Emotion_encoder.pt
Prepare dataset: download the statistical files and place them at data/binary/training_set
Prepare path/to/reference_audio (16 kHz): by default, GenerSpeech uses ASR + MFA to obtain the text-speech alignment from the reference audio.
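
Before running inference, it can help to confirm that the files from the steps above are in place and that the reference audio is sampled at 16 kHz. The following is a sketch under stated assumptions, not part of the repository: the function names are hypothetical, the paths are copied from the list above, and the sample-rate check uses Python's standard wave module, which only reads PCM WAV files.

```python
import wave
from pathlib import Path

# Paths taken from the steps above, relative to the repository root.
REQUIRED_PATHS = [
    "checkpoints/GenerSpeech",
    "checkpoints/trainset_hifigan",
    "checkpoints/Emotion_encoder.pt",
    "data/binary/training_set",
]


def missing_paths(repo_root="."):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


def wav_sample_rate(wav_path):
    """Read the sample rate of a PCM WAV file using the standard library."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate()


def check_setup(ref_audio, repo_root=".", expected_rate=16000):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = [f"missing: {p}" for p in missing_paths(repo_root)]
    rate = wav_sample_rate(ref_audio)
    if rate != expected_rate:
        problems.append(f"{ref_audio}: {rate} Hz, expected {expected_rate} Hz")
    return problems
```

Calling check_setup("assets/0011_001570.wav") from the repository root returns a list of problems to fix. The check only tests that paths exist; it does not validate checkpoint contents.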

CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"

Generated wav files are saved in infer_out by default.
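
To synthesize several sentences with the same reference audio, the command above can be assembled programmatically. The sketch below only reuses the script path, flags, and --hparams format shown in that command; inference_command is a hypothetical helper. How the --hparams parser treats commas or single quotes inside a value is not documented here, so the sketch assumes they are unsupported and rejects them.

```python
import shlex

CONFIG = "modules/GenerSpeech/config/generspeech.yaml"


def inference_command(text, ref_audio, exp_name="GenerSpeech"):
    """Build the argument list for inference/GenerSpeech.py."""
    for value in (text, ref_audio):
        # Assumption: the --hparams parser may not handle these characters
        # inside a value, so this sketch rejects them rather than guessing.
        if "'" in value or "," in value:
            raise ValueError(f"unsupported character in {value!r}")
    hparams = f"text='{text}',ref_audio='{ref_audio}'"
    return [
        "python", "inference/GenerSpeech.py",
        "--config", CONFIG,
        "--exp_name", exp_name,
        f"--hparams={hparams}",
    ]


if __name__ == "__main__":
    for sentence in ["here we go", "nice to meet you"]:
        cmd = inference_command(sentence, "assets/0011_001570.wav")
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

Each argument list can also be passed directly to subprocess.run, which avoids shell quoting altogether.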

Train your own model

Data Preparation and Configuration

Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file, and download the dataset to raw_data_dir.
Check preprocess_cls in the config file. The dataset structure needs to follow the processor named by preprocess_cls, or you can rewrite the processor to match your dataset. We provide a LibriTTS processor as an example in modules/GenerSpeech/config/generspeech.yaml
Download the global emotion encoder to emotion_encoder_path. For more details, please refer to this branch.
Preprocess the dataset:

# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config
# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config
# Binarization step: Binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
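
The three steps above must run in order, and a later step is not useful if an earlier one failed. A minimal sketch of chaining them from Python follows; the function names are hypothetical, and the script paths and --config flag are copied from the commands above. The child processes inherit the current environment, so CUDA_VISIBLE_DEVICES can still be set beforehand for the binarization step.

```python
import subprocess
import sys

# Scripts listed in the steps above, in the order they are run.
STEPS = [
    "data_gen/tts/bin/preprocess.py",       # unify the file structure
    "data_gen/tts/bin/train_mfa_align.py",  # MFA alignment
    "data_gen/tts/bin/binarize.py",         # binarize data for fast IO
]


def preprocessing_commands(config_path, python=sys.executable):
    """Return one argument list per preprocessing step."""
    return [[python, script, "--config", config_path] for script in STEPS]


def run_preprocessing(config_path):
    """Run the steps in order; check=True raises on the first failure."""
    for cmd in preprocessing_commands(config_path):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_preprocessing with the path to your config file from the repository root runs all three steps and stops with subprocess.CalledProcessError if any of them exits with a non-zero status.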

You can also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).

Training GenerSpeech

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --reset

Inference using GenerSpeech

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml  --exp_name GenerSpeech --infer

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: FastDiff and NATSpeech, as described in our code.

Citations

If you find this code useful in your research, please cite our work:

@inproceedings{huanggenerspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
