
Fairseq voice cloning #3142

Closed
Poccapx opened this issue Nov 5, 2023 · 13 comments · Fixed by eginhard/coqui-tts#11 · May be fixed by #3500
Labels
bug Something isn't working

Comments

Poccapx commented Nov 5, 2023

Describe the bug

There seems to be an issue with activating voice cloning in Coqui when using Fairseq models. The --speaker_wav argument works fine on identical text with the XTTS model, but with Fairseq it seems to be ignored. I have tried both .wav and .mp3, different lengths, file locations/names, with and without CUDA, and several languages. There are no errors, just always the same generic male voice. Is this a known issue with voice cloning and Fairseq on Windows' command line, or is something wrong with my setup?

To Reproduce

No response

Expected behavior

No response

Logs

No response

Environment

Windows, tts.exe

Additional context

No response

Poccapx added the bug label Nov 5, 2023
erogol (Member) commented Nov 8, 2023

Can you give us code to reproduce the problem?

Poccapx (Author) commented Nov 8, 2023

Just running with any Fairseq model normally, the same way as with XTTS (which clones just fine, version 2 included): tts.exe --use_cuda true --model_name tts_models/[lang]/fairseq/vits --text "Testing voice cloning with Fairseq on Windows." --speaker_wav Test.wav --out_path Fairseq.wav

erogol (Member) commented Nov 8, 2023

Poccapx (Author) commented Nov 8, 2023

The thing is that running tts.exe --use_cuda true --model_name tts_models/multilingual/multi-dataset/xtts_v2 --language_idx [lang] --text "Testing voice cloning with XTTS on Windows." --speaker_wav Test.wav --out_path XTTS.wav clones the voice perfectly fine. The problem is with Fairseq models, where the argument --speaker_wav seems to be ignored, using the generic male voice.

Sharrnah commented

@Poccapx
For non-voice-cloning models, you need to run the resulting TTS audio through a voice conversion model. See
https://github.com/coqui-ai/TTS#voice-conversion-models

XTTS is a voice-cloning model which does this on its own (and actually can't run without a cloning audio file).
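The two-step pipeline described above (synthesize with a Fairseq model, then convert the speaker with FreeVC) can also be driven from the Python API. A minimal sketch, assuming the coqui `TTS.api.TTS` class; the language code, file names, and output paths are illustrative:

```python
def clone_via_freevc(text, speaker_wav, out_path,
                     tts_model="tts_models/eng/fairseq/vits",
                     vc_model="voice_conversion_models/multilingual/vctk/freevc24"):
    """Synthesize with a non-cloning Fairseq model, then convert the speaker."""
    from TTS.api import TTS  # imported lazily so the sketch stays self-contained

    # Step 1: synthesize in the model's default (generic) voice.
    TTS(tts_model).tts_to_file(text=text, file_path="fairseq_raw.wav")

    # Step 2: convert the synthesized speech toward the target speaker.
    TTS(vc_model).voice_conversion_to_file(
        source_wav="fairseq_raw.wav",  # the speech to convert
        target_wav=speaker_wav,        # reference recording of the target voice
        file_path=out_path,            # where the converted result is written
    )
```

Note that both models are downloaded on first use, so the first call takes a while.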

Poccapx (Author) commented Nov 10, 2023

That’s very important, thank you. Is there a list of voice conversion models to use with --model_name "<language>/<dataset>/<model_name>"?

Sharrnah commented

Pretty sure there is currently only one official one, and that is voice_conversion_models/multilingual/vctk/freevc24.

(Not sure if you have to leave the first part "voice_conversion_models" out of the --model_name argument, as I am not using the CLI.)

You can find a list of all models here: https://github.com/coqui-ai/TTS/blob/dev/TTS/.models.json#L924

Poccapx (Author) commented Nov 10, 2023

Right! In the command tts --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --source_wav <path/to/speaker/wav> --target_wav <path/to/reference/wav>, what is the difference between the arguments --out_path and --target_wav?

Sharrnah commented

--source_wav is the speech audio you want to convert.
--target_wav is the speech you want the source_wav to be converted into.
--out_path is where the finished converted audio is written.
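To make the three arguments concrete, here is the full CLI invocation built as an argument list; the file names are placeholders, and this snippet only assembles and prints the command rather than running the tts CLI:

```python
# Illustrative freevc24 voice-conversion invocation; file names are placeholders.
freevc_cmd = [
    "tts",
    "--model_name", "voice_conversion_models/multilingual/vctk/freevc24",
    "--source_wav", "fairseq_output.wav",   # the speech audio to convert
    "--target_wav", "reference_voice.wav",  # the voice it should be converted into
    "--out_path", "converted.wav",          # where the finished audio is written
]
print(" ".join(freevc_cmd))
# To actually execute it: subprocess.run(freevc_cmd, check=True)
```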

Poccapx (Author) commented Nov 12, 2023

Using voice_conversion_models/multilingual/vctk/freevc24 on top of a Fairseq output has worked. The voice cloning quality is nowhere near that of XTTS, but at least that way it's possible to switch from the default male voice to a female one. For Fairseq as a non-voice-cloning model, is the --speaker_wav argument always pointless, or are there instances where it is used with Fairseq? It is present in these two examples, which got me thinking that something was wrong with my initial setup.

Sharrnah commented Nov 14, 2023

Sorry for the late reply.

For the first example link, it's because the tts_with_vc_to_file() function does the voice conversion internally already (that's what the "with_vc" part of the function name means).

About your second example, I actually have no idea. I would guess it has to do with the encoder model and not with the TTS model, but that's just a guess. So maybe I was wrong and you can somehow convert speakers using some vocoder models. I haven't found anything in the documentation about it, so maybe ask about it in the discussions: https://github.com/coqui-ai/TTS/discussions

I hate to advertise, but in case you want, you can give my application Whispering Tiger a try. It has multiple TTS plugins (including Coqui TTS), and together with the RVC plugin and an RVCv2 model you can have probably the best voice conversion currently available. (It's currently Windows-only, though.)
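For completeness, the tts_with_vc_to_file() path mentioned above can be sketched from the Python API as well. This is a minimal sketch assuming the coqui `TTS.api.TTS` class; the model name and file paths are illustrative:

```python
def clone_in_one_call(text, speaker_wav, out_path,
                      tts_model="tts_models/eng/fairseq/vits"):
    """Run TTS and FreeVC voice conversion in a single call."""
    from TTS.api import TTS  # imported lazily so the sketch stays self-contained

    # tts_with_vc_to_file() synthesizes the text and then converts the
    # result toward speaker_wav internally (the "with_vc" part of the name),
    # so no intermediate file handling is needed.
    TTS(tts_model).tts_with_vc_to_file(
        text=text,
        speaker_wav=speaker_wav,  # reference recording of the target voice
        file_path=out_path,       # where the converted audio is written
    )
```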

erogol (Member) commented Nov 28, 2023

should be fixed by now.

erogol closed this as completed Nov 28, 2023
Nanshanelectrician commented

Can you try this? https://tts.readthedocs.io/en/latest/inference.html#example-voice-cloning-by-a-single-speaker-tts-model-combining-with-the-voice-conversion-model

AFAIR the terminal does not support TTS with VC.

UnboundLocalError: cannot access local variable 'dataset' where it is not associated with a value
