This solution is based on converting whole sentences to WAV files before sending them back to Amica, so the main downside is added delay, especially for longer sentences. To lower the latency, Amica can further split sentences at commas, but this hurts audio cohesion a bit (unnaturally long pauses at commas, and each part of the sentence comes out in a slightly different tone). There is also a bug in the XTTS API where it writes an incorrect sample rate in the WAV header, so the played voice is slower and lower-pitched than it should be.
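As a stopgap for the header bug, the sample-rate field in the WAV header can be patched client-side before playback. This is only a sketch: it assumes the canonical 44-byte PCM RIFF header layout, and the `correct_rate` of 24000 Hz is an assumption based on XTTS's nominal output rate, so verify it against your model.

```python
import struct

def fix_wav_sample_rate(wav_bytes: bytes, correct_rate: int = 24000) -> bytes:
    """Patch the SampleRate and ByteRate fields of a canonical PCM WAV header.

    Assumes the standard 44-byte RIFF/WAVE layout:
      offset 22: NumChannels (u16), 24: SampleRate (u32),
      28: ByteRate (u32), 34: BitsPerSample (u16).
    24000 Hz is assumed to be XTTS's real output rate (an assumption; check your setup).
    """
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    channels = struct.unpack_from("<H", wav_bytes, 22)[0]
    bits = struct.unpack_from("<H", wav_bytes, 34)[0]
    byte_rate = correct_rate * channels * bits // 8
    patched = bytearray(wav_bytes)
    struct.pack_into("<I", patched, 24, correct_rate)  # SampleRate
    struct.pack_into("<I", patched, 28, byte_rate)     # ByteRate = rate * channels * bits/8
    return bytes(patched)
```

Applying this to the response bytes before handing them to the audio element avoids the slowed-down, low-pitched playback without touching the server.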
Because of that, I'm currently working on a dedicated streaming server for Amica using XTTS, which converts audio live and sends samples as they are generated. I already have a working solution with low latency (independent of sentence length), proper lip sync, and text progression. The code is still very hacky, with all configuration hardcoded, so I will probably need a week or two before I can share it for testing.
Using https://github.com/daswer123/xtts-api-server is one option for XTTS support, but having looked at the code, it relies on the local filesystem to share voice files between the client and server.
There is also https://github.com/coqui-ai/xtts-streaming-server, though I'm not sure how it would be integrated.