TTS Eval: Add TTS evaluation (MOS estimation) #2392

Open · wants to merge 38 commits into develop
Conversation

flexthink
Collaborator

What does this PR do?

Add TTS evaluation models trained on the SOMOS dataset
There should be no breaking changes

  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@mravanelli mravanelli marked this pull request as ready for review February 9, 2024 20:19
@mravanelli mravanelli self-requested a review February 9, 2024 20:19
@mravanelli mravanelli added the enhancement New feature or request label Feb 9, 2024
@mravanelli
Collaborator

Thank you @flexthink for your contribution! Having a model for MOS estimation is valuable for SpeechBrain. Here are some comments following an initial code inspection:

  1. The README.md file is missing.
  2. Recipe tests are not currently implemented.
  3. There seems to be an implementation of multiple systems; could we focus on integrating only the best one to help maintenance?
  4. We need to upload the logs to the official SpeechBrain Dropbox.
  5. The best model needs to be uploaded to HF (Hugging Face).
  6. In recipes/SOMOS/ttseval/contrastive_sampling.py, I suggest writing docstring examples for RegressionContrastiveEnhancement and ContrastivePairingPipeline (a sketch of the expected format follows this list).
  7. For ssl.py, docstring examples should be added for BaselineSSLFinetune and TransformerRegression.
  8. In metric_stats.py, docstring examples should be provided for LinearRegressionStats under MetricStats.
  9. somos_prepare.py appears twice. Could you confirm whether the second instance is just a soft link?
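
For items 6–8, SpeechBrain docstring examples are conventionally runnable doctests. Here is a minimal sketch of the expected format, using a toy class rather than any of the PR's actual classes (whose signatures I have not assumed):

```python
class ExampleScorer:
    """Toy class illustrating the docstring-example (doctest) style.

    Arguments
    ---------
    scale : float
        Multiplier applied to each raw score.

    Example
    -------
    >>> scorer = ExampleScorer(scale=2.0)
    >>> scorer(1.5)
    3.0
    """

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, score):
        return self.scale * score
```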

I recommend that @BenoitWang review this PR as well; his insights would be valuable.

Thank you once again for your contribution.

@BenoitWang
Collaborator

Hi @mravanelli @flexthink,

I suspect this may not be the final version of the code, since we have another branch with more recent experiments. I agree that we should keep only the best recipe. Some observations from my benchmark:

  1. WavLM large gives the best correlation so far.
  2. L2 loss works better than L1 loss, but training sometimes fails to converge when starting directly with L2, so I start with L1 for one epoch and then switch to L2 (see the sketch after this list).
  3. I added LJSpeech to the training set, first trained a ground-truth vs. synthesized classifier, and then fine-tuned that model on the regression task; this also helped a little. To simplify the recipe, we can provide a link to this classification model instead of adding the classification scripts.
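
A minimal sketch of the warm-up schedule from item 2, assuming a standard SpeechBrain Brain subclass; `number_of_l1_epochs` and the `batch.score` field are hypothetical names, not taken from this PR:

```python
import torch
import speechbrain as sb


class MOSRegressionBrain(sb.Brain):
    def compute_objectives(self, predictions, batch, stage):
        """Uses L1 loss for the first epoch(s), then switches to L2 (MSE)."""
        targets = batch.score.data  # hypothetical batch field with MOS labels
        current_epoch = self.hparams.epoch_counter.current
        if current_epoch <= self.hparams.number_of_l1_epochs:  # e.g., 1
            return torch.nn.functional.l1_loss(predictions, targets)
        return torch.nn.functional.mse_loss(predictions, targets)
```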

As for the code, it looks good to me. The only thing I'd note is that I don't find ssl.py very necessary: it wraps some basic SpeechBrain modules that we could declare in the YAML file instead. What do you think @flexthink?

self.pool = StatisticsPooling(return_std=False)
self.out_proj = Linear(n_neurons=1, input_size=d_model)

def forward(self, wav, length):
Collaborator
I think we can just declare these modules in the YAML and call them in compute_forward; that way it may be clearer to users (see the sketch below).
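
A minimal sketch of that suggestion, assuming HyperPyYAML as in other SpeechBrain recipes; the `d_model` reference mirrors the snippet above, but the key names and the `batch.sig` / `ssl_model` names are illustrative assumptions:

```yaml
# hyperparams.yaml (sketch)
pooling: !new:speechbrain.nnet.pooling.StatisticsPooling
    return_std: False
out_proj: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <d_model>
    n_neurons: 1
```

```python
def compute_forward(self, batch, stage):
    """Sketch: pool SSL features and project to a single MOS estimate."""
    wavs, lens = batch.sig  # hypothetical batch field
    feats = self.modules.ssl_model(wavs)  # hypothetical SSL feature extractor
    pooled = self.hparams.pooling(feats)
    return self.hparams.out_proj(pooled).squeeze(-1)
```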

return x


def compute_feats_dim(model):
Collaborator
For the feature dimension, I think we can declare it in the YAML file, for example 768 for base models and 1024 for large ones, as we did for other SSL recipes (see the sketch below).
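
A one-line sketch of that convention; the key name `encoder_dim` is an assumption borrowed from other SSL recipes:

```yaml
# 768 for base SSL models, 1024 for large ones
encoder_dim: 1024
```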


d_model: 512
d_ffn: 2048
num_layers: 4
Collaborator

We may need to update the best configuration later, for example WavLM large with a 3-layer encoder, as well as the learning rate, dropout, etc. (an illustrative sketch follows).
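
Purely illustrative values for such an update; none of these numbers are final or taken from the benchmark:

```yaml
d_model: 1024   # WavLM large feature dimension
d_ffn: 2048
num_layers: 3   # 3-layer encoder as suggested
lr: 0.0001      # placeholder, to be tuned
dropout: 0.1    # placeholder, to be tuned
```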
