Hi,
I want to fine-tune "stt_en_fastconformer_hybrid_large_streaming_multi" on custom data.
My dataset contains entries such as "Vitamin B12", "Code: c12r5", "hb1ac", etc.
For these alphanumeric words:
Should I convert them to "vitamin b twelve", "code c one two r five", "h b one a c", and so on, in order to use the default tokenizer?
Or should I create a custom/new tokenizer for this?
If you have any other suggestions, please let me know.
Thank you
If you want to fine-tune using the original tokenizer, then yes, you'll need to normalize all numbers to spoken words.
Changing the tokenizer means you'll need a large amount of data to retrain the model; that is not suggested unless you have several thousand hours of speech, which is what it takes to reach the best results.
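For the normalization step above, here is a minimal sketch of how the alphanumeric tokens from the question could be expanded to spoken words before training. This is not a NeMo utility; the helper names, the 0–99 number coverage, and the heuristic (spell letters individually only when a token mixes letters and digits) are my own assumptions. Whether a digit run should be read as a number ("twelve") or digit by digit ("one two") depends on your domain, so it is exposed as a flag:

```python
import re

# Spoken forms for 0-19 and the tens; enough for tokens like "B12" -> "b twelve".
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Render 0-99 as a single spoken number, e.g. 12 -> 'twelve'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def spell_digits(s: str) -> str:
    """Spell a digit run digit by digit, e.g. '12' -> 'one two'."""
    return " ".join(ONES[int(d)] for d in s)

def normalize_token(token: str, digits_as_words: bool = False) -> str:
    """Normalize one alphanumeric token to its spoken form.

    Plain words ('Vitamin') are only lowercased. Tokens that mix letters
    and digits ('hb1ac', 'c12r5') have their letters spelled individually
    and their digit runs expanded: as one number if digits_as_words=True,
    otherwise digit by digit.
    """
    if not re.search(r"\d", token) or not re.search(r"[A-Za-z]", token):
        return token.lower()
    parts = re.findall(r"[A-Za-z]+|\d+", token)
    out = []
    for p in parts:
        if p.isdigit():
            out.append(number_to_words(int(p)) if digits_as_words
                       else spell_digits(p))
        else:
            out.append(" ".join(p.lower()))  # 'hb' -> 'h b'
    return " ".join(out)

print(normalize_token("B12", digits_as_words=True))  # b twelve
print(normalize_token("c12r5"))                      # c one two r five
print(normalize_token("hb1ac"))                      # h b one a c
```

Whichever convention you pick, apply it consistently across the whole training manifest, since the model can only learn one spoken form per written pattern.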
How do I use the original tokenizer?
I also created a discussion for this.