
Tokenizer suggestion for fine-tuning cache-aware streaming model #9124

Open
rkchamp25 opened this issue May 7, 2024 · 2 comments

rkchamp25 commented May 7, 2024
Hi,
I want to fine-tune "stt_en_fastconformer_hybrid_large_streaming_multi" on custom data.
My dataset contains alphanumeric terms such as "Vitamin B12", "Code: c12r5", and "hb1ac".
For these alphanumeric words:

  1. Should I convert them to spoken forms like "vitamin b twelve", "code c one two r five", "h b one a c" so that I can use the default tokenizer?
  2. Or should I create a custom/new tokenizer for this?

If there is any other suggestion, please let me know.
Thank you

titu1994 (Collaborator) commented May 7, 2024

If you want to fine-tune using the original tokenizer, then yes, you'll need to normalize all numbers to spoken words.

Changing the tokenizer means you'll need a large amount of data to retrain the model; that is not suggested unless you have several thousand hours of speech to reach the best results.
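
For illustration, a minimal preprocessing sketch of this kind of normalization (assuming the `num2words` package; the exact spoken forms, e.g. "b twelve" vs. "b one two", are domain decisions you should verify against how the terms are actually spoken in your audio):

```python
import re
from num2words import num2words  # assumed helper package: pip install num2words

def spell_chars(token: str) -> str:
    """Spell a token character by character: 'c12r5' -> 'c one two r five'."""
    return " ".join(num2words(int(c)) if c.isdigit() else c for c in token)

def normalize_token(token: str) -> str:
    """Heuristic spoken-form normalization for a single token."""
    t = token.lower().strip(":,.")
    if t.isdigit():                           # "12" -> "twelve"
        return num2words(int(t))
    m = re.fullmatch(r"([a-z]+)(\d+)", t)     # "b12" -> "b twelve"
    if m:
        letters, digits = m.groups()
        return " ".join(letters) + " " + num2words(int(digits))
    if any(c.isdigit() for c in t):           # "hb1ac" -> "h b one a c"
        return spell_chars(t)
    return t

def normalize(text: str) -> str:
    return " ".join(normalize_token(w) for w in text.split())

print(normalize("Vitamin B12"))   # vitamin b twelve
print(normalize("Code: c12r5"))   # code c one two r five
print(normalize("hb1ac"))         # h b one a c
```

Whatever rules you choose, apply them consistently to every transcript in the fine-tuning manifest so the model only ever sees the spoken forms.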

bfss commented May 17, 2024

> If you want to fine-tune using the original tokenizer, then yes, you'll need to normalize all numbers to spoken words.
>
> Changing the tokenizer means you'll need a large amount of data to retrain the model; that is not suggested unless you have several thousand hours of speech to reach the best results.

How do I use the original tokenizer?
I also created a discussion for this.
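
(For reference, a minimal sketch of how this usually works in NeMo: the pretrained `.nemo` checkpoint bundles its own SentencePiece tokenizer, so restoring the model also restores that tokenizer, and fine-tuning without swapping the vocabulary keeps it unchanged. The model name is the one from this issue; `text_to_tokens` is assumed from NeMo's `TokenizerSpec` interface.)

```python
import nemo.collections.asr as nemo_asr

# Restoring the pretrained checkpoint also restores its bundled tokenizer;
# fine-tuning without calling model.change_vocabulary(...) keeps it as-is.
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Sanity check: inspect how the original tokenizer segments your normalized text.
print(model.tokenizer.text_to_tokens("vitamin b twelve"))
```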
