[Question] Can we distill for multiple languages for distil-small-whisper? #107

Open
Killshot667 opened this issue Apr 1, 2024 · 3 comments

Comments

Killshot667 commented Apr 1, 2024

I have seen several distillations of distil-whisper for different single languages (like en, de, etc.), but I have yet to come across a distil-whisper that has been trained to be multilingual. For my use case, I need to distill it on multiple languages, but I couldn't find any results related to this in the paper. I wanted to know whether such an experiment has been conducted before, at least for two languages, and whether there are any results available for such a training. Does it give good results for both languages, or does it fail to learn in such a case (maybe because of having only two decoder layers)? If it fails, could there be some possible reason other than the model being too small to accommodate multiple languages?

bil-ash commented Apr 4, 2024

I too have the same question.
@sanchit-gandhi Please try distilling whisper-small on kathbath dataset and share the results.

sanchit-gandhi (Collaborator) commented

Hey @Killshot667 - that's a great question, and super sorry for the late reply here! I'll defer to @eustlb, who has been running some preliminary experiments on distilling Whisper jointly for French and Spanish. You can read about the initial results and how to reproduce them on the README here: https://github.com/huggingface/distil-whisper/tree/main/training#3-language-mixing

eustlb (Collaborator) commented May 22, 2024

Hey @Killshot667! Thanks for raising this interesting point. Indeed, distillation has, for the moment, been targeted at single languages.

For distillation, the approach was initially to shrink the model as much as possible while maximizing its performance, by training a smaller decoder on a targeted language. The idea is to trade the multilingual capacity of the 32 decoder layers for the size and speed improvements brought by a smaller decoder (and therefore smaller learning capacity). In this context, two decoder layers appeared to be Pareto-optimal. Were we to train on a multilingual dataset, more decoder layers might be needed to increase learning capacity. Such an adaptation of the student model's decoder depth can easily be done by changing `--decoder_layers` when initializing the student, as sketched below.
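To illustrate, here is a minimal Python sketch of what changing `--decoder_layers` amounts to. It is an approximation of the student initialization rather than the exact script in the training folder: the encoder and embeddings are copied from the teacher, and the student decoder layers are initialized from maximally spaced teacher layers (here 4 layers instead of the default 2):

```python
import copy
import numpy as np
from transformers import WhisperForConditionalGeneration

# Teacher: full multilingual Whisper (32 decoder layers for large-v3).
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical to the teacher, except for the decoder depth
# (the equivalent of passing --decoder_layers 4 at initialization).
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 4

student = WhisperForConditionalGeneration(student_config)

# The encoder is kept identical to the teacher's.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())

# Keep Whisper's vocabulary / input embeddings (multilingual tokens included).
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())

# Initialize the student decoder layers from maximally spaced teacher layers.
teacher_layer_ids = np.linspace(0, teacher.config.decoder_layers - 1, student_config.decoder_layers)
for student_idx, teacher_idx in enumerate(teacher_layer_ids.round().astype(int)):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )

student.save_pretrained("./distil-whisper-student-4dec")
```

From there, the student is trained exactly as in the monolingual case; only the decoder depth differs.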

Secondly, note that nothing prevents a distilled model from having multilingual transcription capacities. First, the encoder is identical to Whisper's, so its robustness in building representations of speech across languages is unchanged. Second, when initializing the student model, we keep Whisper's vocabulary and start from Whisper's input embeddings, which come with the inherent multilingual tokens. To this extent, the only thing restraining distil-large-v3 from being multilingual is the dataset it has been distilled on. You could perfectly well train, for example, a 4-decoder-layer distilled model on European languages (easily done by pseudo-labeling each set with the correct `--language` flag, as explained in language-mixing). In fact, the language-mixing experiments showed that mixing close languages can improve the model's performance.
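To make this concrete, here is a minimal usage sketch assuming such a multilingually distilled checkpoint exists (the model id below is purely hypothetical). Since the student keeps Whisper's vocabulary and special tokens, the target language can simply be selected at generation time:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Hypothetical checkpoint of a student distilled jointly on several languages.
model_id = "your-org/distil-whisper-multilingual"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# 16 kHz mono audio as a float array (placeholder: 5 seconds of silence).
audio = np.zeros(16_000 * 5, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# The multilingual language/task tokens are still in the vocabulary,
# so the target language is chosen at generation time.
for lang in ("fr", "es"):
    ids = model.generate(inputs.input_features, language=lang, task="transcribe")
    print(lang, processor.batch_decode(ids, skip_special_tokens=True)[0])
```

Nothing in the inference code changes when moving from a monolingual to a multilingual distilled student; only the data it was distilled on determines which languages it transcribes well.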
