
I want to confirm how the knowledge distillation is implemented? #73

Open
hxypqr opened this issue Jan 12, 2024 · 1 comment

hxypqr commented Jan 12, 2024

I don't quite understand how knowledge distillation is implemented here.

Whisper is trained autoregressively on 680,000 hours of audio. According to Section 4 of the paper, the distilled model is trained on 21,170 hours of data with pseudo-labels generated by Whisper, with the first and 32nd layers initialised from Whisper and frozen. Does this mean the distilled model only needs to see 21,170 hours of pseudo-labelled data, with a model structure similar to Whisper, the first and 32nd layers frozen, and a weighted KL divergence plus cross-entropy on the labels as the objective, in order to achieve good results?
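
For reference, a minimal PyTorch sketch of such a weighted KL divergence plus pseudo-label cross-entropy objective could look like the following; the temperature and loss weights here are illustrative placeholders, not the values used in the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      temperature=2.0, kl_weight=1.0, ce_weight=1.0):
    """Weighted KL divergence between student and teacher token distributions,
    plus cross-entropy against the Whisper-generated pseudo-labels.
    All weights/temperature are placeholder values for illustration."""
    # KL term: match the student's next-token distribution to the teacher's.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Cross-entropy term against the pseudo-label targets (-100 = padding).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        pseudo_labels.view(-1),
        ignore_index=-100,
    )
    return kl_weight * kl + ce_weight * ce
```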

If that is the case, it is a significant finding: it suggests that, after pre-training, we can use similar methods to reduce a model's parameter count and inference time without a significant loss of accuracy.

Thank you in advance

@sanchit-gandhi (Collaborator)

That's almost right! We freeze the entire encoder (32 layers) and take the first and last layers of the decoder (2 layers). We then train the student with the knowledge distillation objective on 22k hours of data. You can read more about how we initialise and train the model here: https://github.com/huggingface/distil-whisper#3-approach-%EF%B8%8F
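
As a rough illustration of that initialisation, here is a minimal sketch using the Hugging Face transformers Whisper classes, assuming openai/whisper-large-v2 as the teacher (32 encoder and 32 decoder layers). The repo's training scripts handle this more carefully, so treat this as a sketch rather than the actual implementation:

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher: Whisper large-v2 (32 encoder layers, 32 decoder layers).
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Student: same config, but only 2 decoder layers.
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v2", decoder_layers=2)
student = WhisperForConditionalGeneration(student_config)

# Copy the full encoder and the shared decoder components from the teacher.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())

# Initialise the student's 2 decoder layers from the teacher's first and last decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())
```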

Because the model is pre-trained on so much data, the encoder representation of the audio is extremely good. We then only need to train the first and last decoder layers to behave like the full original 32 decoder layers, which requires less data than full pre-training.
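
Continuing the sketch above, freezing the encoder leaves only the two copied decoder layers (plus the decoder embeddings and final layer norm) trainable, which is a small fraction of the full model:

```python
# Freeze the encoder so only the student decoder is trained.
student.model.encoder.requires_grad_(False)

trainable = sum(p.numel() for p in student.parameters() if p.requires_grad)
total = sum(p.numel() for p in student.parameters())
print(f"trainable parameters: {trainable / 1e6:.0f}M of {total / 1e6:.0f}M")
```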

Section 9.2 of the paper gives a nice analysis of the effect of dataset size for distillation: https://arxiv.org/pdf/2311.00430.pdf
