CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq #551

Open · wants to merge 21 commits into main
Conversation

iejMac (Contributor) commented Jun 26, 2023

No description provided.

iejMac (Contributor, Author) commented Jun 26, 2023

@gpucce thoughts? Is there some way init_parameters could've somehow been called?

gpucce (Contributor) commented Jun 26, 2023

No, I think it used the default ones. I think the VisionTransformer doesn't call it either?

I mean, it calls it, but it does nothing.
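For readers following along, a minimal sketch of what "it calls it but it does nothing" looks like; the class and method names mirror open_clip's transformer.py, but the body is illustrative rather than the actual source:

```python
import torch
import torch.nn as nn


class VisionTransformer(nn.Module):
    # Illustrative skeleton only; the proj init mirrors the common
    # width ** -0.5 * randn pattern rather than the exact open_clip code.
    def __init__(self, width: int = 768, output_dim: int = 512):
        super().__init__()
        self.proj = nn.Parameter(width ** -0.5 * torch.randn(width, output_dim))
        self.init_parameters()  # the call is made in __init__ ...

    def init_parameters(self):
        # ... but the body is (effectively) commented out, so the model keeps
        # its construction-time defaults: the randn above plus PyTorch's own
        # per-layer initializers.
        pass
```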

iejMac (Contributor, Author) commented Jun 26, 2023

but the text tower does call it, and what also confuses me is how text_projection works, since it's initialized with torch.empty:

self.text_projection = nn.Parameter(torch.empty(width, output_dim))
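For context on why torch.empty nonetheless works for the text tower: empty allocates uninitialized memory, so something has to overwrite it before use. A hedged sketch of that pattern, where the std value is an assumption based on the width ** -0.5 scaling used elsewhere rather than the exact open_clip number:

```python
import torch
import torch.nn as nn

width, output_dim = 512, 512

# torch.empty allocates uninitialized memory; the values are whatever happens
# to be in that allocation, so they must be overwritten before training.
text_projection = nn.Parameter(torch.empty(width, output_dim))

# The text tower's init_parameters overwrites the projection, roughly like this:
nn.init.normal_(text_projection, std=width ** -0.5)

# Without an equivalent call in MultimodalTransformer, its projection would
# stay uninitialized, which is the concern raised here.
```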

rom1504 (Collaborator) commented Jun 26, 2023

First fix the CI, then see my comment in #550.

iejMac changed the title from "transformer.py: MultimodalTransformer not using init_parameters" to "CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq" on Jun 27, 2023
iejMac mentioned this pull request on Jun 27, 2023
gpucce (Contributor) commented Jun 28, 2023

@iejMac I added one more change that should make this ready for the tentative retraining.

iejMac (Contributor, Author) commented Jul 19, 2023

@rom1504 thoughts on this? These changes don't break current checkpoints, fix one issue, and actually initialize the MultimodalTransformer. I can try to start a run on Stability.

gpucce (Contributor) commented Aug 7, 2023

@rwightman @rom1504 @iejMac hi, I worked on this PR; as it stands it has a few changes in tests, adds transformers compatibility, and fixes the issues. This is the best-working initialization I have found, checking only up to 18 epochs.

If you have some time to check, this would be useful.

@JeniaJitsev I used a bit of laionize for this, but hopefully it will have a positive effect on mammut too.

This is the report of the first 18 epochs compared to the old CoCa run: https://wandb.ai/gpucce/coca_tests/reports/CoCa-V2-ViT-B-32---Vmlldzo1MDc4NTkz

rwightman (Collaborator) commented Oct 11, 2023

@gpucce discussing here so I might possibly combine this with the #660 checks. This was days before my second child was born, so yeah, it got lost in the stack, but I did take a peek (and subsequently forgot).

The cls mask change, what does it do to existing trained CoCa models?

The weight init was commented out in the ViT because that's how OpenAI left it for the vision tower (and I thought we might try a different init some day); it relied on the default PyTorch init. But there they used randn for all Parameter attributes.

The empty is indeed a WTF. I feel following the text encoder init is a better default than going fully with PyTorch defaults for the multimodal decoder though, right? Any reason why you wanted to comment it all out? You could just add the call to init_parameters() and tweak the projection init if zeros were more desirable.
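A hedged sketch of that last option, i.e. keeping the init_parameters() call, following the text-encoder init scheme, and only tweaking the projection init. The attribute names (width, layers, resblocks, cross_attn, text_projection) and the std formulas are assumptions meant to mirror transformer.py, not a verbatim patch:

```python
import torch.nn as nn


def init_parameters(self):
    # GPT-style scaled-normal init, as used for the text encoder (assumed stds).
    proj_std = (self.width ** -0.5) * ((2 * self.layers) ** -0.5)
    attn_std = self.width ** -0.5
    fc_std = (2 * self.width) ** -0.5
    # Treat self-attention and cross-attention blocks the same way.
    for block in [*self.resblocks, *self.cross_attn]:
        nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
        nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
        nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
        nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
    if self.text_projection is not None:
        # The one knob under discussion: zeros vs. a small scaled normal.
        nn.init.zeros_(self.text_projection)
        # alternatively: nn.init.normal_(self.text_projection, std=self.width ** -0.5)
```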

gpucce (Contributor) commented Oct 11, 2023

The cls mask change appears not to affect the performance of existing models in either zero-shot classification or captioning. This is sort of expected, because as it is it only prevents a few tokens from attending to the cls token, while the other way around works fine. I can rerun evals to confirm.

The .empty was my mistake from the final refactoring; I had an nn.Linear in the CoCa model and lost it while moving things into transformer.py.

About the init, I was going along with the vision tower, but indeed doing the same as for text could be better. I couldn't run more experiments; even compared to the zero init, it could be that a small enough randn is slightly better. If you decide on the text init plus zeros, I can run a B-32 for some epochs as a perf test on your PR branch, which would work as a small sanity check of the PR's effect on the model.
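To make the cls-mask point above concrete, here is an illustrative construction (not taken from the PR diff) of a causal attention mask combined with a CLS mask when the CLS token is appended at the end of the sequence. With the CLS token last, causality already keeps earlier text tokens from attending to it, while the CLS row is opened up so it can attend to the whole sequence:

```python
import torch

seq_len = 5  # e.g. 4 text tokens plus a CLS token appended at the end

# Additive attention mask: 0 = may attend, -inf = blocked (rows are queries).
causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)

# Let the CLS token (last row) attend to every position, ignoring causality.
attn_mask = causal.clone()
attn_mask[-1, :] = 0.0

print(attn_mask)
# Rows 0..3 (text tokens) cannot attend to the later CLS position, while
# row 4 (the CLS query) attends to everything: the "other way around" that
# is noted above as working fine.
```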
