CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq #551

Open · wants to merge 21 commits into main
Conversation

iejMac (Contributor) commented Jun 26, 2023

No description provided.

iejMac (Contributor, Author) commented Jun 26, 2023

@gpucce thoughts? Is there some way init_parameters could've somehow been called?

gpucce (Contributor) commented Jun 26, 2023

No, I think it used the default ones. I think the VisionTransformer doesn't call it either?

I mean, it calls it, but it does nothing.
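For readers following along, a minimal sketch of what "it calls it but it does nothing" looks like; the class and method names mirror open_clip's transformer.py, but the body is illustrative rather than the actual source:

```python
import torch
import torch.nn as nn


class VisionTransformer(nn.Module):
    # Illustrative skeleton only; the proj init mirrors the common
    # width ** -0.5 * randn pattern rather than the exact open_clip code.
    def __init__(self, width: int = 768, output_dim: int = 512):
        super().__init__()
        self.proj = nn.Parameter(width ** -0.5 * torch.randn(width, output_dim))
        self.init_parameters()  # the call is made in __init__ ...

    def init_parameters(self):
        # ... but the body is (effectively) commented out, so the model keeps
        # its construction-time defaults: the randn above plus PyTorch's own
        # per-layer initializers.
        pass
```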

iejMac (Contributor, Author) commented Jun 26, 2023

but the text tower does call it, and what also confuses me is how text_projection works, since it's initialized with torch.empty:

self.text_projection = nn.Parameter(torch.empty(width, output_dim))
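For context on why torch.empty nonetheless works for the text tower: empty allocates uninitialized memory, so something has to overwrite it before use. A hedged sketch of that pattern, where the std value is an assumption based on the width ** -0.5 scaling used elsewhere rather than the exact open_clip number:

```python
import torch
import torch.nn as nn

width, output_dim = 512, 512

# torch.empty allocates uninitialized memory; the values are whatever happens
# to be in that allocation, so they must be overwritten before training.
text_projection = nn.Parameter(torch.empty(width, output_dim))

# The text tower's init_parameters overwrites the projection, roughly like this:
nn.init.normal_(text_projection, std=width ** -0.5)

# Without an equivalent call in MultimodalTransformer, its projection would
# stay uninitialized, which is the concern raised here.
```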

rom1504 (Collaborator) commented Jun 26, 2023

First fix the CI, then see my comment in #550.

iejMac changed the title from "transformer.py: MultimodalTransformer not using init_parameters" to "CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq" on Jun 27, 2023
iejMac mentioned this pull request on Jun 27, 2023
gpucce (Contributor) commented Jun 28, 2023

@iejMac I added one more change that should make this ready for the tentative retraining.

iejMac (Contributor, Author) commented Jul 19, 2023

@rom1504 thoughts on this? These changes don't break current checkpoints, fix one issue, and actually initialize the MultimodalTransformer. I can try to start a run on Stability.

gpucce (Contributor) commented Aug 7, 2023

@rwightman @rom1504 @iejMac hi, I worked on this PR; as it stands it has a few changes in tests, adds transformers compatibility, and fixes the issues. This is the best-working initialization I have found, checking only up to 18 epochs.

If you have some time to check, this would be useful.

@JeniaJitsev I used a bit of laionize for this, but hopefully it will have a positive effect on mammut too.

This is the report of the first 18 epochs compared to the old CoCa run: https://wandb.ai/gpucce/coca_tests/reports/CoCa-V2-ViT-B-32---Vmlldzo1MDc4NTkz

rwightman (Collaborator) commented Oct 11, 2023

@gpucce discussing here so I might possibly combine this with the #660 checks. This was days before my second child was born, so yeah, it got lost in the stack, but I did take a peek (and subsequently forgot).

The cls mask change, what does it do to existing trained CoCa models?

The weight init was commented out in the ViT because that's how OpenAI left it for the vision tower (and I thought we might try a different init some day); it relied on the default PyTorch init. But there they used randn for all Parameter attributes.

The empty is indeed a WTF. I feel following the text encoder init is a better default than going fully with PyTorch defaults for the multimodal decoder though, right? Any reason why you wanted to comment it all out? You could just add the call to init_parameters() and tweak the projection init if zeros were more desirable.
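A hedged sketch of that last option, i.e. keeping the init_parameters() call, following the text-encoder init scheme, and only tweaking the projection init. The attribute names (width, layers, resblocks, cross_attn, text_projection) and the std formulas are assumptions meant to mirror transformer.py, not a verbatim patch:

```python
import torch.nn as nn


def init_parameters(self):
    # GPT-style scaled-normal init, as used for the text encoder (assumed stds).
    proj_std = (self.width ** -0.5) * ((2 * self.layers) ** -0.5)
    attn_std = self.width ** -0.5
    fc_std = (2 * self.width) ** -0.5
    # Treat self-attention and cross-attention blocks the same way.
    for block in [*self.resblocks, *self.cross_attn]:
        nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
        nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
        nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
        nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
    if self.text_projection is not None:
        # The one knob under discussion: zeros vs. a small scaled normal.
        nn.init.zeros_(self.text_projection)
        # alternatively: nn.init.normal_(self.text_projection, std=self.width ** -0.5)
```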

gpucce (Contributor) commented Oct 11, 2023

The cls mask change appears not to affect the performance of existing models in either zero-shot classification or captioning. This is sort of expected, because as it is it only prevents a few tokens from attending to the cls token, while the other way around works fine. I can rerun evals to confirm.

The .empty was my mistake from the final refactoring; I had an nn.Linear in the CoCa model and lost it while moving things into transformer.py.

About the init, I was going along with the vision tower, but indeed doing the same as for text could be better. I couldn't run more experiments; even compared to the zero init, it could be that a small enough randn is slightly better. If you decide on the text init plus zeros, I can run a B-32 for some epochs as a perf test on your PR branch, which would work as a small sanity check of the PR's effect on the model.
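To make the cls-mask point above concrete, here is an illustrative construction (not taken from the PR diff) of a causal attention mask combined with a CLS mask when the CLS token is appended at the end of the sequence. With the CLS token last, causality already keeps earlier text tokens from attending to it, while the CLS row is opened up so it can attend to the whole sequence:

```python
import torch

seq_len = 5  # e.g. 4 text tokens plus a CLS token appended at the end

# Additive attention mask: 0 = may attend, -inf = blocked (rows are queries).
causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)

# Let the CLS token (last row) attend to every position, ignoring causality.
attn_mask = causal.clone()
attn_mask[-1, :] = 0.0

print(attn_mask)
# Rows 0..3 (text tokens) cannot attend to the later CLS position, while
# row 4 (the CLS query) attends to everything: the "other way around" that
# is noted above as working fine.
```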
