How to extract 1024 width patch embeddings and CLS embedding #844

Open
alvaro-stylesage opened this issue Mar 21, 2024 · 1 comment

alvaro-stylesage commented Mar 21, 2024

Hello, I have seen that the encode_image, _encode_image and forward methods all return img_latents and img_embeds with dimension 768, i.e. after the last projection layer. However, the /open_clip/model_configs/coca_ViT-L-14.json file specifies that the width of the vision encoder is 1024. I have 2 questions:

  1. Why does img_embeds have shape (1, 255, 768) for one image when there should be 256 patches?
  2. How can I get the raw 1024-dimensional embeddings from the vision encoder, before the projection?

Thanks!
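
The counts behind the first question can be checked with a little arithmetic. The sketch below assumes a 224-pixel input and a patch size of 14 (neither is stated in this thread, though a 16 x 16 grid matches the 256 patches mentioned), and it assumes an attentional pooler with 256 output queries whose first output is split off as the image latent. That last part is only one reading that is consistent with the observed 255, not a confirmed description of the model.

```python
# All three constants are assumptions for illustration, not values quoted in this thread.
IMAGE_SIZE = 224       # assumed input resolution
PATCH_SIZE = 14        # assumed patch size (the "14" in ViT-L-14)
N_POOL_QUERIES = 256   # assumed number of attentional-pooler queries

grid = IMAGE_SIZE // PATCH_SIZE       # patches per side
n_patches = grid * grid               # patch tokens produced by the encoder
n_tokens_with_cls = n_patches + 1     # patch tokens plus one CLS token
n_token_embeds = N_POOL_QUERIES - 1   # pooled outputs left after splitting off the first

print(n_patches, n_tokens_with_cls, n_token_embeds)
```

Under these assumptions the 255 rows would be pooled query outputs rather than per-patch features, which would explain why the count does not line up with the 256 patches.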

@rwightman
Collaborator

@alvaro-stylesage the coca embeds are a bit wrong... #458 (comment)

it 'works' but it's not 100% correct
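
For the second question, one generic way to read features before a projection is a PyTorch forward hook on the encoder trunk. The sketch below is not open_clip code: `ToyVisionTower` is a hypothetical stand-in with a 1024-wide trunk and a 768-wide projection, used only to show the hook pattern and the resulting shapes.

```python
import torch
import torch.nn as nn

WIDTH, EMBED_DIM = 1024, 768  # encoder width and projected size discussed in the thread


class ToyVisionTower(nn.Module):
    """Hypothetical stand-in: a 1024-wide trunk followed by a 768-wide projection."""

    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(WIDTH, WIDTH)  # placeholder for the real trunk
        self.proj = nn.Linear(WIDTH, EMBED_DIM)     # final projection

    def forward(self, tokens):
        return self.proj(self.transformer(tokens))


captured = {}


def save_output(module, inputs, output):
    # Keep a detached copy of whatever the hooked submodule returns.
    captured["raw"] = output.detach()


tower = ToyVisionTower().eval()
handle = tower.transformer.register_forward_hook(save_output)

with torch.no_grad():
    # Dummy input: 1 image, 1 CLS token + 256 patch tokens, each 1024 wide.
    projected = tower(torch.randn(1, 257, WIDTH))
handle.remove()  # detach the hook once the features are captured

raw = captured["raw"]
cls_embedding, patch_embeddings = raw[:, 0], raw[:, 1:]
print(projected.shape, cls_embedding.shape, patch_embeddings.shape)
```

On the real model the analogous target would presumably be `model.visual.transformer`. Treat that as an assumption to verify against the installed open_clip version: the hooked output may be laid out as (sequence, batch, width) rather than (batch, sequence, width), and it is taken before any final layer norm or pooling.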
