Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding imagebind #30690

Open
wants to merge 75 commits into
base: main
Choose a base branch
from
Open

Conversation

EduardoPach
Copy link
Contributor

@EduardoPach EduardoPach commented May 7, 2024

What does this PR do?

This PR fixes #23240 by adding ImageBind model.

This is based on #26310 which is currently stale and the author said it would not have time to work on it (though welcome to help @dg845 ).

Taking into consideration the points raised by @dg845 here #26310 (comment) I'll focus on adding the text/image/audio portion and try to contact the authors.

Who can Review

@amyeroberts (?)

…MU) and update config classes for text and image modalities.
…ImageBind follows Audio Spectrogram Transformer audio processing).
…s/image processors to ImageBind's __init__.py file.
@EduardoPach EduardoPach changed the title [WIP] Adding imagebind Adding imagebind May 23, 2024
@EduardoPach
Copy link
Contributor Author

@amyeroberts I think we're good for a first review!

There are a few points that we should have in mind:

  1. I have converted only text, vision, and audio modalities so far as those are the only modalities that have their preprocess steps in the original repo. There are a few issues in the original repo from people mentioning how they've managed the preprocessing of depth and thermal so if we want to we can add these as well.
  2. I'm not sure how we should deal with videos as they fall in the vision modality and thus should be passed as pixel_values as well. Preprocessing is equal to the image version + a frame selection logic. Should we have a videos arg in the ImageBindImageProcessor (currently this processor is just a copy from CLIP)
  3. In the case we add thermal and depth as well following the issues in the original repo should these be passed in the ImageBindImageProcessor as well?
  4. When pushing ImageBindProcessor to the hub I've noticed that the args for ImageBindImageProcessor and ImageBindFeatureExtractor are within the same file preprocessor_config.json which is not ideal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New model] ImageBind: One Embedding Space To Bind Them All
2 participants