Question about Output Quality Difference Between Local and Online Demo for MBZUAI/GLaMM-FullScope #39

Jayce1kk opened this issue Mar 26, 2024 · 3 comments

@Jayce1kk

Hello,

I've successfully run the demo locally and obtained output results. However, I've noticed that the output quality differs significantly from what is showcased in the online demo, with the local results being notably inferior. I'm currently using MBZUAI/GLaMM-FullScope for my tests. Could you please shed some light on why there might be such a discrepancy between the two?

Thank you for your assistance.

@hanoonaR
Member

Hi @Jayce1kk,

Thank you for your interest in our work. To assist further, could you provide the image, the prompt used, and a screenshot highlighting the differences between the model outputs? We will try to replicate and address the issue. If necessary, we can also suggest an alternative checkpoint for you to try. Thank you.

@Jayce1kk
Author

(attached image: ballon) Hi, thank you very much for your reply! I'm using the provided example. The code settings in app.py have not changed; I've only changed the model path to my local path.

hanoonaR self-assigned this Mar 27, 2024
@hanoonaR
Member

Hi @Jayce1kk,

Thank you for providing more details about your setup. The difference you're noticing is due to the different checkpoints and model versions used in our live demo and in your local setup: our demo model is built on LLaVA 1.0, whereas the released code and models, including MBZUAI/GLaMM-FullScope, are built on LLaVA 1.5.

To achieve results similar to our live demo, please use this checkpoint: GLaMM-FullScope_v0. You'll want to make a few adjustments (sketched in code after the list):

  1. Change the --vision_tower from openai/clip-vit-large-patch14-336 to openai/clip-vit-large-patch14. This modification is needed because the input size of the global image encoder (the CLIP image encoder) is 224 instead of 336. You'll also need to adjust the token lengths in these lines (here and here) from 575 to 255, reflecting the change in input dimensions (224/14 = 16, resulting in 16*16 = 256 tokens).

  2. Update the V-L projection layer (LLaVA 1.0 uses a single linear layer):
    Replace self.mm_projector = nn.Sequential(*modules) with self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
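
For adjustment (1), here is a minimal sketch of what changes, assuming the demo exposes the vision tower via a --vision_tower argument as in the released code; the constant name NUM_IMAGE_PATCH_TOKENS is illustrative and stands in for the two hard-coded token lengths linked above.

```python
# Adjustment (1): switch the global image encoder to the 224-px CLIP model
# and shrink the hard-coded image-token length accordingly.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--vision_tower",
    default="openai/clip-vit-large-patch14",  # was "openai/clip-vit-large-patch14-336"
)
args = parser.parse_args([])

# The token length follows from the ViT patch grid:
#   336 / 14 = 24  ->  24 * 24 = 576 patches  (the code uses 575)
#   224 / 14 = 16  ->  16 * 16 = 256 patches  (the code uses 255)
NUM_IMAGE_PATCH_TOKENS = 255  # illustrative stand-in; was 575 in the two linked lines
```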
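
For adjustment (2), the sketch below shows the single linear V-L projection layer expected by the LLaVA 1.0-based checkpoint; the surrounding module is a stand-in for wherever self.mm_projector is constructed in your local copy, and config is assumed to carry mm_hidden_size and hidden_size as in the released code.

```python
# Adjustment (2): replace the LLaVA 1.5 MLP projector with a single linear layer.
import torch.nn as nn


class VisionLanguageProjectorStub(nn.Module):
    """Illustrative stand-in for the model class that builds self.mm_projector."""

    def __init__(self, config):
        super().__init__()
        # was: self.mm_projector = nn.Sequential(*modules)
        self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)

    def forward(self, image_features):
        # Project CLIP image features into the language model's embedding space.
        return self.mm_projector(image_features)
```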

We hope these adjustments help you reproduce the demo results locally. Please don't hesitate to reach out if you encounter any issues or have further questions.
