Question about Output Quality Difference Between Local and Online Demo for MBZUAI/GLaMM-FullScope #39

Jayce1kk opened this issue Mar 26, 2024 · 3 comments

@Jayce1kk

Hello,

I've successfully run the demo locally and obtained output results. However, I've noticed that the output quality differs significantly from what is showcased in the online demo, with the local results being notably inferior. I'm currently using MBZUAI/GLaMM-FullScope for my tests. Could you please shed some light on why there might be such a discrepancy between the two?

Thank you for your assistance.

@hanoonaR
Member

Hi @Jayce1kk,

Thank you for your interest in our work. To assist further, could you provide the image, the prompt used, and a screenshot highlighting the differences between the model outputs? We will try to replicate and address the issue. If necessary, we can also suggest an alternative checkpoint for you to try. Thank you.

@Jayce1kk
Author

(attached image: ballon) Hi, thank you very much for your reply! I'm using the provided example. The code settings in app.py have not changed; I've only changed the model path to my local path.

hanoonaR self-assigned this Mar 27, 2024
@hanoonaR
Member

Hi @Jayce1kk,

Thank you for providing more details about your setup. The difference you're noticing is due to the different checkpoints and model versions used in our live demo and in your local setup: our demo model is built on LLaVA 1.0, whereas the released code and models, including MBZUAI/GLaMM-FullScope, are built on LLaVA 1.5.

To achieve results similar to our live demo, please use this checkpoint: GLaMM-FullScope_v0. You'll want to make a few adjustments (sketched in code after the list):

  1. Change the --vision_tower from openai/clip-vit-large-patch14-336 to openai/clip-vit-large-patch14. This modification is needed because the input size of the global image encoder (the CLIP image encoder) is 224 instead of 336. You'll also need to adjust the token lengths in these lines (here and here) from 575 to 255, reflecting the change in input dimensions (224/14 = 16, resulting in 16*16 = 256 tokens).

  2. Update the V-L projection layer (LLaVA 1.0 uses a single linear layer):
    Replace self.mm_projector = nn.Sequential(*modules) with self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
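
For adjustment (1), here is a minimal sketch of what changes, assuming the demo exposes the vision tower via a --vision_tower argument as in the released code; the constant name NUM_IMAGE_PATCH_TOKENS is illustrative and stands in for the two hard-coded token lengths linked above.

```python
# Adjustment (1): switch the global image encoder to the 224-px CLIP model
# and shrink the hard-coded image-token length accordingly.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--vision_tower",
    default="openai/clip-vit-large-patch14",  # was "openai/clip-vit-large-patch14-336"
)
args = parser.parse_args([])

# The token length follows from the ViT patch grid:
#   336 / 14 = 24  ->  24 * 24 = 576 patches  (the code uses 575)
#   224 / 14 = 16  ->  16 * 16 = 256 patches  (the code uses 255)
NUM_IMAGE_PATCH_TOKENS = 255  # illustrative stand-in; was 575 in the two linked lines
```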
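
For adjustment (2), the sketch below shows the single linear V-L projection layer expected by the LLaVA 1.0-based checkpoint; the surrounding module is a stand-in for wherever self.mm_projector is constructed in your local copy, and config is assumed to carry mm_hidden_size and hidden_size as in the released code.

```python
# Adjustment (2): replace the LLaVA 1.5 MLP projector with a single linear layer.
import torch.nn as nn


class VisionLanguageProjectorStub(nn.Module):
    """Illustrative stand-in for the model class that builds self.mm_projector."""

    def __init__(self, config):
        super().__init__()
        # was: self.mm_projector = nn.Sequential(*modules)
        self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)

    def forward(self, image_features):
        # Project CLIP image features into the language model's embedding space.
        return self.mm_projector(image_features)
```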

We hope these adjustments help you reproduce the demo results locally. Please don't hesitate to reach out if you encounter any issues or have further questions.
