
Question about choosing multi-image input mode and replacing image decoder #279

charlierabea opened this issue Sep 24, 2023 · 3 comments
Labels: area:dataset (dataset related)

Comments


charlierabea commented Sep 24, 2023

Thank you for your interest.

Handling multiple images with a single instruction can easily be done by following the dataset format used here:

def process_spot_the_difference(self, instruction_id, instruction, answer, image_ids, in_context_example_ids):

To achieve this, you may follow these steps:

  1. Format your data following the guidelines provided here: https://github.com/Luodian/Otter/tree/main/mimic-it. Assume the prefix of your instruction id is "MED", like so:
"MED_INS_00001": {
            "instruction":"XXX",
            "answer":"XXX.",
            "image_ids":["XXX",",..."], # The multi-images corresponding to this instruction
            "rel_ins_ids":[], # This value can be []. If you have a multi-round conversation, it should be filled with the instruction ids of the other rounds.
        },
  2. Modify this line from:
elif cur_train_id.startswith("SD"): 

to:

elif cur_train_id.startswith("SD") or cur_train_id.startswith("MED"): 

This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.

  3. Begin tuning Otter on your data by changing the instruction/image/train configuration from:
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \

to:

--mimicit_vt_path="path/to/MED_instruction.json" \
--images_vt_path="path/to/MED.json" \

If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.

Originally posted by @ZhangYuanhan-AI in #234 (comment)
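
To make the quoted steps concrete, here is a minimal sketch of assembling the two MED files in the MIMIC-IT layout; the file names follow the flags above, while the base64 image encoding and the helper itself are assumptions for illustration rather than code from the Otter repo:

import base64
import json
from pathlib import Path

def build_med_files(patients, instruction_text, answers):
    """patients: {patient_id: ordered list of slice image Paths};
    answers: {patient_id: report/caption text}."""
    images = {}  # MED.json: image_id -> base64-encoded image bytes
    data = {}    # MED_instruction.json: instruction_id -> entry

    for idx, (patient_id, slice_paths) in enumerate(patients.items()):
        image_ids = []
        for s_idx, path in enumerate(slice_paths, start=1):
            image_id = f"MED_IMG_{patient_id}_{s_idx}"
            images[image_id] = base64.b64encode(path.read_bytes()).decode("utf-8")
            image_ids.append(image_id)

        data[f"MED_INS_{idx:05d}"] = {
            "instruction": instruction_text,
            "answer": answers[patient_id],
            "image_ids": image_ids,
            "rel_ins_ids": [],  # single-round conversation per patient
        }

    Path("MED.json").write_text(json.dumps(images))
    Path("MED_instruction.json").write_text(
        json.dumps({"meta": {"version": "", "time": "", "author": ""}, "data": data}, indent=2)
    )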

I was delighted to stumble upon this remarkable project. Thank you for your valuable contribution.

I am now working on a medical image captioning task (multiple slices and one description per patient). Following the comment above, I formed the training data MED.json and MED_instruction.json. Here's what the instruction JSON looks like:
{
  "meta": {
    "version": "",
    "time": "",
    "author": ""
  },
  "data": {
    "test_INS_00000": {
      "instruction": "",
      "answer": ".\n ",
      "image_ids": [
        "MED_IMG_1",
        "MED_IMG_2",
        "MED_IMG_3",
        "MED_IMG_4",
        "MED_IMG_5",
        "MED_IMG_6",
        "MED_IMG_7",
        "MED_IMG_8",
        "MED_IMG_9",
        "MED_IMG_10",
        "MED_IMG_11",
        "MED_IMG_12",
        "MED_IMG_13",
        "MED_IMG_14",
        "MED_IMG_15",
        "MED_IMG_16",
        "MED_IMG_17",
        "MED_IMG_18",
        "MED_IMG_19",
        "MED_IMG_20",
        "MED_IMG_21",
        "MED_IMG_22",
        "MED_IMG_23",
        "MED_IMG_24"
      ],
      "rel_ins_ids": []
    },
    .....
  }
}
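
To make sure every MED_IMG_* id referenced here actually exists in MED.json, I run a quick check along these lines (a sketch; it assumes the images JSON is keyed by image id, as in the MIMIC-IT format):

import json

with open("MED_instruction.json") as f:
    instructions = json.load(f)["data"]
with open("MED.json") as f:
    image_keys = set(json.load(f).keys())

# Collect, per instruction, any image ids not present in the images file.
missing = {
    ins_id: [img for img in entry["image_ids"] if img not in image_keys]
    for ins_id, entry in instructions.items()
}
missing = {ins_id: imgs for ins_id, imgs in missing.items() if imgs}
print(f"{len(instructions)} instructions checked; {len(missing)} reference missing images")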

The version of Otter I'm using is the 8/17 commit, and I have successfully generated captions and evaluated them with BLEU and CIDEr. However, I noticed by chance that VQA mode performs on par with SD mode, and that different instructions lead to noticeably different performance. Does that mean SD mode doesn't suit my training scenario, and that VQA mode can help me test my instructions?

Furthermore, I'm trying to use the BiomedCLIP vision encoder, as the LLaVA-Med paper did. However, the 0817 instruction_following.py has no customized_config statement, and adding the customized_config statements from the 0830 commit's instruction_following.py does nothing: the resulting checkpoint config still says CLIP.

Here's the config.json I created as the 0830 commit suggested.
{
  "model_type": "otter",
  "cross_attn_every_n_layers": 4,
  "tie_word_embeddings": false,
  "use_media_placement_augmentation": true,
  "only_attend_previous": true,
  "text_config": {
    "_name_or_path": "luodian/llama-7b-hf",
    "model_type": "llama"
  },
  "vision_config": {
    "_name_or_path": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224",
    "model_type": "clip_vision_model",
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "image_size": 224,
    "patch_size": 16
  }
}
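
For reference, this is how I check whether the swap took effect in the saved checkpoint (a minimal sketch; the checkpoint path is a placeholder):

import json

# Read the config.json written alongside the saved checkpoint.
with open("path/to/otter_checkpoint/config.json") as f:
    cfg = json.load(f)

vision_cfg = cfg.get("vision_config", {})
print(vision_cfg.get("_name_or_path"))  # expecting microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
print(vision_cfg.get("model_type"))     # still reports the CLIP vision model type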

Looking forward to exploring this topic and to citing you and your colleagues in any resulting publication!

king159 added the area:dataset (dataset related) label on Sep 25, 2023
ZhangYuanhan-AI (Collaborator) commented:

  • Does that mean SD mode doesn't suit my training scenario, and VQA mode can help me test my instructions?

    In your case, where one instruction is paired with multiple images, we recommend using SD mode. Although SD mode and VQA mode may reach similar performance on your data, SD mode is the logically appropriate choice for your data construction scenario.

charlierabea (Author) commented:

  • Does that mean SD mode doesn't suit my training scenario, and VQA mode can help me test my instructions?

    In your case, where one instruction is paired with multiple images, we recommend using SD mode. Although SD mode and VQA mode may reach similar performance on your data, SD mode is the logically appropriate choice for your data construction scenario.

Thank you so much for your reply. We'll continue with our SD experiments.
Regarding the vision encoder, do you have any suggestions for how to replace it?

ZhangYuanhan-AI (Collaborator) commented:

One possible solution is to inject the parameters of "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224" into the Otter checkpoint.
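
A rough sketch of what that injection might look like (the open_clip call follows the BiomedCLIP model card; the Otter checkpoint layout and the key mapping are assumptions, and since BiomedCLIP's vision tower is a timm ViT, its tensor names will not line up one-to-one with Otter's CLIP vision encoder, so the mapping will need manual adjustment):

import torch
from open_clip import create_model_from_pretrained

# Load BiomedCLIP and keep only its vision-tower tensors.
biomedclip, _ = create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
vision_sd = {k: v for k, v in biomedclip.state_dict().items() if k.startswith("visual.")}

# Load the Otter checkpoint; the wrapper key and the "vision_encoder." prefix are assumptions.
ckpt = torch.load("path/to/otter/final_weights.pt", map_location="cpu")
state = ckpt.get("model_state_dict", ckpt)

replaced = 0
for k, v in vision_sd.items():
    target = "vision_encoder." + k[len("visual."):]
    if target in state and state[target].shape == v.shape:  # only copy when shapes agree
        state[target] = v
        replaced += 1

print(f"replaced {replaced} of {len(vision_sd)} vision tensors")
torch.save(ckpt, "path/to/otter/final_weights_biomedclip.pt")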
