
Some weights of OtterForConditionalGeneration were not initialized from the model #270

Open
xmc-andy opened this issue Sep 11, 2023 · 23 comments

Comments

@xmc-andy

Hello, I ran into the following output when testing my trained weights. I spent a long time trying to track down the cause, but unfortunately I haven't found it yet. Can you help me?
Earlier I used the official weights to train a baseline on my own classification data. The results were not very good, but the message "Some weights of OtterForConditionalGeneration were not initialized ... and are newly initialized" did not appear. After training another version of the model, this warning showed up when I tested it.

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:30<00:00, 7.62s/it]
Some weights of OtterForConditionalGeneration were not initialized from the model checkpoint at /mnt/large_model/weights/BC4-partScale-negAug3 and are newly initialized: ['vision_encoder.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

@Luodian
Owner

Luodian commented Sep 11, 2023

May I know your task type and which version of Otter model you are using for initialization?

@xmc-andy
Author

May I know your task type and which version of Otter model you are using for initialization?

I am doing a classification task with multiple images and a single prompt as input, in the SD dataset format, and the pretrained weights are "OTTER-Image-MPT7B".

@xmc-andy
Author

export PYTHONPATH=.

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path /mnt/large_model/weights/OTTER-Image-MPT7B_git \
    --mimicit_vt_path /mnt/large_model/output/XX/SD_instruction.json \
    --images_vt_path /mnt/large_model/output/XX/SD.json \
    --external_save_dir /mnt/large_model/output/XX/OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --batch_size 1 \
    --num_epochs 15 \
    --run_name OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --workers 24 \
    --lr_scheduler cosine \
    --learning_rate 1e-5 \
    --max-src-length 256 \
    --warmup_steps_ratio 0.01 \
    --save_ckpt_each_epoch \
    --delete_previous_checkpoint \
    --report_to_wandb

@Luodian
Owner

Luodian commented Sep 11, 2023

Does the missing-weights log appear when you directly load the model? You could set a breakpoint right after the loading process finishes and check.
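
For example, a minimal sketch of that check (it assumes OtterForConditionalGeneration follows the standard Hugging Face from_pretrained interface and that the repo-style import path works; the checkpoint path is illustrative):

    # Load the checkpoint directly and inspect what was reported as missing.
    # output_loading_info is a standard Transformers from_pretrained flag.
    from otter.modeling_otter import OtterForConditionalGeneration  # repo-style import (assumption)

    model, loading_info = OtterForConditionalGeneration.from_pretrained(
        "/path/to/your/checkpoint",  # illustrative path
        output_loading_info=True,
    )
    print(loading_info["missing_keys"])
    # breakpoint()  # pause here to inspect the model right after loading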

@xmc-andy
Author

When I load the pretrained weights you released, or the baseline weights I trained earlier, there is no missing-weights log, but it does appear when I load the newly trained model weights.
Sorry, I haven't been able to find where this log message comes from or why it is triggered.

@xmc-andy
Author

By the way, due to network problems I cannot download tokenizer_config.json from Hugging Face's MPT repo, so I downloaded the files offline from https://huggingface.co/mosaicml/mpt-7b-instruct (everything except the .bin weight files). The only change in modeling_otter.py is:

    text_tokenizer = AutoTokenizer.from_pretrained("/mnt/train_pipeline-master/Otter/mpt-7b-instruct")

@Luodian
Owner

Luodian commented Sep 11, 2023

Could you download the model's config from this path?

https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Image-MPT7B

The config.json should be in the following format:

{
  "_commit_hash": null,
  "_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",
  "architectures": [
    "OtterForConditionalGeneration"
  ],
  "cross_attn_every_n_layers": 4,
  "model_type": "otter",
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "MPTForCausalLM"
    ],
    "attn_config": {
      "alibi": true,
      "alibi_bias_max": 8,
      "attn_impl": "torch",
      "attn_pdrop": 0,
      "attn_type": "multihead_attention",
      "attn_uses_sequence_id": false,
      "clip_qkv": null,
      "prefix_lm": false,
      "qk_ln": false,
      "softmax_scale": null
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "d_model": 4096,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "emb_pdrop": 0,
    "embedding_fraction": 1.0,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "expansion_ratio": 4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "init_config": {
      "emb_init_std": null,
      "emb_init_uniform_lim": null,
      "fan_mode": "fan_in",
      "init_div_is_residual": true,
      "init_gain": 0,
      "init_nonlinearity": "relu",
      "init_std": 0.02,
      "name": "kaiming_normal_",
      "verbose": 0
    },
    "init_device": "cpu",
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "learned_pos_emb": true,
    "length_penalty": 1.0,
    "logit_scale": null,
    "max_length": 20,
    "max_seq_len": 2048,
    "min_length": 0,
    "model_type": "mpt",
    "n_heads": 32,
    "n_layers": 32,
    "no_bias": true,
    "no_repeat_ngram_size": 0,
    "norm_type": "low_precision_layernorm",
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "resid_pdrop": 0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "tokenizer_name": "EleutherAI/gpt-neox-20b",
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": false,
    "verbose": 0,
    "vocab_size": 50432
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "use_media_placement_augmentation": true,
  "vision_config": {
    "_name_or_path": "openai/clip-vit-large-patch14",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false
  }
}

Also, make sure you use the save_pretrained method to save checkpoints.

            unwrapped_model.save_pretrained(
                f"{args.external_save_dir}",
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save,
                state_dict=checkpoint_dict,
            )

@Luodian
Owner

Luodian commented Sep 11, 2023

The missing position_ids usually comes from the LLM part. Make sure you are using the latest branch code for loading the Otter model at initialization.

You can now try pip install -U otter_ai

and then from otter_ai import OtterForConditionalGeneration.

That will automatically handle the loading of modeling_mpt.py.
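
I.e., a minimal loading sketch via the pip package (the checkpoint path is illustrative; device_map="auto" assumes accelerate is installed):

    # pip install -U otter_ai
    from otter_ai import OtterForConditionalGeneration

    # Loading through the package resolves the bundled modeling_mpt.py automatically.
    model = OtterForConditionalGeneration.from_pretrained(
        "/path/to/your/checkpoint",  # illustrative path
        device_map="auto",
    )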

@xmc-andy
Author

xmc-andy commented Sep 11, 2023

I compared the config.json files. Except for "_name_or_path" and "transformers_version", everything matches what you posted, so that should not be the problem.
My workflow is to convert the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then load the result with from_pretrained. Could you tell me whether this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

{
"_commit_hash": null,
"_name_or_path": "/mnt/large_model/weights/OTTER-Image-MPT7B_git",
"architectures": [
"OtterForConditionalGeneration"
],
"cross_attn_every_n_layers": 4,
"model_type": "otter",
"text_config": {
"name_or_path": "",
"add_cross_attention": false,
"architectures": [
"MPTForCausalLM"
],
"attn_config": {
"alibi": true,
"alibi_bias_max": 8,
"attn_impl": "torch",
"attn_pdrop": 0,
"attn_type": "multihead_attention",
"attn_uses_sequence_id": false,
"clip_qkv": null,
"prefix_lm": false,
"qk_ln": false,
"softmax_scale": null
},
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"d_model": 4096,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"emb_pdrop": 0,
"embedding_fraction": 1.0,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"expansion_ratio": 4,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_size": 4096,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"init_config": {
"emb_init_std": null,
"emb_init_uniform_lim": null,
"fan_mode": "fan_in",
"init_div_is_residual": true,
"init_gain": 0,
"init_nonlinearity": "relu",
"init_std": 0.02,
"name": "kaiming_normal
",
"verbose": 0
},
"init_device": "cpu",
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"learned_pos_emb": true,
"length_penalty": 1.0,
"logit_scale": null,
"max_length": 20,
"max_seq_len": 2048,
"min_length": 0,
"model_type": "mpt",
"n_heads": 32,
"n_layers": 32,
"no_bias": true,
"no_repeat_ngram_size": 0,
"norm_type": "low_precision_layernorm",
"num_beam_groups": 1,
"num_beams": 1,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"resid_pdrop": 0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"tokenizer_name": "EleutherAI/gpt-neox-20b",
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": false,
"verbose": 0,
"vocab_size": 50432
},
"torch_dtype": "float32",
"transformers_version": null,
"use_media_placement_augmentation": true,
"vision_config": {
"_name_or_path": "openai/clip-vit-large-patch14",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "quick_gelu",
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 224,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-05,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "clip_vision_model",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 24,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"projection_dim": 512,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false
}
}

@xmc-andy
Author

I checked the save_pretrained part as you suggested. I'm using a version from about a month ago, and the save code is as follows:

    unwrapped_model = accelerator.unwrap_model(model)
    checkpoint_dict = get_checkpoint(model=unwrapped_model)
    accelerator.save(
        checkpoint_dict,
        f"{args.external_save_dir}/final_weights.pt",
    )
    # save the config
    unwrapped_model.config.save_pretrained(args.external_save_dir)

I am not sure whether this is part of the reason. I will try the latest branch to train new weights and see if that solves the problem. I will also try pip install -U otter_ai and from otter_ai import OtterForConditionalGeneration. Thank you very much!

@Luodian
Owner

Luodian commented Sep 11, 2023

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything from your currently trained model to a path.

You can then load it with OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.
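
For reference, a minimal sketch of that flow (the variable names follow the training-script snippet quoted above and are assumptions about your setup):

    # Save weights and config together with the Hugging Face save_pretrained API
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.external_save_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )

    # Later, load everything back in a single call
    model = OtterForConditionalGeneration.from_pretrained(args.external_save_dir)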

@xmc-andy
Author

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything from your currently trained model to a path.

You can then load it with OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.

Got it, I'll take your advice and try it.

@xmc-andy
Author

Hey, my workflow is to convert the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then load the result with from_pretrained. Could you tell me whether this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

@Luodian
Owner

Luodian commented Sep 11, 2023

It should be correct, as long as you have confirmed that the config.json is the same.

@xmc-andy
Author

"transformers_version"

Got it, the generated config.json only has "_name_or_path" and ""transformers_version"" different from what you posted.

@xmc-andy
Author

It should be correct, as long as you have confirmed that the config.json is the same.

Thank you very much for your careful answers. I have solved this bug. The cause is that the "_name_or_path" in the generated config.json is derived from the "pretrained_model_name_or_path" argument used during training, but at inference time "_name_or_path" apparently needs to contain the "flamingo" field, so using the config.json you posted instead of the generated one works.
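
For reference, the corresponding line in the config.json posted above is:

    "_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",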

@xmc-andy
Author

Hey, I'd like to ask one more thing: I am doing a binary classification task with a single prompt and multiple images as input, but the results don't seem very good. Do you have any ideas for possible improvements? Currently I plan to try unfreezing the vision encoder. I'd appreciate any suggestions.

@Luodian
Owner

Luodian commented Sep 11, 2023

If you are working with multiple images as input, you could first try arranging them into the F dimension of vision_x.

For training this model, I suggest adding max_num_frames=N, where N is the maximum number of input images.

You can still initialize from the Image model; with the max_num_frames variable set, the model turns into a Video model. You can see this in the training log when the model is initialized.

This is like treating your input images as a video sequence.
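
A rough shape-level sketch of that arrangement (the image count and 224×224 CLIP preprocessing size are illustrative; the real image-processing pipeline is unchanged):

    import torch

    # vision_x layout: (B, T, F, C, H, W); treat the N images as N "frames"
    images = [torch.randn(3, 224, 224) for _ in range(5)]     # N preprocessed images
    vision_x = torch.stack(images).unsqueeze(0).unsqueeze(0)  # -> (1, 1, N, 3, 224, 224)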

@Luodian
Owner

Luodian commented Sep 11, 2023

Another way is to put the images in the in_context dimension, which is the dimension right before F.

vision_x has dimensions B, T, F, C, H, W.

If you do that, you won't need to add the max_num_frames=N mentioned above.
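
Correspondingly, a shape-level sketch of the in_context arrangement (reusing the images list from the sketch above):

    # Put the N images in the in_context dimension T instead of F
    vision_x = torch.stack(images).unsqueeze(1).unsqueeze(0)  # -> (1, N, 1, 3, 224, 224)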

@Luodian
Owner

Luodian commented Sep 11, 2023

Or you could use the --customized_config argument of instruction_following.py to dynamically load a new config.json file (this operation overwrites the model's config.json).

Inside this customized config you can choose whether to set max_num_frames=N.

@Luodian
Owner

Luodian commented Sep 11, 2023

(screenshot attached)

@xmc-andy
Author

Thanks for sharing. I am now training Otter to treat multiple pictures as a video. Since the number of pictures varies per sample, I currently process with batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size, and see whether that improves the results.
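
(A shape-level sketch of one way to pad a variable number of images to a fixed frame count so samples can be batched; zero-padding and the helper below are assumptions, not the repo's own method:)

    import torch

    def pad_frames(images, max_num_frames):
        """Pad a list of (C, H, W) image tensors with zeros up to max_num_frames."""
        frames = torch.stack(images[:max_num_frames])                       # (N, C, H, W), N <= max_num_frames
        pad = torch.zeros(max_num_frames - frames.shape[0], *frames.shape[1:], dtype=frames.dtype)
        return torch.cat([frames, pad]).unsqueeze(0).unsqueeze(0)           # (1, 1, max_num_frames, C, H, W)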

@iz2late

iz2late commented Mar 12, 2024

Thanks for sharing. I am now training Otter to treat multiple pictures as a video. Since the number of pictures varies per sample, I currently process with batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size, and see whether that improves the results.

Any results from your multi-image input experiments? I'm planning to do something similar and am wondering whether you have any insight into which approach is better.
