
Some weights of OtterForConditionalGeneration were not initialized from the model #270

Open
xmc-andy opened this issue Sep 11, 2023 · 23 comments

Comments

@xmc-andy

Hello, I ran into the following output when testing my trained weights. I spent a long time trying to track down the cause, but unfortunately I haven't found it yet. Can you help me?
Earlier I used the official weights to train a baseline on my own classification data. The results were not very good, but the message "Some weights of OtterForConditionalGeneration were not initialized ... and are newly initialized" did not appear. After training another version of the model, this warning showed up when I tested it.

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:30<00:00, 7.62s/it]
Some weights of OtterForConditionalGeneration were not initialized from the model checkpoint at /mnt/large_model/weights/BC4-partScale-negAug3 and are newly initialized: ['vision_encoder.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

@Luodian
Owner

Luodian commented Sep 11, 2023

May I know your task type and which version of Otter model you are using for initialization?

@xmc-andy
Author

May I know your task type and which version of Otter model you are using for initialization?

I am doing a classification task with multiple images and a single prompt as input, in the SD dataset format, and the pretrained weights are "OTTER-Image-MPT7B".

@xmc-andy
Author

export PYTHONPATH=.

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path /mnt/large_model/weights/OTTER-Image-MPT7B_git \
    --mimicit_vt_path /mnt/large_model/output/XX/SD_instruction.json \
    --images_vt_path /mnt/large_model/output/XX/SD.json \
    --external_save_dir /mnt/large_model/output/XX/OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --batch_size 1 \
    --num_epochs 15 \
    --run_name OTTER-Identify-Image-MPT7B-BC4-partScale-negAug3 \
    --workers 24 \
    --lr_scheduler cosine \
    --learning_rate 1e-5 \
    --max-src-length 256 \
    --warmup_steps_ratio 0.01 \
    --save_ckpt_each_epoch \
    --delete_previous_checkpoint \
    --report_to_wandb

@Luodian
Owner

Luodian commented Sep 11, 2023

Does the missing-weights log appear when you directly load the model? You could set a breakpoint right after the loading process finishes and check.
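
For example, a minimal sketch of that check (it assumes OtterForConditionalGeneration follows the standard Hugging Face from_pretrained interface and that the repo-style import path works; the checkpoint path is illustrative):

    # Load the checkpoint directly and inspect what was reported as missing.
    # output_loading_info is a standard Transformers from_pretrained flag.
    from otter.modeling_otter import OtterForConditionalGeneration  # repo-style import (assumption)

    model, loading_info = OtterForConditionalGeneration.from_pretrained(
        "/path/to/your/checkpoint",  # illustrative path
        output_loading_info=True,
    )
    print(loading_info["missing_keys"])
    # breakpoint()  # pause here to inspect the model right after loading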

@xmc-andy
Author

When I load the pretrained weights you released, or the baseline weights I trained earlier, there is no missing-weights log, but it does appear when I load the newly trained model weights.
Sorry, I haven't been able to find where this log message comes from or why it is triggered.

@xmc-andy
Author

By the way, due to network problems I cannot download tokenizer_config.json from Hugging Face's MPT repo, so I downloaded the files offline from https://huggingface.co/mosaicml/mpt-7b-instruct (everything except the .bin weight files). The only change in modeling_otter.py is:

    text_tokenizer = AutoTokenizer.from_pretrained("/mnt/train_pipeline-master/Otter/mpt-7b-instruct")

@Luodian
Owner

Luodian commented Sep 11, 2023

Could you download the model's config from this path?

https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Image-MPT7B

The config.json should be in the following format:

{
  "_commit_hash": null,
  "_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",
  "architectures": [
    "OtterForConditionalGeneration"
  ],
  "cross_attn_every_n_layers": 4,
  "model_type": "otter",
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "MPTForCausalLM"
    ],
    "attn_config": {
      "alibi": true,
      "alibi_bias_max": 8,
      "attn_impl": "torch",
      "attn_pdrop": 0,
      "attn_type": "multihead_attention",
      "attn_uses_sequence_id": false,
      "clip_qkv": null,
      "prefix_lm": false,
      "qk_ln": false,
      "softmax_scale": null
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "d_model": 4096,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "emb_pdrop": 0,
    "embedding_fraction": 1.0,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "expansion_ratio": 4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "init_config": {
      "emb_init_std": null,
      "emb_init_uniform_lim": null,
      "fan_mode": "fan_in",
      "init_div_is_residual": true,
      "init_gain": 0,
      "init_nonlinearity": "relu",
      "init_std": 0.02,
      "name": "kaiming_normal_",
      "verbose": 0
    },
    "init_device": "cpu",
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "learned_pos_emb": true,
    "length_penalty": 1.0,
    "logit_scale": null,
    "max_length": 20,
    "max_seq_len": 2048,
    "min_length": 0,
    "model_type": "mpt",
    "n_heads": 32,
    "n_layers": 32,
    "no_bias": true,
    "no_repeat_ngram_size": 0,
    "norm_type": "low_precision_layernorm",
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "resid_pdrop": 0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "tokenizer_name": "EleutherAI/gpt-neox-20b",
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": false,
    "verbose": 0,
    "vocab_size": 50432
  },
  "torch_dtype": "float32",
  "transformers_version": null,
  "use_media_placement_augmentation": true,
  "vision_config": {
    "_name_or_path": "openai/clip-vit-large-patch14",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "quick_gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.30.1",
    "typical_p": 1.0,
    "use_bfloat16": false
  }
}

Also, make sure you use the save_pretrained method to save checkpoints.

            unwrapped_model.save_pretrained(
                f"{args.external_save_dir}",
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save,
                state_dict=checkpoint_dict,
            )

@Luodian
Owner

Luodian commented Sep 11, 2023

The missing position_ids usually comes from the LLM part. Make sure you are using the latest branch code for loading the Otter model at initialization.

You can now try pip install -U otter_ai

and then from otter_ai import OtterForConditionalGeneration.

That will automatically handle the loading of modeling_mpt.py.
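
I.e., a minimal loading sketch via the pip package (the checkpoint path is illustrative; device_map="auto" assumes accelerate is installed):

    # pip install -U otter_ai
    from otter_ai import OtterForConditionalGeneration

    # Loading through the package resolves the bundled modeling_mpt.py automatically.
    model = OtterForConditionalGeneration.from_pretrained(
        "/path/to/your/checkpoint",  # illustrative path
        device_map="auto",
    )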

@xmc-andy
Author

xmc-andy commented Sep 11, 2023

I compared the config.json files. Except for "_name_or_path" and "transformers_version", everything matches what you posted, so that should not be the problem.
My workflow is to convert the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then load the result with from_pretrained. Could you tell me whether this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

{
"_commit_hash": null,
"_name_or_path": "/mnt/large_model/weights/OTTER-Image-MPT7B_git",
"architectures": [
"OtterForConditionalGeneration"
],
"cross_attn_every_n_layers": 4,
"model_type": "otter",
"text_config": {
"name_or_path": "",
"add_cross_attention": false,
"architectures": [
"MPTForCausalLM"
],
"attn_config": {
"alibi": true,
"alibi_bias_max": 8,
"attn_impl": "torch",
"attn_pdrop": 0,
"attn_type": "multihead_attention",
"attn_uses_sequence_id": false,
"clip_qkv": null,
"prefix_lm": false,
"qk_ln": false,
"softmax_scale": null
},
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"d_model": 4096,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"emb_pdrop": 0,
"embedding_fraction": 1.0,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"expansion_ratio": 4,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_size": 4096,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"init_config": {
"emb_init_std": null,
"emb_init_uniform_lim": null,
"fan_mode": "fan_in",
"init_div_is_residual": true,
"init_gain": 0,
"init_nonlinearity": "relu",
"init_std": 0.02,
"name": "kaiming_normal
",
"verbose": 0
},
"init_device": "cpu",
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"learned_pos_emb": true,
"length_penalty": 1.0,
"logit_scale": null,
"max_length": 20,
"max_seq_len": 2048,
"min_length": 0,
"model_type": "mpt",
"n_heads": 32,
"n_layers": 32,
"no_bias": true,
"no_repeat_ngram_size": 0,
"norm_type": "low_precision_layernorm",
"num_beam_groups": 1,
"num_beams": 1,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"resid_pdrop": 0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"tokenizer_name": "EleutherAI/gpt-neox-20b",
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": false,
"verbose": 0,
"vocab_size": 50432
},
"torch_dtype": "float32",
"transformers_version": null,
"use_media_placement_augmentation": true,
"vision_config": {
"_name_or_path": "openai/clip-vit-large-patch14",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "quick_gelu",
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 224,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-05,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "clip_vision_model",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 24,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"projection_dim": 512,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false
}
}

@xmc-andy
Author

I checked the save_pretrained part as you suggested. I'm using a version from about a month ago, and the save code is as follows:

    unwrapped_model = accelerator.unwrap_model(model)
    checkpoint_dict = get_checkpoint(model=unwrapped_model)
    accelerator.save(
        checkpoint_dict,
        f"{args.external_save_dir}/final_weights.pt",
    )
    # save the config
    unwrapped_model.config.save_pretrained(args.external_save_dir)

I am not sure whether this is part of the reason. I will try the latest branch to train new weights and see if that solves the problem. I will also try pip install -U otter_ai and from otter_ai import OtterForConditionalGeneration. Thank you very much!

@Luodian
Owner

Luodian commented Sep 11, 2023

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything from your currently trained model to a path.

You can then load it with OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.
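
For reference, a minimal sketch of that flow (the variable names follow the training-script snippet quoted above and are assumptions about your setup):

    # Save weights and config together with the Hugging Face save_pretrained API
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.external_save_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )

    # Later, load everything back in a single call
    model = OtterForConditionalGeneration.from_pretrained(args.external_save_dir)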

@xmc-andy
Author

I would suggest using the save_pretrained method (it's a function from Hugging Face Transformers). It directly dumps everything from your currently trained model to a path.

You can then load it with OtterForConditionalGeneration.from_pretrained("path").

This process is safer and won't cause the missing-weights problem.

Got it, I'll take your advice and try it.

@xmc-andy
Author

Hey, my workflow is to convert the trained weights final_weights.pt with otter/converting_otter_pt_to_hf.py and then load the result with from_pretrained. Could you tell me whether this is correct? When converting the weights, using the config.json you posted and the config.json generated by training seems to give the same result. Is there a difference?

@Luodian
Owner

Luodian commented Sep 11, 2023

It should be correct, as long as you have confirmed that the config.json is the same.

@xmc-andy
Author

"transformers_version"

Got it, the generated config.json only has "_name_or_path" and ""transformers_version"" different from what you posted.

@xmc-andy
Author

It should be correct, as long as you have confirmed that the config.json is the same.

Thank you very much for your careful answers. I have solved this bug. The cause is that the "_name_or_path" in the generated config.json is derived from the "pretrained_model_name_or_path" argument used during training, but at inference time "_name_or_path" apparently needs to contain the "flamingo" field, so using the config.json you posted instead of the generated one works.
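
For reference, the corresponding line in the config.json posted above is:

    "_name_or_path": "/mnt/petrelfs/zhangyuanhan/weights/flamingo-mpt-7B",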

@xmc-andy
Author

Hey, I'd like to ask one more thing: I am doing a binary classification task with a single prompt and multiple images as input, but the results don't seem very good. Do you have any ideas for possible improvements? Currently I plan to try unfreezing the vision encoder. I'd appreciate any suggestions.

@Luodian
Owner

Luodian commented Sep 11, 2023

If you are working with multiple images as input, you could first try arranging them into the F dimension of vision_x.

For training this model, I suggest adding max_num_frames=N, where N is the maximum number of input images.

You can still initialize from the Image model; with the max_num_frames variable set, the model turns into a Video model. You can see this in the training log when the model is initialized.

This is like treating your input images as a video sequence.
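
A rough shape-level sketch of that arrangement (the image count and 224×224 CLIP preprocessing size are illustrative; the real image-processing pipeline is unchanged):

    import torch

    # vision_x layout: (B, T, F, C, H, W); treat the N images as N "frames"
    images = [torch.randn(3, 224, 224) for _ in range(5)]     # N preprocessed images
    vision_x = torch.stack(images).unsqueeze(0).unsqueeze(0)  # -> (1, 1, N, 3, 224, 224)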

@Luodian
Owner

Luodian commented Sep 11, 2023

Another way is to put the images in the in_context dimension, which is the dimension right before F.

vision_x has dimensions B, T, F, C, H, W.

If you do that, you won't need to add the max_num_frames=N mentioned above.
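
Correspondingly, a shape-level sketch of the in_context arrangement (reusing the images list from the sketch above):

    # Put the N images in the in_context dimension T instead of F
    vision_x = torch.stack(images).unsqueeze(1).unsqueeze(0)  # -> (1, N, 1, 3, 224, 224)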

@Luodian
Owner

Luodian commented Sep 11, 2023

Or you could use the --customized_config argument of instruction_following.py to dynamically load a new config.json file (this operation overwrites the model's config.json).

Inside this customized config you can choose whether to set max_num_frames=N.

@Luodian
Owner

Luodian commented Sep 11, 2023

(screenshot attached)

@xmc-andy
Author

Thanks for sharing. I am now training Otter to treat multiple pictures as a video. Since the number of pictures varies per sample, I currently process with batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size, and see whether that improves the results.
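
(A shape-level sketch of one way to pad a variable number of images to a fixed frame count so samples can be batched; zero-padding and the helper below are assumptions, not the repo's own method:)

    import torch

    def pad_frames(images, max_num_frames):
        """Pad a list of (C, H, W) image tensors with zeros up to max_num_frames."""
        frames = torch.stack(images[:max_num_frames])                       # (N, C, H, W), N <= max_num_frames
        pad = torch.zeros(max_num_frames - frames.shape[0], *frames.shape[1:], dtype=frames.dtype)
        return torch.cat([frames, pad]).unsqueeze(0).unsqueeze(0)           # (1, 1, max_num_frames, C, H, W)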

@iz2late

iz2late commented Mar 12, 2024

Thanks for sharing. I am now training Otter to treat multiple pictures as a video. Since the number of pictures varies per sample, I currently process with batch_size=1. Later I will try setting a maximum number of frames so I can increase the batch size, and see whether that improves the results.

Any results from your multi-image input experiments? I'm planning to do something similar and am wondering whether you have any insight into which approach is better.
