Able to merge 1.5B model, but unable to run eval #50

Open
tanveer-sayyed opened this issue Apr 17, 2024 · 6 comments

@tanveer-sayyed

tanveer-sayyed commented Apr 17, 2024

As per the instructions, we were able to merge the base model and finetuned model. But on running eval we get this error:

[screenshot: error traceback]

But we do not encounter this error when we run the unmerged model directly. Why? Is merging the right way to do it?
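(For reference, my rough understanding of the two loading paths, sketched with plain transformers + peft calls; the TinyLLaVA-specific vision tower/projector handling is omitted and the local paths are placeholders:)

    # Minimal sketch of "unmerged" vs "merged" loading (simplified; the actual
    # TinyLLaVA loader also sets up the vision tower and mm projector).
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # trust_remote_code may be needed for the custom TinyLLaVA architecture.
    base = AutoModelForCausalLM.from_pretrained(
        "bczhou/TinyLLaVA-1.5B", trust_remote_code=True
    )

    # "Unmerged": the LoRA adapter is applied on top of the base model at runtime.
    unmerged = PeftModel.from_pretrained(base, "/path/to/lora-checkpoint")

    # "Merged": the adapter deltas are folded into the base weights and saved as
    # a standalone checkpoint, which eval then loads like any other model.
    merged = unmerged.merge_and_unload()
    merged.save_pretrained("/path/to/merged-checkpoint")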


training script:
deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --lora_enable True --lora_r 32 --lora_alpha 64 \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version phi \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --fp16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 15 \
    --lazy_preprocess True \
    --report_to wandb

@baichuanzhou
Contributor

The load_pretrained_model method depends heavily on the name of your model's output_dir. What did you name it? It appears that load_pretrained_model recognized your model as TinyLLaVA-3.1B instead of TinyLLaVA-1.5B.
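For illustration, the kind of name-based dispatch at issue looks roughly like this (this is not the actual TinyLLaVA source; the function and branch names below are made up):

    # Illustrative only -- NOT the real loader code. The point is that the
    # backbone is guessed from substrings of the checkpoint path, so a directory
    # name containing "phi" gets routed to the 3.1B (Phi-2) branch even if the
    # weights inside are actually the 1.5B model.
    def guess_backbone(model_path: str) -> str:
        name = model_path.lower()
        if "phi" in name:
            return "phi"        # treated as TinyLLaVA-3.1B
        if "stablelm" in name:
            return "stablelm"   # treated as TinyLLaVA-2.0B
        return "tinyllama"      # treated as TinyLLaVA-1.5B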

Also, set conv_mode to v1 when training TinyLLaVA-1.5B.

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

OUTPUT_DIR=/home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-finetune--phi-lora-al-new

Also, I assumed conv_mode was needed only during inference. Okay, I will re-train with it set to v1 and post the results here.

Lastly, just for info, my packages:

tokenizers==0.15.1
torch==2.0.1
transformers==4.37.2

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

After adding conv_mode:

[screenshot: error output]

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

I guess it's because of phi in the checkpoint name that the 3.1B is getting loaded, as per this line.

@baichuanzhou
Contributor

The 1.5B model uses TinyLlama as its backbone. Why did you include phi in your model name?

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 19, 2024

Yes, my bad. Honestly, that was ignorance on my end.

So I re-trained using this script:

OUTPUT_DIR=/home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419
deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --lora_enable True --lora_r 32 --lora_alpha 64 \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version v1 \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --fp16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 15 \
    --lazy_preprocess True \
    --report_to wandb

And then merged using:

python scripts/merge_lora_weights.py \
    --model-path /home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419 \
    --model-base bczhou/TinyLLaVA-1.5B \
    --save-model-path /home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419-merged

But while running the eval (run_tiny_llava.py), I encountered a series of errors...

[screenshots: three error tracebacks]

... all of which were resolved by copy-pasting files from the finetuned (LoRA) checkpoint directory into the merged model directory. Is this approach incorrect?
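For reference, the workaround amounts to roughly the following; the file names listed are only examples, since which files need copying depends on which ones the eval errors point to:

    # Sketch of the workaround described above: copy the non-weight artifacts
    # that merge_lora_weights.py did not write into the merged directory.
    # The file list is an example -- copy whichever files the eval errors name.
    import shutil
    from pathlib import Path

    lora_dir = Path("/home/xxx/TinyLLaVABench/checkpoints/"
                    "tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419")
    merged_dir = Path(str(lora_dir) + "-merged")

    for name in ["tokenizer_config.json", "tokenizer.model", "special_tokens_map.json"]:
        src = lora_dir / name
        if src.exists():
            shutil.copy(src, merged_dir / name)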
