Able to merge 1.5B model, but unable to run eval #50

Open
tanveer-sayyed opened this issue Apr 17, 2024 · 6 comments

@tanveer-sayyed

tanveer-sayyed commented Apr 17, 2024

As per the instructions, we were able to merge the base model and finetuned model. But on running eval we get this error:

[screenshot: error traceback]

But we do not encounter this error when we run the unmerged model directly. Why? Is merging the right way to do it?
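(For reference, my rough understanding of the two loading paths, sketched with plain transformers + peft calls; the TinyLLaVA-specific vision tower/projector handling is omitted and the local paths are placeholders:)

    # Minimal sketch of "unmerged" vs "merged" loading (simplified; the actual
    # TinyLLaVA loader also sets up the vision tower and mm projector).
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # trust_remote_code may be needed for the custom TinyLLaVA architecture.
    base = AutoModelForCausalLM.from_pretrained(
        "bczhou/TinyLLaVA-1.5B", trust_remote_code=True
    )

    # "Unmerged": the LoRA adapter is applied on top of the base model at runtime.
    unmerged = PeftModel.from_pretrained(base, "/path/to/lora-checkpoint")

    # "Merged": the adapter deltas are folded into the base weights and saved as
    # a standalone checkpoint, which eval then loads like any other model.
    merged = unmerged.merge_and_unload()
    merged.save_pretrained("/path/to/merged-checkpoint")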


training script:
deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --lora_enable True --lora_r 32 --lora_alpha 64 \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version phi \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --fp16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 15 \
    --lazy_preprocess True \
    --report_to wandb

@baichuanzhou
Contributor

The load_pretrained_model method depends heavily on the name of your model's output_dir. What did you name it? It appears that load_pretrained_model recognized your model as TinyLLaVA-3.1B instead of TinyLLaVA-1.5B.
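For illustration, the kind of name-based dispatch at issue looks roughly like this (this is not the actual TinyLLaVA source; the function and branch names below are made up):

    # Illustrative only -- NOT the real loader code. The point is that the
    # backbone is guessed from substrings of the checkpoint path, so a directory
    # name containing "phi" gets routed to the 3.1B (Phi-2) branch even if the
    # weights inside are actually the 1.5B model.
    def guess_backbone(model_path: str) -> str:
        name = model_path.lower()
        if "phi" in name:
            return "phi"        # treated as TinyLLaVA-3.1B
        if "stablelm" in name:
            return "stablelm"   # treated as TinyLLaVA-2.0B
        return "tinyllama"      # treated as TinyLLaVA-1.5B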

Also, set conv_mode to v1 when training TinyLLaVA-1.5B.

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

OUTPUT_DIR=/home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-finetune--phi-lora-al-new

Also, I assumed conv_mode was needed only during inference. Okay, I will re-train with it set to v1 and post the results here.

Lastly, just for info, my packages:

tokenizers==0.15.1
torch==2.0.1
transformers==4.37.2

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

After adding conv_mode:

[screenshot: error output]

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 18, 2024

I guess it's because of phi in the checkpoint name that the 3.1B is getting loaded, as per this line.

@baichuanzhou
Contributor

The 1.5B model uses TinyLlama as its backbone. Why did you include phi in your model name?

@tanveer-sayyed
Author

tanveer-sayyed commented Apr 19, 2024

Yes, my bad. Honestly, that was ignorance on my end.

So I re-trained using this script:

OUTPUT_DIR=/home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419
deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --lora_enable True --lora_r 32 --lora_alpha 64 \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version v1 \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --fp16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 15 \
    --lazy_preprocess True \
    --report_to wandb

And then merged using:

python scripts/merge_lora_weights.py \
    --model-path /home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419 \
    --model-base bczhou/TinyLLaVA-1.5B \
    --save-model-path /home/xxx/TinyLLaVABench/checkpoints/tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419-merged

But while running the eval (run_tiny_llava.py), I encountered a series of errors...

[screenshots: three error tracebacks]

... all of which were resolved by copy-pasting files from the finetuned (LoRA) checkpoint directory into the merged model directory. Is this approach incorrect?
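For reference, the workaround amounts to roughly the following; the file names listed are only examples, since which files need copying depends on which ones the eval errors point to:

    # Sketch of the workaround described above: copy the non-weight artifacts
    # that merge_lora_weights.py did not write into the merged directory.
    # The file list is an example -- copy whichever files the eval errors name.
    import shutil
    from pathlib import Path

    lora_dir = Path("/home/xxx/TinyLLaVABench/checkpoints/"
                    "tiny-llava-base-TinyLLaVA-1.5B-v1-finetune-lora-al-0419")
    merged_dir = Path(str(lora_dir) + "-merged")

    for name in ["tokenizer_config.json", "tokenizer.model", "special_tokens_map.json"]:
        src = lora_dir / name
        if src.exists():
            shutil.copy(src, merged_dir / name)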
