Anyres compatible fine-tuning of llava-1.6 mistral 7b and 34b #1347

Open
wants to merge 5 commits into main
Conversation

arielnlee

Low-rank (LoRA) fine-tuning with anyres for the LLaVA-NeXT models :)

@awzhgw commented Apr 12, 2024

That's a good PR. I am fine-tuning with this PR. Thanks!

@arielnlee (Author)
Ofc, glad you found it useful! I'm sure the author's version is far superior (<3 llava), but wanted to leave this here for others to use until we get the real magic :)

@awzhgw commented Apr 14, 2024

@arielnlee I encountered an issue during the training process. I am using the LoRA fine-tuning method, and my data consists of two parts:

1. Pure text question-answering dialogues (a lot of them).
2. Image question-answering dialogues.

During training, I found that the training speed on the first (pure-text) part of the dataset is very slow, just as slow as the image part. After investigation, I found the reason:

In the __getitem__ method of the LazySupervisedDataset class in train.py:

        if 'image' in self.list_data_dict[i]:
            data_dict['image'] = image
            data_dict['image_size'] = image_size
        elif self.data_args.is_multimodal:
            # image does not exist in the data, but the model is multimodal
            crop_size = self.data_args.image_processor.crop_size
            data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
            data_dict['image_size'] = crop_size
        return data_dict

When I delete this code:

            elif self.data_args.is_multimodal:
                # image does not exist in the data, but the model is multimodal
                crop_size = self.data_args.image_processor.crop_size
                data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
                data_dict['image_size'] = crop_size
            return data_dict

the training process fails with:

Traceback (most recent call last):
  File "/export/App/training_platform/PinoModel/LLaVA/llava/train/train_mem.py", line 9, in <module>
    train(attn_implementation="flash_attention_2")
  File "/export/App/training_platform/PinoModel/LLaVA/llava/train/train.py", line 1092, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2744, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1964, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2152, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

How can I fine-tune on the pure-text portion at a much faster speed?

Can I do this? In the end I want to train a good LLaVA model.
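
For what it's worth, here is a toy illustration (plain PyTorch, not LLaVA code, and not a diagnosis of this exact failure) of the error class above: backward() raises that RuntimeError whenever nothing that requires grad participated in producing the loss. Dropping the dummy image tensor can put a text-only batch on such a path under LoRA, which is presumably why deleting that branch breaks training; keeping the dummy tensor is the simple way to avoid it.

    import torch
    import torch.nn as nn

    # A frozen module standing in for the base model with no trainable LoRA
    # weights on the active path.
    lm = nn.Linear(8, 8)
    for p in lm.parameters():
        p.requires_grad_(False)

    x = torch.randn(2, 8)      # "pure-text" batch: no trainable branch is touched
    loss = lm(x).sum()         # loss has no grad_fn

    try:
        loss.backward()
    except RuntimeError as e:
        # "element 0 of tensors does not require grad and does not have a grad_fn"
        print(e)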

@rohithbojja

I got adapter_model.safetensors instead of adapter_model.bin after LoRA fine-tuning of 1.6-mistral, and I'm getting this error:
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.30it/s]
Traceback (most recent call last):
  File "/home/rohith/LLaVA-1.6-ft/scripts/merge_lora_weights.py", line 22, in <module>
    merge_lora(args)
  File "/home/rohith/LLaVA-1.6-ft/scripts/merge_lora_weights.py", line 8, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, device_map='cpu')
  File "/home/rohith/LLaVA-1.6-ft/llava/model/builder.py", line 112, in load_pretrained_model
    mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
  File "/home/rohith/miniconda3/envs/llava/lib/python3.10/site-packages/torch/serialization.py", line 986, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/rohith/miniconda3/envs/llava/lib/python3.10/site-packages/torch/serialization.py", line 435, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/rohith/miniconda3/envs/llava/lib/python3.10/site-packages/torch/serialization.py", line 416, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/rohith/Documents/mistral-llava/mm_projector.bin'

when trying to merge model

@awzhgw commented Apr 18, 2024

@rohithbojja Maybe your model_path is wrong. Please share your model_path and model_base args.

@awzhgw commented Apr 18, 2024

@rohithbojja

nohup python scripts/merge_lora_weights.py --model-path=../checkpoints/llava-v1.6-34b-xxx-lora-5000 --model-base=../checkpoints/llava-v1.6-34b --save-model-path=../checkpoints/llava-v1.6-34b-xxx-5000 &

@rohithbojja

I've fixed it by adding "lora" to the model-path in the command above.
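
For context, a hedged sketch of why the folder name matters (this paraphrases llava/model/builder.py rather than quoting it): load_pretrained_model only takes the LoRA-merge branch when "lora" appears in the model name; otherwise it assumes a projector-only checkpoint and tries to load mm_projector.bin, which is the FileNotFoundError above. A quick sanity check before merging:

    from llava.mm_utils import get_model_name_from_path

    # Keep "lora" somewhere in the checkpoint directory name.
    model_path = "../checkpoints/llava-v1.6-34b-xxx-lora-5000"
    model_name = get_model_name_from_path(model_path)
    assert "lora" in model_name.lower(), "rename the checkpoint directory to include 'lora'"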

@findalexli

Can you please provide some examples of your training data?

system="""<|im_start|>system\nAnswer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),

I was wondering why you chose to add a new conversation format. I was trying to fine-tune based on your PR with my existing data, which was made for LLaVA 1.5 fine-tuning and uses the 'v1' conversation version, but I'm currently running into issues where the tokenizer lengths mismatch.
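
For reference, a rough sketch of how a prompt built from the ChatML-style strings quoted above might render; the <|im_end|> separator and the example turns are assumptions on my part, not taken from the PR:

    # Hypothetical rendering using only the system/roles strings quoted above.
    system = "<|im_start|>system\nAnswer the questions."
    roles = ("<|im_start|>user\n", "<|im_start|>assistant\n")
    sep = "<|im_end|>"  # assumed ChatML-style separator, not confirmed by the PR

    turns = [("user", "<image>\nWhat is shown in this image?"),
             ("assistant", "A chest X-ray.")]

    prompt = system + sep + "\n"
    for role, msg in turns:
        prefix = roles[0] if role == "user" else roles[1]
        prompt += prefix + msg + sep + "\n"
    print(prompt)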

@rohithbojja commented Apr 21, 2024
#!/bin/bash

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 16 --lora_alpha 32 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /home/rohith/llava-v1.6-mistral-7b-bnb-4bit/ \
    --version mistral_instruct \
    --data_path /home/rohith/Desktop/vqa/vqa/images/filtered_dataset.json \
    --image_folder /home/rohith/Desktop/vqa/vqa/images/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --mm_patch_merge_type spatial_unpad \
    --image_aspect_ratio anyres \
    --group_by_modality_length False \
    --bf16 False \
    --fp16 True \
    --output_dir /home/rohith/LLaVA-1.6-ft/llava_lora_mistral_med/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Using this script gives me this error:

ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.

Using the original model doesn't give any error. I was using the panoyo9829/llava-v1.6-mistral-7b-bnb-4bit model.
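
For what it's worth, that .to error is what bitsandbytes raises when something tries to move or cast an already-quantized model, which a pre-quantized checkpoint like the 4-bit one above can trigger during training. If 4-bit LoRA training is the goal, one option that may be worth trying (assuming this fork keeps upstream LLaVA's --bits option in train.py) is to point --model_name_or_path at the full-precision base and let the training script quantize at load time:

    # Hedged sketch: quantize at load time instead of starting from a
    # pre-quantized checkpoint. Assumes the fork keeps upstream LLaVA's
    # --bits argument. Append the remaining flags exactly as in the
    # script above (deepspeed config, data paths, vision tower, etc.).
    deepspeed llava/train/train_mem.py \
        --lora_enable True --lora_r 16 --lora_alpha 32 --mm_projector_lr 2e-5 \
        --bits 4 \
        --model_name_or_path liuhaotian/llava-v1.6-mistral-7b \
        --version mistral_instruct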

@findalexli commented Apr 24, 2024 via email

@rohithbojja commented Apr 24, 2024

@findalexli
Use this to download the dataset:

https://drive.google.com/file/d/1gYLOFaz7Mn-E2u9ksT0R2BOai7MnmNcm/view?usp=drivesdk

It has the following structure:

    VQA/
        images/
            img1
            img2
        train/
            filtered_dataset.json

One image is truncated. Remove it. Use this script to detect it:

from PIL import Image
import os

def is_truncated(image_path):
    try:
        # Open the image file
        img = Image.open(image_path)
        # Check if the image is truncated by trying to load it
        img.load()
        return False  # Image is not truncated
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return True  # Image is truncated or corrupt

def check_for_truncated_images(directory, trunk_):
    # Iterate through all files in the directory
    for filename in os.listdir(directory):
        # Check if the file is an image
        if filename.endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
            image_path = os.path.join(directory, filename)
            if is_truncated(image_path):
                print(f"The image {filename} in directory {directory} is truncated.")
                trunk_ = 1  # remember that at least one truncated image was found
    print(trunk_)

directory_path = '/workspace/vqa/images'
check_for_truncated_images(directory_path, 0)

Also remove the corresponding entry from the JSON file; otherwise you'll end up failing at around 30% of the way through.
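
A minimal sketch of that JSON clean-up step, assuming the usual LLaVA-style annotation layout where each record carries an "image" field; the filename below is a placeholder for whichever file the checker above reports:

    import json

    bad_images = {"REPLACE_WITH_REPORTED_FILENAME.jpg"}  # hypothetical placeholder

    with open("filtered_dataset.json") as f:
        data = json.load(f)

    # Keep records that either have no image or whose image is not in the bad set.
    clean = [rec for rec in data if rec.get("image") not in bad_images]

    with open("filtered_dataset_clean.json", "w") as f:
        json.dump(clean, f, indent=2)

    print(f"kept {len(clean)} of {len(data)} records")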

Good luck

@Sato-Daichi mentioned this pull request on May 1, 2024
@diridiri

Hi, thanks for working on a private version of anyres LLaVA.

I have finished fine-tuning vicuna-v1.5-7b with anyres / spatial_unpad in the same configuration as above, but the result doesn't seem to work out well on lmms-eval, with an MME score of 357 / 224 (LLaVA-v1.5-7B: 1519 / 332).

Have you run any evaluation on public benchmarks and gotten similar scores?
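
In case it helps reproduce the numbers, a sketch of the MME run being described, assuming lmms-eval's standard CLI (the checkpoint path is a placeholder):

    # Hedged sketch of an lmms-eval MME run; flags follow the lmms-eval README.
    accelerate launch -m lmms_eval \
        --model llava \
        --model_args pretrained="/path/to/your/anyres-finetuned-checkpoint" \
        --tasks mme \
        --batch_size 1 \
        --log_samples \
        --output_path ./logs/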
