Wondering how to run inference after fine-tuning #295

Open
5taku opened this issue Jul 26, 2023 · 0 comments

5taku commented Jul 26, 2023

Hi,
I fine-tuned the LLaMA 7B model using Alpaca.

Below is the command I ran.

CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 --master_port=8090 train.py \
    --model_name_or_path ./model/weight/7B \
    --data_path /home/sulki/project/devops/my_own_data.json \
    --bf16 True \
    --output_dir ./output_my_own_data/7B \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True > my_own_data.log
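
For reference, with --nproc_per_node=1 these flags give an effective batch size of 8 (per-device batch) × 16 (gradient accumulation steps) = 128 examples per optimizer step.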

After some time, fine-tuning completed.
Below is the final part of the log.

[2023-07-20 18:05:19,608] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-20 18:05:19,608] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 18:05:29,973] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
ninja: no work to do.
Time to load cpu_adam op: 2.9780356884002686 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.93, 'learning_rate': 1.9131861575179e-05, 'epoch': 0.22}
{'loss': 0.8989, 'learning_rate': 1.7640214797136038e-05, 'epoch': 0.43}
[2023-07-21 10:45:40,546] [WARNING] [stage3.py:1850:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.8909, 'learning_rate': 1.614856801909308e-05, 'epoch': 0.65}
{'loss': 0.8855, 'learning_rate': 1.4656921241050121e-05, 'epoch': 0.87}
{'loss': 0.8753, 'learning_rate': 1.316527446300716e-05, 'epoch': 1.09}
{'loss': 0.8607, 'learning_rate': 1.1673627684964201e-05, 'epoch': 1.3}
{'loss': 0.8557, 'learning_rate': 1.0181980906921243e-05, 'epoch': 1.52}
{'loss': 0.8521, 'learning_rate': 8.690334128878282e-06, 'epoch': 1.74}
{'loss': 0.8464, 'learning_rate': 7.198687350835323e-06, 'epoch': 1.95}
{'loss': 0.7611, 'learning_rate': 5.707040572792363e-06, 'epoch': 2.17}
{'loss': 0.7234, 'learning_rate': 4.2153937947494036e-06, 'epoch': 2.39}
{'loss': 0.7142, 'learning_rate': 2.723747016706444e-06, 'epoch': 2.6}
{'loss': 0.7086, 'learning_rate': 1.2321002386634846e-06, 'epoch': 2.82}
{'train_runtime': 403223.9352, 'train_samples_per_second': 2.194, 'train_steps_per_second': 0.017, 'train_loss': 0.8234077206364384, 'epoch': 3.0}

Below is the generated output_dir.

├── added_tokens.json
├── checkpoint-6000
│   ├── added_tokens.json
│   ├── config.json
│   ├── generation_config.json
│   ├── global_step6000
│   │   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_0_mp_rank_00_model_states.pt
│   ├── latest
│   ├── rng_state.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── config.json
├── generation_config.json
├── global_step6912
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_0_mp_rank_00_model_states.pt
├── latest
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.model
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py

Here's what I've tried:

  1. Run zero_to_fp32.py

python zero_to_fp32.py output_my_own_data/7B/checkpoint-6000 output_my_own_data/7B/result.bin

I named the output file result.bin, and the file was created.
The checkpoint path was set to output_my_own_data/7B/checkpoint-6000 and the output path to output_my_own_data/7B/result.bin.

.
├── added_tokens.json
├── checkpoint-6000
│   ├── added_tokens.json
│   ├── config.json
│   ├── generation_config.json
│   ├── global_step6000
│   │   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_0_mp_rank_00_model_states.pt
│   ├── latest
│   ├── rng_state.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── config.json
├── generation_config.json
├── global_step6912
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_0_mp_rank_00_model_states.pt
├── latest
├── result.bin        <-- newly created
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.model
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py
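
Since zero_to_fp32.py writes a consolidated fp32 state_dict via torch.save, a quick sanity check on the converted file might look like this (a minimal sketch; the printed key names are illustrative):

import torch

# The converted file should be a plain dict mapping parameter names to
# fp32 tensors, not a Hugging Face model directory.
state_dict = torch.load("./output_my_own_data/7B/result.bin", map_location="cpu")
print(type(state_dict))             # expected: a dict (or OrderedDict)
print(list(state_dict.keys())[:5])  # e.g. 'model.embed_tokens.weight', ...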

I loaded the model as below.

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./output_my_own_data/7B/tokenizer.model")
model = LlamaForCausalLM.from_pretrained("./output_my_own_data/7B/result.bin")

print()

The tokenizer loads, but loading the model raises an error.

Exception has occurred: OSError
It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

  File "/home/sulki/project/stanford_alpaca-main/inference copy.py", line 4, in <module>
    model = LlamaForCausalLM.from_pretrained("/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin")
OSError: It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.

Which part is the problem?
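
For reference, transformers' from_pretrained expects a path to a directory containing config.json plus the weight file, not a path to a single .bin file; pointing it at result.bin makes it try to parse the binary weights as a JSON config, which matches the UnicodeDecodeError above. A minimal sketch of the loading I would expect to work, assuming result.bin is renamed to pytorch_model.bin inside the output directory:

import shutil
from transformers import LlamaForCausalLM, LlamaTokenizer

output_dir = "./output_my_own_data/7B"

# from_pretrained resolves a directory, reading config.json,
# tokenizer files, and pytorch_model.bin from inside it.
shutil.move(f"{output_dir}/result.bin", f"{output_dir}/pytorch_model.bin")

tokenizer = LlamaTokenizer.from_pretrained(output_dir)  # picks up tokenizer.model
model = LlamaForCausalLM.from_pretrained(output_dir)    # picks up config.json + weights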
