Wondering how to run inference after fine-tuning #295

Open
5taku opened this issue Jul 26, 2023 · 0 comments

5taku commented Jul 26, 2023

Hi,
I fine-tuned the LLaMA 7B model using Alpaca.

Below is the command I ran.

CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 --master_port=8090 train.py \
    --model_name_or_path ./model/weight/7B \
    --data_path /home/sulki/project/devops/my_own_data.json \
    --bf16 True \
    --output_dir ./output_my_own_data/7B \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True > my_own_data.log
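
For reference, with --nproc_per_node=1 these flags give an effective batch size of 8 (per-device batch) × 16 (gradient accumulation steps) = 128 examples per optimizer step.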

After some time, fine-tuning completed.
Below is the final part of the log.

[2023-07-20 18:05:19,608] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-20 18:05:19,608] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 18:05:29,973] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
ninja: no work to do.
Time to load cpu_adam op: 2.9780356884002686 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.93, 'learning_rate': 1.9131861575179e-05, 'epoch': 0.22}
{'loss': 0.8989, 'learning_rate': 1.7640214797136038e-05, 'epoch': 0.43}
[2023-07-21 10:45:40,546] [WARNING] [stage3.py:1850:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.8909, 'learning_rate': 1.614856801909308e-05, 'epoch': 0.65}
{'loss': 0.8855, 'learning_rate': 1.4656921241050121e-05, 'epoch': 0.87}
{'loss': 0.8753, 'learning_rate': 1.316527446300716e-05, 'epoch': 1.09}
{'loss': 0.8607, 'learning_rate': 1.1673627684964201e-05, 'epoch': 1.3}
{'loss': 0.8557, 'learning_rate': 1.0181980906921243e-05, 'epoch': 1.52}
{'loss': 0.8521, 'learning_rate': 8.690334128878282e-06, 'epoch': 1.74}
{'loss': 0.8464, 'learning_rate': 7.198687350835323e-06, 'epoch': 1.95}
{'loss': 0.7611, 'learning_rate': 5.707040572792363e-06, 'epoch': 2.17}
{'loss': 0.7234, 'learning_rate': 4.2153937947494036e-06, 'epoch': 2.39}
{'loss': 0.7142, 'learning_rate': 2.723747016706444e-06, 'epoch': 2.6}
{'loss': 0.7086, 'learning_rate': 1.2321002386634846e-06, 'epoch': 2.82}
{'train_runtime': 403223.9352, 'train_samples_per_second': 2.194, 'train_steps_per_second': 0.017, 'train_loss': 0.8234077206364384, 'epoch': 3.0}

Below is the generated output_dir.

├── added_tokens.json
├── checkpoint-6000
│   ├── added_tokens.json
│   ├── config.json
│   ├── generation_config.json
│   ├── global_step6000
│   │   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_0_mp_rank_00_model_states.pt
│   ├── latest
│   ├── rng_state.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── config.json
├── generation_config.json
├── global_step6912
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_0_mp_rank_00_model_states.pt
├── latest
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.model
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py

Here's what I've tried:

  1. Run zero_to_fp32.py

python zero_to_fp32.py output_my_own_data/7B/checkpoint-6000 output_my_own_data/7B/result.bin

I named the output file result.bin, and the file was created.
The checkpoint path was set to output_my_own_data/7B/checkpoint-6000 and the output path to output_my_own_data/7B/result.bin.

.
├── added_tokens.json
├── checkpoint-6000
│   ├── added_tokens.json
│   ├── config.json
│   ├── generation_config.json
│   ├── global_step6000
│   │   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_0_mp_rank_00_model_states.pt
│   ├── latest
│   ├── rng_state.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── config.json
├── generation_config.json
├── global_step6912
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_0_mp_rank_00_model_states.pt
├── latest
├── result.bin        <-- newly created
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.model
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py
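
Since zero_to_fp32.py writes a consolidated fp32 state_dict via torch.save, a quick sanity check on the converted file might look like this (a minimal sketch; the printed key names are illustrative):

import torch

# The converted file should be a plain dict mapping parameter names to
# fp32 tensors, not a Hugging Face model directory.
state_dict = torch.load("./output_my_own_data/7B/result.bin", map_location="cpu")
print(type(state_dict))             # expected: a dict (or OrderedDict)
print(list(state_dict.keys())[:5])  # e.g. 'model.embed_tokens.weight', ...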

I loaded the model as below.

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./output_my_own_data/7B/tokenizer.model")
model = LlamaForCausalLM.from_pretrained("./output_my_own_data/7B/result.bin")

print()

The tokenizer loads, but loading the model raises an error.

Exception has occurred: OSError
It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

  File "/home/sulki/project/stanford_alpaca-main/inference copy.py", line 4, in <module>
    model = LlamaForCausalLM.from_pretrained("/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin")
OSError: It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.

Which part is the problem?
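
For reference, transformers' from_pretrained expects a path to a directory containing config.json plus the weight file, not a path to a single .bin file; pointing it at result.bin makes it try to parse the binary weights as a JSON config, which matches the UnicodeDecodeError above. A minimal sketch of the loading I would expect to work, assuming result.bin is renamed to pytorch_model.bin inside the output directory:

import shutil
from transformers import LlamaForCausalLM, LlamaTokenizer

output_dir = "./output_my_own_data/7B"

# from_pretrained resolves a directory, reading config.json,
# tokenizer files, and pytorch_model.bin from inside it.
shutil.move(f"{output_dir}/result.bin", f"{output_dir}/pytorch_model.bin")

tokenizer = LlamaTokenizer.from_pretrained(output_dir)  # picks up tokenizer.model
model = LlamaForCausalLM.from_pretrained(output_dir)    # picks up config.json + weights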
