Reminder
Reproduction
CUDA_VISIBLE_DEVICES=1 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/wzb/yan/Meta-Llama-3-8B-Instruct \
    --finetuning_type lora \
    --template default \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_gpt4_zh \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --report_to none \
    --output_dir saves/LLaMA3-8B/lora/train_2024-04-25-15-45-51 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target q_proj,v_proj \
    --val_size 0.15 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --per_device_eval_batch_size 2 \
    --load_best_model_at_end True \
    --plot_loss True
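As a quick sanity check, the batch-size flags above are consistent with the epoch values in the log below (the total dataset size is not stated anywhere; it is inferred here from the "Num examples = 7323" evaluation line and val_size 0.15):

# Sanity-check arithmetic for the run above (Python; values inferred from the log).
per_device_bs, grad_accum = 2, 8
effective_bs = per_device_bs * grad_accum        # 16 samples per optimizer step
eval_examples = 7323                             # "Num examples" in the evaluation log
total_examples = round(eval_examples / 0.15)     # val_size 0.15 -> ~48,820 examples total
train_examples = total_examples - eval_examples  # ~41,497 training examples
steps_per_epoch = train_examples / effective_bs  # ~2,594 optimizer steps per epoch
print(round(100 / steps_per_epoch, 3))           # step 100 -> epoch ~0.039, matching 'epoch': 0.04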
The training loss curve looks normal, but during evaluation both the loss and the learning rate are logged as 0.
Logs:
[INFO|trainer.py:2057] 2024-04-25 17:21:02,619 >> Number of trainable parameters = 3,407,872
[INFO|callbacks.py:145] 2024-04-25 17:26:24,858 >> {'loss': 1.6616, 'learning_rate': 5.0000e-05, 'epoch': 0.00}
[INFO|callbacks.py:145] 2024-04-25 17:31:29,975 >> {'loss': 1.6778, 'learning_rate': 5.0000e-05, 'epoch': 0.00}
[INFO|callbacks.py:145] 2024-04-25 17:36:14,088 >> {'loss': 1.6495, 'learning_rate': 5.0000e-05, 'epoch': 0.01}
[INFO|callbacks.py:145] 2024-04-25 17:41:43,950 >> {'loss': 1.6450, 'learning_rate': 4.9999e-05, 'epoch': 0.01}
[INFO|callbacks.py:145] 2024-04-25 17:46:40,556 >> {'loss': 1.6643, 'learning_rate': 4.9999e-05, 'epoch': 0.01}
[INFO|callbacks.py:145] 2024-04-25 17:52:00,261 >> {'loss': 1.5792, 'learning_rate': 4.9998e-05, 'epoch': 0.01}
[INFO|callbacks.py:145] 2024-04-25 17:57:28,697 >> {'loss': 1.6068, 'learning_rate': 4.9998e-05, 'epoch': 0.01}
[INFO|callbacks.py:145] 2024-04-25 18:03:06,412 >> {'loss': 1.5097, 'learning_rate': 4.9997e-05, 'epoch': 0.02}
[INFO|callbacks.py:145] 2024-04-25 18:08:30,067 >> {'loss': 1.5304, 'learning_rate': 4.9996e-05, 'epoch': 0.02}
[INFO|callbacks.py:145] 2024-04-25 18:14:13,946 >> {'loss': 1.4504, 'learning_rate': 4.9995e-05, 'epoch': 0.02}
[INFO|callbacks.py:145] 2024-04-25 18:19:42,074 >> {'loss': 1.5191, 'learning_rate': 4.9994e-05, 'epoch': 0.02}
[INFO|callbacks.py:145] 2024-04-25 18:25:20,252 >> {'loss': 1.3741, 'learning_rate': 4.9993e-05, 'epoch': 0.02}
[INFO|callbacks.py:145] 2024-04-25 18:31:06,083 >> {'loss': 1.3848, 'learning_rate': 4.9991e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 18:36:26,379 >> {'loss': 1.4360, 'learning_rate': 4.9990e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 18:42:14,617 >> {'loss': 1.4140, 'learning_rate': 4.9989e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 18:46:53,814 >> {'loss': 1.4261, 'learning_rate': 4.9987e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 18:51:51,722 >> {'loss': 1.4251, 'learning_rate': 4.9985e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 18:56:52,623 >> {'loss': 1.4112, 'learning_rate': 4.9983e-05, 'epoch': 0.03}
[INFO|callbacks.py:145] 2024-04-25 19:01:55,584 >> {'loss': 1.3598, 'learning_rate': 4.9982e-05, 'epoch': 0.04}
[INFO|callbacks.py:145] 2024-04-25 19:07:13,635 >> {'loss': 1.4213, 'learning_rate': 4.9980e-05, 'epoch': 0.04}
[INFO|trainer.py:3614] 2024-04-25 19:07:13,638 >> ***** Running Evaluation *****
[INFO|trainer.py:3616] 2024-04-25 19:07:13,638 >> Num examples = 7323
[INFO|trainer.py:3619] 2024-04-25 19:07:13,639 >> Batch size = 2
[INFO|callbacks.py:145] 2024-04-25 21:25:17,654 >> {'loss': 0.0000, 'learning_rate': 0.0000e+00, 'epoch': 0.04}
[INFO|trainer.py:3305] 2024-04-25 21:25:17,655 >> Saving model checkpoint to saves/LLaMA3-8B/lora/train_2024-04-25-17-18-36/checkpoint-100
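Regardless of what the callback prints, the HF Trainer records the real evaluation metrics in trainer_state.json inside each checkpoint, so the saved eval loss can be verified directly (a minimal check; the path is the checkpoint directory from the log above):

import json

# Read the log history that the HF Trainer saves alongside each checkpoint.
path = "saves/LLaMA3-8B/lora/train_2024-04-25-17-18-36/checkpoint-100/trainer_state.json"
with open(path) as f:
    state = json.load(f)

# Evaluation entries carry 'eval_loss'; training entries carry 'loss'.
for entry in state["log_history"]:
    if "eval_loss" in entry:
        print(entry)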
Expected behavior
No response
System Info
transformers version: 4.29.1
Others
No response
Comment: This is a display issue in the logs; it does not affect training.
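For intuition, a minimal sketch of how such a display issue can arise (this is not the actual LLaMA-Factory callback): the HF Trainer passes evaluation metrics to on_log under keys like 'eval_loss', with no 'learning_rate' key at all, so a callback that unconditionally reads the training keys falls back to zero at evaluation steps, producing exactly the {'loss': 0.0000, 'learning_rate': 0.0000e+00} line above:

from transformers import TrainerCallback

class NaiveLogCallback(TrainerCallback):
    """Hypothetical callback reproducing the zero-loss display at eval steps."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}
        # Training logs contain 'loss' and 'learning_rate'; evaluation logs
        # contain 'eval_loss' and no 'learning_rate', so both default to 0 here.
        print({
            "loss": logs.get("loss", 0.0),
            "learning_rate": logs.get("learning_rate", 0.0),
            "epoch": logs.get("epoch"),
        })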