Loss and learning rate are 0 on the evaluation set #3457

Closed
1 task done
sly123197811 opened this issue Apr 26, 2024 · 1 comment
Labels
solved This problem has been already solved.

Comments

@sly123197811

Reminder

  • I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=1 python src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/wzb/yan/Meta-Llama-3-8B-Instruct \
    --finetuning_type lora \
    --template default \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_gpt4_zh \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --report_to none \
    --output_dir saves/LLaMA3-8B/lora/train_2024-04-25-15-45-51 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target q_proj,v_proj \
    --val_size 0.15 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --per_device_eval_batch_size 2 \
    --load_best_model_at_end True \
    --plot_loss True
The training loss curve looks normal, but during evaluation both the loss and the learning rate are logged as 0.
Logs below:
[INFO|trainer.py:2057] 2024-04-25 17:21:02,619 >> Number of trainable parameters = 3,407,872

[INFO|callbacks.py:145] 2024-04-25 17:26:24,858 >> {'loss': 1.6616, 'learning_rate': 5.0000e-05, 'epoch': 0.00}

[INFO|callbacks.py:145] 2024-04-25 17:31:29,975 >> {'loss': 1.6778, 'learning_rate': 5.0000e-05, 'epoch': 0.00}

[INFO|callbacks.py:145] 2024-04-25 17:36:14,088 >> {'loss': 1.6495, 'learning_rate': 5.0000e-05, 'epoch': 0.01}

[INFO|callbacks.py:145] 2024-04-25 17:41:43,950 >> {'loss': 1.6450, 'learning_rate': 4.9999e-05, 'epoch': 0.01}

[INFO|callbacks.py:145] 2024-04-25 17:46:40,556 >> {'loss': 1.6643, 'learning_rate': 4.9999e-05, 'epoch': 0.01}

[INFO|callbacks.py:145] 2024-04-25 17:52:00,261 >> {'loss': 1.5792, 'learning_rate': 4.9998e-05, 'epoch': 0.01}

[INFO|callbacks.py:145] 2024-04-25 17:57:28,697 >> {'loss': 1.6068, 'learning_rate': 4.9998e-05, 'epoch': 0.01}

[INFO|callbacks.py:145] 2024-04-25 18:03:06,412 >> {'loss': 1.5097, 'learning_rate': 4.9997e-05, 'epoch': 0.02}

[INFO|callbacks.py:145] 2024-04-25 18:08:30,067 >> {'loss': 1.5304, 'learning_rate': 4.9996e-05, 'epoch': 0.02}

[INFO|callbacks.py:145] 2024-04-25 18:14:13,946 >> {'loss': 1.4504, 'learning_rate': 4.9995e-05, 'epoch': 0.02}

[INFO|callbacks.py:145] 2024-04-25 18:19:42,074 >> {'loss': 1.5191, 'learning_rate': 4.9994e-05, 'epoch': 0.02}

[INFO|callbacks.py:145] 2024-04-25 18:25:20,252 >> {'loss': 1.3741, 'learning_rate': 4.9993e-05, 'epoch': 0.02}

[INFO|callbacks.py:145] 2024-04-25 18:31:06,083 >> {'loss': 1.3848, 'learning_rate': 4.9991e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 18:36:26,379 >> {'loss': 1.4360, 'learning_rate': 4.9990e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 18:42:14,617 >> {'loss': 1.4140, 'learning_rate': 4.9989e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 18:46:53,814 >> {'loss': 1.4261, 'learning_rate': 4.9987e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 18:51:51,722 >> {'loss': 1.4251, 'learning_rate': 4.9985e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 18:56:52,623 >> {'loss': 1.4112, 'learning_rate': 4.9983e-05, 'epoch': 0.03}

[INFO|callbacks.py:145] 2024-04-25 19:01:55,584 >> {'loss': 1.3598, 'learning_rate': 4.9982e-05, 'epoch': 0.04}

[INFO|callbacks.py:145] 2024-04-25 19:07:13,635 >> {'loss': 1.4213, 'learning_rate': 4.9980e-05, 'epoch': 0.04}

[INFO|trainer.py:3614] 2024-04-25 19:07:13,638 >> ***** Running Evaluation *****

[INFO|trainer.py:3616] 2024-04-25 19:07:13,638 >> Num examples = 7323

[INFO|trainer.py:3619] 2024-04-25 19:07:13,639 >> Batch size = 2

[INFO|callbacks.py:145] 2024-04-25 21:25:17,654 >> {'loss': 0.0000, 'learning_rate': 0.0000e+00, 'epoch': 0.04}

[INFO|trainer.py:3305] 2024-04-25 21:25:17,655 >> Saving model checkpoint to saves/LLaMA3-8B/lora/train_2024-04-25-17-18-36/checkpoint-100
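For reference, the 7,323 evaluation examples reported above come from the --val_size 0.15 hold-out split of the alpaca_gpt4_zh data. A minimal sketch of such a split using the datasets library (an illustration only; the JSON file name and seed are assumptions, and LLaMA-Factory's own splitting code may differ):

```python
# Illustrative sketch of a --val_size 0.15 style hold-out split.
# The data file name and seed are assumptions, not LLaMA-Factory internals.
from datasets import load_dataset

dataset = load_dataset("json", data_files="data/alpaca_gpt4_data_zh.json", split="train")
splits = dataset.train_test_split(test_size=0.15, seed=42)
print(len(splits["train"]), len(splits["test"]))  # eval split is roughly 15% of the data
```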

Expected behavior

No response

System Info

  • transformers version: 4.29.1
  • Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • Huggingface_hub version: 0.22.2
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

@hiyouga
Owner

hiyouga commented Apr 26, 2024

This is a display issue in the logs; it does not affect training.
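For context, the zeros most likely come from how the logging callback fills missing keys: during evaluation the Trainer's on_log only receives eval_* metrics (eval_loss, eval_runtime, ...), so train-only keys such as loss and learning_rate fall back to a default of 0. A minimal sketch of that behavior (an illustration, not the actual callbacks.py code):

```python
# Minimal sketch, not the actual LLaMA-Factory callback, showing why an
# evaluation log line can print loss 0.0 and learning_rate 0.0.
from transformers import TrainerCallback

class LogSketchCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}
        # During evaluation `logs` holds eval_loss / eval_runtime etc.,
        # so the train-only keys below fall back to their 0 defaults.
        print({
            "loss": logs.get("loss", 0.0),
            "learning_rate": logs.get("learning_rate", 0.0),
            "epoch": logs.get("epoch", 0.0),
        })
```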

@hiyouga hiyouga added the solved This problem has been already solved. label Apr 26, 2024
@hiyouga hiyouga closed this as completed Apr 26, 2024
@hiyouga hiyouga reopened this Apr 26, 2024
@hiyouga hiyouga closed this as completed Apr 29, 2024